Robots.txt - What is the proper format for a Crawl-delay for multiple user agents?

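Below is a sample robots.txt file that allows multiple user agents, with a separate crawl delay for each user agent. The Crawl-delay values are for illustration only and would be different in a real robots.txt file.

I have searched all over the web for a proper answer but could not find one. There are too many mixed suggestions, and I do not know which is the correct method.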

Questions:

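(1) Can each user agent have its own Crawl-delay? (I assume yes.)

(2) Where do you put the Crawl-delay line for each user agent: before or after the Allow / Disallow lines?

(3) Does there have to be a blank line between each user agent group?

References: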

http://www.seopt.com/2013/01/robots-text-file/

http://help.yandex.com/webmaster/?id=1113851#1113858

Essentially, I want to find out how the final robots.txt file should look, using the values in the sample below.

Thanks in advance.

# Allow only major search spiders    
User-agent: Mediapartners-Google
Disallow:
Crawl-delay: 11

User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: Adsbot-Google
Disallow:
Crawl-delay: 13

User-agent: Googlebot-Image
Disallow:
Crawl-delay: 14

User-agent: Googlebot-Mobile
Disallow:
Crawl-delay: 15

User-agent: MSNBot
Disallow:
Crawl-delay: 16

User-agent: bingbot
Disallow:
Crawl-delay: 17

User-agent: Slurp
Disallow:
Crawl-delay: 18

User-agent: Yahoo! Slurp
Disallow:
Crawl-delay: 19

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

(4) If I want to set all of the user agents to a crawl delay of 10 seconds, would the following be correct?

# Allow only major search spiders
User-agent: *
Crawl-delay: 10

User-agent: Mediapartners-Google
Disallow:

User-agent: Googlebot
Disallow:

User-agent: Adsbot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: Googlebot-Mobile
Disallow:

User-agent: MSNBot
Disallow:

User-agent: bingbot
Disallow:

User-agent: Slurp
Disallow:

User-agent: Yahoo! Slurp
Disallow:

# Block all other spiders
User-agent: *
Disallow: /

# Block Directories for all spiders
User-agent: *
Disallow: /ads/
Disallow: /cgi-bin/
Disallow: /scripts/

1 Answer

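(1) Can each user agent have its own Crawl-delay?

Yes. Each record, which is started by one or more User-agent lines, can have a Crawl-delay line. Note that Crawl-delay is not part of the original robots.txt specification. Including it is still harmless, because parsers that understand it will use it and the others will skip it, as the spec defines:

Unrecognised headers are ignored.

So older robots.txt parsers will simply ignore your Crawl-delay lines.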
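
As a rough illustration of per-record delays, the sketch below feeds two records from the question into Python's standard urllib.robotparser and reads back the delay for each bot. Using this particular parser is an assumption made for illustration; real crawlers run their own parsers and may treat Crawl-delay differently or ignore it.

```python
from urllib.robotparser import RobotFileParser

# Two records taken from the question; the delay values are illustrative only.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: bingbot
Disallow:
Crawl-delay: 17
"""


def delays(robots_txt, agents):
    """Return {agent: crawl delay} as seen by Python's robotparser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}


per_agent = delays(ROBOTS_TXT, ["Googlebot", "bingbot"])
print(per_agent)
```

Each bot gets the value from its own record, independently of the other records.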

(2) Where do you put the Crawl-delay line for each user agent: before or after the Allow / Disallow lines?

It doesn't matter.

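(3) Does there have to be a blank line between each user agent group?

Yes. Records have to be separated by one or more blank lines. See the original spec:

The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL).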
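
To make the separator rule concrete, here is a minimal sketch of how a strict parser that follows the original spec might split a file into records. The function name split_records is hypothetical, and some real parsers are more lenient (they start a new group at each User-agent line), so treat this as one possible reading of the rule.

```python
def split_records(robots_txt):
    """Split robots.txt text into records separated by blank lines."""
    records, current = [], []
    for raw in robots_txt.splitlines():
        if not raw.strip():
            # A truly blank line closes the current record.
            if current:
                records.append(current)
                current = []
            continue
        line = raw.split("#", 1)[0].strip()
        if line:  # comment-only lines are dropped and do not end a record
            current.append(line)
    if current:
        records.append(current)
    return records


with_blank = """\
User-agent: Googlebot
Disallow:
Crawl-delay: 12

User-agent: bingbot
Disallow:
Crawl-delay: 17
"""
without_blank = with_blank.replace("\n\n", "\n")

print(len(split_records(with_blank)))     # two separate records
print(len(split_records(without_blank)))  # everything runs together as one record
```

Without the blank line, a strict parser of this kind sees a single record, so the second bot's rules are no longer clearly its own.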

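(4) If I want to set all of the user agents to a crawl delay of 10 seconds, would the following be correct?

No. Bots look for a record that matches their user agent. Only if they find no such record do they fall back to the User-agent: * record. So in your example, all of the listed bots (such as Googlebot, MSNBot, Yahoo! Slurp, etc.) will have no Crawl-delay.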
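
The fallback behaviour can be checked with a small sketch, again assuming Python's urllib.robotparser as a stand-in for a real crawler's parser. SomeOtherBot is a hypothetical name for a bot that has no record of its own.

```python
from urllib.robotparser import RobotFileParser

# Shortened version of the layout in question (4).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10

User-agent: Googlebot
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot matches its own record, which has no Crawl-delay line.
googlebot_delay = parser.crawl_delay("Googlebot")
# A bot without a record of its own falls back to the "*" record.
fallback_delay = parser.crawl_delay("SomeOtherBot")

print(googlebot_delay, fallback_delay)
```

In this parser the listed bot gets no delay at all, while only the unlisted bot picks up the 10-second value.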

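Also note that you cannot have several records with User-agent: *:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

So parsers might look (if no other record matched) for the first record with User-agent: * and ignore the ones that follow. For your first example, that would mean that URLs beginning with /ads/, /cgi-bin/ and /scripts/ are not blocked.

And even if you have only one record with User-agent: *, those Disallow lines apply only to bots that match no other record. As your comment # Block Directories for all spiders suggests, you want these URL paths to be blocked for all spiders, so you would have to repeat the Disallow lines in every record.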
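
One way to apply this is to generate the file so that the shared lines are repeated in every record. The sketch below does that for two of the bots from the question and checks the result with Python's urllib.robotparser; the helper name build_robots_txt and the bot name SomeOtherBot are hypothetical, and the 10-second delay is the value from question (4).

```python
from urllib.robotparser import RobotFileParser

# Rules every allowed spider should obey, repeated in each record.
SHARED = (
    "Disallow: /ads/\n"
    "Disallow: /cgi-bin/\n"
    "Disallow: /scripts/\n"
    "Crawl-delay: 10\n"
)


def build_robots_txt(allowed_bots):
    """Build one record per allowed bot plus a single catch-all record."""
    records = [f"User-agent: {bot}\n{SHARED}" for bot in allowed_bots]
    records.append("User-agent: *\nDisallow: /\n")  # the only "*" record
    return "\n".join(records)  # joining adds a blank line between records


robots_txt = build_robots_txt(["Googlebot", "bingbot"])
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

ads_allowed = parser.can_fetch("Googlebot", "/ads/banner.html")
page_allowed = parser.can_fetch("Googlebot", "/index.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/index.html")
bing_delay = parser.crawl_delay("bingbot")
```

With this layout the listed bots are kept out of the three directories and get the delay, and every other bot is blocked by the single User-agent: * record.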

Answered 2024-06-02T16:12:28+08:00
