A robots.txt file lists the parts of a website that a web robot should or should not access. Website owners use robots.txt to control automated page requests from web robots and crawlers.
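For example, a simple robots.txt file (the paths here are hypothetical) might allow crawling of the whole site while keeping robots out of an admin area:

```
User-agent: *
Disallow: /admin/
Allow: /
```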
By default, the Rigor Content Check obeys robots.txt when it crawls a site to check for link health: while this setting is enabled, Rigor Content Checks will not visit any URL that is disallowed by robots.txt.
Web crawlers are not technically required to obey the rules in robots.txt, so Rigor gives users the option to configure Content Checks to ignore them.
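Rigor's internals aren't shown here, but as a minimal sketch of how a crawler typically makes this decision, Python's standard urllib.robotparser module can fetch a site's robots.txt and test whether a given user agent may request a URL. The site URL and user-agent string below are assumptions for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical site and user-agent string, for illustration only
robots_url = "https://example.com/robots.txt"
user_agent = "Rigor"

parser = RobotFileParser()
parser.set_url(robots_url)
parser.read()  # fetch and parse the site's robots.txt

# A crawler that obeys robots.txt skips any URL where can_fetch() is False
for url in ["https://example.com/", "https://example.com/admin/"]:
    if parser.can_fetch(user_agent, url):
        print(f"allowed:    {url}")
    else:
        print(f"disallowed: {url}")
```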
Should I disable ‘Obey Robots.txt’?
A site’s robots.txt file could prevent Rigor’s Content Check from crawling the entire site. If the starting URL for the Content Check is disallowed by robots.txt, this could keep the Content Check from running altogether.
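For example, a robots.txt file like the following (a common pattern, not specific to any site) disallows every path for all user agents, so a Content Check that obeys it could not fetch even its starting URL:

```
User-agent: *
Disallow: /
```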
If robots.txt is blocking or partially blocking a Content Check, you can:
- Edit robots.txt and add an allowance that lets Rigor crawl the site (Recommended):

  ```
  User-agent: Rigor
  Allow: /
  ```
- Uncheck ‘Obey Robots.txt’ on the Advanced tab when creating or editing a Content Check, which allows Rigor to ignore the rules set in the robots.txt file.