At addition time
Pages skipped at addition time are not added to the URL database. This can happen for two reasons:
- The URL doesn’t match any of your actions’ pathsToMatch.
- The URL matches one of your crawler’s exclusionPatterns.
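The two addition-time checks can be sketched as a small predicate. This is only an illustration: the function name, the example URLs, and the patterns are hypothetical, and Python's fnmatch stands in for whatever pattern matching the crawler actually uses, which may follow different glob rules.

```python
from fnmatch import fnmatchcase


def is_added_to_url_database(url, paths_to_match, exclusion_patterns):
    """Return True if the URL passes both addition-time filters.

    paths_to_match: patterns collected from all actions (hypothetical input shape).
    exclusion_patterns: the crawler-level exclusion patterns.
    fnmatchcase is an assumption used for illustration only.
    """
    # Reason 1: the URL matches none of the actions' pathsToMatch patterns.
    if not any(fnmatchcase(url, pattern) for pattern in paths_to_match):
        return False
    # Reason 2: the URL matches one of the exclusionPatterns.
    if any(fnmatchcase(url, pattern) for pattern in exclusion_patterns):
        return False
    return True


paths = ["https://example.com/docs/*"]
excluded = ["https://example.com/docs/private/*"]

print(is_added_to_url_database("https://example.com/docs/intro", paths, excluded))
print(is_added_to_url_database("https://example.com/docs/private/keys", paths, excluded))
print(is_added_to_url_database("https://example.com/blog/post", paths, excluded))
```

In this sketch only the first URL would be added; the second is excluded and the third matches no action.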
At retrieval time
Pages skipped at retrieval time are added to the URL database, retrieved, but not processed. Those are flagged “Ignored” in the Crawler dashboard. This can happen for these reasons:
- The robots.txt check didn’t allow this URL to be crawled.
- The URL is a redirect. The crawler adds the redirect target to the URL database but skips the rest of the page.
- The page’s HTTP status code is not 200.
- The media type isn’t one of the expected ones.
- The page contains a canonical link. The crawler adds the canonical target as a page to crawl, subject to the same addition-time filters, and then skips the current page.
- The robots meta tag in the HTML is set to noindex or none. For example, <meta name="robots" content="noindex"/>. You can override this behavior by setting the ignoreNoIndex parameter to true.
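The retrieval-time reasons above can be read as a sequence of checks on a fetched page. The following is a minimal sketch under stated assumptions, not the crawler's implementation: the FetchedPage type, its field names, the skip_reason function, the default set of expected media types, and the order of the checks are all hypothetical. Only the ignoreNoIndex behavior and the six reasons come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchedPage:
    """Hypothetical summary of what the crawler knows after retrieving a URL."""
    robots_txt_allowed: bool = True
    status_code: int = 200
    redirect_target: Optional[str] = None
    media_type: str = "text/html"
    canonical_url: Optional[str] = None
    robots_meta: str = ""  # content attribute of <meta name="robots">


def skip_reason(page, expected_media_types=frozenset({"text/html"}),
                ignore_no_index=False):
    """Return why a page would be flagged "Ignored", or None if it is processed."""
    if not page.robots_txt_allowed:
        return "robots.txt"
    if page.redirect_target is not None:
        return "redirect"  # the target is queued; the page itself is skipped
    if page.status_code != 200:
        return "status"
    if page.media_type not in expected_media_types:
        return "media type"
    if page.canonical_url is not None:
        return "canonical"  # the canonical target is queued instead
    # The robots meta content is a comma-separated list of directives.
    directives = {d.strip().lower() for d in page.robots_meta.split(",")}
    if not ignore_no_index and directives & {"noindex", "none"}:
        return "robots meta"
    return None


print(skip_reason(FetchedPage(robots_meta="noindex")))
print(skip_reason(FetchedPage(robots_meta="noindex"), ignore_no_index=True))
```

With ignore_no_index set to True, the sketch mirrors the ignoreNoIndex override: the same page is no longer skipped because of its robots meta tag.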