rajesh-magar
Controlling Search Engine Crawlers
For Better Indexation and Rankings
Inspired by Moz
Controlling Crawling and Indexing using
1. Robots.txt
2. Meta Robots tags
   • NoFollow Tag
3. X-Robots-Tag HTTP header
1. Some sample robots.txt files

Allow crawling of all content

    User-agent: *
    Disallow:

or

    User-agent: *
    Allow: /

Disallow crawling of the whole website

    User-agent: *
    Disallow: /

Disallow crawling of certain parts of the website

    User-agent: *
    Disallow: /calendar/
    Disallow: /junk/

Allowing access to a single crawler

    User-agent: Googlebot-news
    Disallow:

    User-agent: *
    Disallow: /

Allowing access to all but a single crawler

    User-agent: Unnecessarybot
    Disallow: /

    User-agent: *
    Disallow:
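To see how a crawler would read rules like the samples above, here is a minimal sketch using Python's standard `urllib.robotparser`. The `is_allowed` helper and the `example.com` URLs are assumptions made for illustration, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may crawl `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "Disallow crawling of certain parts of the website"
PARTIAL_BLOCK = """\
User-agent: *
Disallow: /calendar/
Disallow: /junk/
"""

# "Allowing access to a single crawler"
SINGLE_CRAWLER = """\
User-agent: Googlebot-news
Disallow:

User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    print(is_allowed(PARTIAL_BLOCK, "*", "https://example.com/calendar/2010"))
    print(is_allowed(PARTIAL_BLOCK, "*", "https://example.com/about.html"))
    print(is_allowed(SINGLE_CRAWLER, "Googlebot-news", "https://example.com/story"))
    print(is_allowed(SINGLE_CRAWLER, "Unnecessarybot", "https://example.com/story"))
```

Checking a few representative URLs this way before deploying a robots.txt change is a cheap guard against accidentally blocking the whole site.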
2. Using the robots meta tag

Page-specific approach to controlling how an individual page should be indexed and served to users in search results.

    <meta name="robots" content="all">
    <meta name="robots" content="noindex, nofollow">
    <meta name="robots" content="index, nofollow">
    <meta name="robots" content="noindex, follow">

    <meta name="GOOGLEBOT" content="all">
    <meta name="GOOGLEBOT" content="noindex, nofollow">
    <meta name="GOOGLEBOT" content="index, nofollow">
    <meta name="GOOGLEBOT" content="noindex, follow">
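When auditing pages, it can help to read these tags back out of the HTML. The following is a rough sketch built on the standard `html.parser` module; the class name, the `robots_directives` helper, and the default list of crawler names are assumptions for illustration, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect robots directives from <meta> tags, keyed by crawler name."""

    def __init__(self, crawler_names=("robots", "googlebot")):
        super().__init__()
        self.crawler_names = crawler_names
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)  # attribute names arrive lowercased
        name = (attrs.get("name") or "").lower()
        if name not in self.crawler_names:
            return
        content = attrs.get("content") or ""
        found = {d.strip().lower() for d in content.split(",") if d.strip()}
        self.directives.setdefault(name, set()).update(found)


def robots_directives(html):
    """Return {crawler name: set of directives} for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = (
        '<head><meta name="robots" content="noindex, follow">'
        '<meta name="GOOGLEBOT" content="index, nofollow"></head>'
    )
    print(robots_directives(page))
```

Names and directives are lowercased so that `GOOGLEBOT` and `googlebot` are treated as the same crawler.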
3. X-Robots-Tag HTTP header

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: noindex
    (…)

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: noarchive
    X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
    (…)

    HTTP/1.1 200 OK
    Date: Tue, 25 May 2010 21:42:43 GMT
    (…)
    X-Robots-Tag: googlebot: nofollow
    X-Robots-Tag: otherbot: noindex, nofollow
    (…)
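The header examples above mix three shapes: plain directives, a directive with its own value (`unavailable_after`), and directives prefixed with a crawler name. A minimal sketch of how one might group them per crawler follows. The function name and the `"*"` key for "all crawlers" are assumptions, and only `unavailable_after` is treated as a valued directive here, so this is not a complete parser.

```python
from collections import defaultdict

# Directives written as "name: value". A leading "name:" of this kind
# must not be mistaken for a crawler-name prefix.
VALUED_DIRECTIVES = {"unavailable_after"}


def parse_x_robots_tag(header_values):
    """Group X-Robots-Tag header values by crawler name ("*" means all crawlers)."""
    rules = defaultdict(list)
    for value in header_values:
        agent = "*"
        head, sep, tail = value.partition(":")
        if sep and head.strip().lower() not in VALUED_DIRECTIVES:
            # "googlebot: nofollow" -> crawler prefix plus directives
            agent, value = head.strip().lower(), tail
        name = value.partition(":")[0].strip().lower()
        if name in VALUED_DIRECTIVES:
            # Keep "name: value" whole, since dates contain colons.
            rules[agent].append(value.strip())
        else:
            rules[agent].extend(
                d.strip().lower() for d in value.split(",") if d.strip()
            )
    return dict(rules)


if __name__ == "__main__":
    print(parse_x_robots_tag([
        "noarchive",
        "unavailable_after: 25 Jun 2010 15:00:00 PST",
    ]))
    print(parse_x_robots_tag([
        "googlebot: nofollow",
        "otherbot: noindex, nofollow",
    ]))
```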
Practical implementation of X-Robots-Tag with Apache

    <Files ~ "\.(png|jpe?g|gif)$">
      Header set X-Robots-Tag "noindex"
    </Files>

    <Files ~ "\.pdf$">
      Header set X-Robots-Tag "noindex, nofollow"
    </Files>
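To make it clearer which files the two Apache rules catch, the same matching logic can be sketched in Python. This is only an illustration of the regular expressions, not how Apache itself evaluates them; `x_robots_tag_for` and the sample paths are hypothetical.

```python
import posixpath
import re

# Same patterns and header values as the two <Files> blocks.
RULES = [
    (re.compile(r"\.(png|jpe?g|gif)$"), "noindex"),
    (re.compile(r"\.pdf$"), "noindex, nofollow"),
]


def x_robots_tag_for(url_path):
    """Return the X-Robots-Tag value for a path, or None if no rule matches."""
    filename = posixpath.basename(url_path)  # <Files> matches on the file name
    for pattern, header_value in RULES:
        if pattern.search(filename):
            return header_value
    return None


if __name__ == "__main__":
    for path in ("/img/logo.png", "/docs/report.pdf", "/index.html"):
        print(path, "->", x_robots_tag_for(path))
```

This approach suits images and PDFs because those file types cannot carry a robots meta tag of their own.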
Best Practices
1. Content that isn't ready yet
   A. Large quantity? Use robots.txt
   B. Small quantity? Use the meta tag
2. Dealing with duplicate or thin content
   A. Probably use rel=canonical
   B. Disallow in the robots.txt file if crawl budget is an issue
3. Passing link equity without appearing in search results
   A. Meta robots: noindex, follow
   B. Do not disallow in robots.txt
4. Search results-type pages
   A. Make the most common/popular ones into category-style or landing pages with unique value
   B. Disallow in robots.txt (if you are sure about the post-action)
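Point 3 matters because a crawler that is blocked by robots.txt never fetches the page, so it never sees the page's noindex tag. A small sketch of a check for that conflict is below. The function name, the shape of the `pages` mapping, and the sample URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def find_hidden_noindex(robots_txt, pages, user_agent="*"):
    """Return URLs whose noindex tag a crawler could never read.

    `pages` maps a URL to the content value of its robots meta tag.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hidden = []
    for url, meta_content in pages.items():
        directives = {d.strip().lower() for d in meta_content.split(",")}
        # noindex only works if the crawler is allowed to fetch the page.
        if "noindex" in directives and not parser.can_fetch(user_agent, url):
            hidden.append(url)
    return hidden


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /junk/\n"
    pages = {
        "https://example.com/junk/old.html": "noindex, follow",
        "https://example.com/tags/seo.html": "noindex, follow",
    }
    print(find_hidden_noindex(robots, pages))
```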
Which brings us to the information overload of the web
Source: http://www.google.com/insidesearch/howsearchworks/thestory/index.html
YES INDEED...
"The internet is much, much bigger than people think."
Still, 95% of the web is completely invisible to your beloved Google, Yahoo, and Bing; in fact, to any search engine available on the planet.
Sounds hard to believe... but that's what it is!
Source: http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/
Friendly References:
1. https://developers.google.com/webmasters/control-crawl-index/
2. https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
3. https://support.google.com/webmasters/answer/1061943
4. https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday
And most important
5. http://www.cbsnews.com/news/new-search-engine-exposes-the-dark-web/
Big Thanks: https://moz.com/blog/controlling-search-engine-crawlers-for-better-indexation-and-rankings-whiteboard-friday