Thursday, May 24, 2012

How to control access of the web crawlers or web robots to your site

There are several reasons to control which web robots (web crawlers) can access your site. As much as you want Googlebot to visit your site, you don’t want spam bots to come and collect private information from it. Not to mention that a robot uses your website’s bandwidth when it crawls! In this post I explain how you can control robots’ access to your site with a simple ‘robots.txt’ file.

What are web robots or web spiders?
Web robots (also known as bots, web spiders, web crawlers, or ants) are programs that traverse the World Wide Web in an automated manner. Search engines (like Google, Yahoo, etc.) use web crawlers to index web pages so that their search results stay up to date.


Why use a ‘robots.txt’ file?
Googlebot may be crawling your site to provide better search results, but at the same time spam bots may be collecting personal information such as email addresses for spamming purposes. If you want to control the access of web crawlers to your site, you can do so with a “robots.txt” file. Keep in mind that robots.txt is a request, not an enforcement mechanism: well-behaved crawlers such as Googlebot honor it, but a spam bot is free to ignore it.

How do I create a ‘robots.txt’ file?
‘robots.txt’ is a plain text file. Use any text editor to create it.

‘robots.txt’ file format
Each entry (rule) in the robots.txt file is a ‘field’ and ‘value’ pair, separated by a colon.
<field>:<value>
A simple robots.txt file uses the following three fields:
User-agent: the name of the web robot that the rules below it apply to.
Disallow: the URL path you want to block the robot from accessing.
Allow: the URL path you want to allow the robot to access.
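
To make the field and value format concrete, here is a minimal Python sketch that splits one robots.txt line into its two parts. The function name parse_robots_line is made up for this illustration, and the sketch ignores everything beyond the basic format described above.

```python
def parse_robots_line(line):
    """Split one robots.txt line into a (field, value) pair, or return None."""
    # Anything after '#' is a comment in robots.txt, so drop it.
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
        return None  # blank line, comment-only line, or malformed entry
    field, value = line.split(":", 1)
    # Field names are case-insensitive, so normalize them to lower case.
    return field.strip().lower(), value.strip()


print(parse_robots_line("User-agent: *"))
print(parse_robots_line("Disallow: /private"))
```

Real crawlers do more than this (grouping rules by robot, for example), so treat it only as a picture of what a single line means.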

Examples

The following will stop all robots from crawling your site (‘*’ means all robots and ‘/’ is the root directory).

User-agent: *
Disallow: /

The following will stop all robots from crawling the ‘/private’ directory. Rules are matched by prefix, so this blocks every URL path that begins with ‘/private’.

User-agent: *
Disallow: /private
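
If you want to check a rule before uploading it, one option is the robots.txt parser in Python's standard library (urllib.robotparser). The sketch below feeds it the ‘/private’ rule from above. The helper is_allowed, the robot name "ExampleBot" and the page paths are made up for this illustration, and other crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule set from the example above.
rules = """\
User-agent: *
Disallow: /private
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_allowed(path, agent="ExampleBot"):
    """Return True if the rules let `agent` fetch `path` (hypothetical helper)."""
    return parser.can_fetch(agent, "http://www.yoursite.com" + path)


print(is_allowed("/index.html"))         # outside /private
print(is_allowed("/private/data.html"))  # inside /private
```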

The following stops Googlebot-Image from crawling your images, which keeps them out of Google image search. Use this to save bandwidth if you don’t want your images to be available in Google image search. Read the Reduce Bandwidth Usage post to learn more.

User-agent: Googlebot-Image
Disallow: /

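The following will block all robots from crawling your site except Googlebot. Each group of rules starts with its own User-agent line, and the groups are separated by a blank line.

User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /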
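
As a rough check of the "everyone except Googlebot" rules, the sketch below runs them through Python's standard urllib.robotparser module. The robot name "SomeOtherBot" and the page URL are made up for this illustration, and the two groups are written with a blank line between them, as is conventional.

```python
from urllib.robotparser import RobotFileParser

# Block every robot, then make an exception for Googlebot.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "http://www.yoursite.com/page.html"
googlebot_ok = parser.can_fetch("Googlebot", page)
other_bot_ok = parser.can_fetch("SomeOtherBot", page)

print("Googlebot allowed:", googlebot_ok)
print("SomeOtherBot allowed:", other_bot_ok)
```

A robot uses the group that names it if there is one, and falls back to the ‘*’ group otherwise, which is why Googlebot gets through while the other robot does not.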

Where to put the robots.txt file?

Put the robots.txt file in the root directory of your website, so that it is reachable at www.yoursite.com/robots.txt and not in a sub-directory like www.yoursite.com/sub-directory/robots.txt. In most cases this will be the “public_html” directory of your site.
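
Because robots look for the file only at the root, the robots.txt address for any page can be worked out from the page's own address. Here is a minimal Python sketch of that idea. The function name robots_txt_url and the sample page address are made up for this illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yoursite.com/sub-directory/page.html"))
```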
