The robots.txt file

by Salim Benhouhou IT Support
Please note: this article was written by the Vodahost hosting company.

In earlier posts we have looked at length at how to give your website the best possible exposure on Google and the other search engines, and at the best ways to SEO (Search Engine Optimise) your site. It has been a great deal of work on your part to make sure that your website is accessible to Google and its Googlebot, that there are plenty of keywords, plenty of quality links and a sitemap for it to follow. Today, however, we are not making your website more accessible to the Googlebot and the other search engine spiders. Quite the opposite...

Today we will be discussing the unthinkable: how to keep search engine spiders off your website, or restrict them so they can only look at (or index) parts of it. It may feel strange to have done so much SEO work only to hide it, or parts of it. In this article we will be looking at the anti-sitemap: the robots.txt file (or “Robots Exclusion Standard / Robots Exclusion Protocol” if you are a fan of particularly long phrases...).

GOOD BOTS
The robots.txt file is the opposite of your sitemap: it exists to tell cooperating web spiders where they cannot go, keeping them away from all or part of your website. It was started in the summer of 1994 by agreement of the members of the robots mailing list because, quite simply, it seemed like a good idea. It was made more popular by AltaVista; the other big search engines caught on in the following years and started using the robots.txt standard too.

While it may seem that we are hurting ourselves by not letting web crawlers/spiders/robots look at our website in its entirety, this is not the case. There may be pages on your website that, while essential, do nothing for its SEO. It might be a sales page that does not contain any of your keywords (maybe only “Click Here To Confirm” or “Enter Your Credit Card Details”), and letting a robot index such pages can mean a worse ranking on Google (more content; fewer keywords).

The information that you should be restricting using the robots.txt file is information that does not help in any way towards the SEO of your website, but we’ll discuss that again later.

So, let’s create a robots.txt file for your website...
It’s a simple plain text file (.txt), so we can create one using the most basic tools on your home computer. You should note that each domain needs its own robots.txt file, and that includes sub-domains. A separate robots.txt file should be created for “yourwebsite.com”, “about.yourwebsite.com” and “waffles.yourwebsite.com”.
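To see why each sub-domain needs its own file, it helps to work out where a robot will look for it. The snippet below is a minimal sketch in Python 3 using only the standard library; the function name robots_url is my own invention and the example addresses are the made-up ones from above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address that covers the given page."""
    parts = urlsplit(page_url)
    # Keep the scheme and the full host name (sub-domain included),
    # and swap whatever path the page had for /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for page in (
    "http://yourwebsite.com/index.html",
    "http://about.yourwebsite.com/team/",
    "http://waffles.yourwebsite.com/menu?item=1",
):
    print(page, "->", robots_url(page))
```

Each of the three pages ends up pointing at a different robots.txt address, one per host name.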

1) Open up a text editor...
For example: Notepad in Windows; TextEdit in Mac OS X (in TextEdit, switch the document to plain text first via Format > Make Plain Text)

2) Start writing your robots.txt file...
Writing your robots.txt file is very straightforward. The first thing you do is specify which web crawler/spider/robot the text applies to. This is done using the “User-agent” statement. A “*” is a wildcard and it means EVERYBODY (all cooperating web crawlers/spiders/robots). You then make a “Disallow” statement telling the web crawler/spider/robot where it is not allowed to go.

As a result, the simplest form of the robots.txt file is as follows:

------
User-agent: *
Disallow: /
------

The above robots.txt file entry tells ALL cooperating web crawlers, spiders and robots to avoid ALL of your website. Obviously this is something you are never going to do... You can also do the exact opposite. The robots.txt entry below, with nothing after “Disallow:”, allows ALL cooperating web crawlers/spiders/robots to visit ALL of your website.

------
User-agent: *
Disallow:
------
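If you want to convince yourself that these two tiny files really do opposite things, you can feed them to a robots.txt parser. This is only a sketch, assuming Python 3 and its standard urllib.robotparser module; the helper parse_rules and the bot name AnyBot are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# "Disallow: /" keeps everybody out of everything.
block_all = parse_rules("User-agent: *\nDisallow: /\n")
# An empty "Disallow:" restricts nothing.
allow_all = parse_rules("User-agent: *\nDisallow:\n")

print(block_all.can_fetch("AnyBot", "/index.html"))  # False
print(allow_all.can_fetch("AnyBot", "/index.html"))  # True
```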

Using the robots.txt file you can keep cooperating robots away from specific files too, as in the example below:

------
User-agent: *
Disallow: /directory/file.html
------

Using the robots.txt files you can tell cooperating web crawlers/ spiders/ robots to stay away from one or several directories...

------
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
------
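Before uploading a file like this, it can be handy to run a list of your own pages through it and see which ones come back as blocked. The sketch below assumes Python 3's standard urllib.robotparser module; the rules combine the file and directory examples above, and the page paths and the helper name blocked_paths are invented for the test.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /directory/file.html
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""


def blocked_paths(rules: str, paths: list[str], agent: str = "*") -> list[str]:
    """Return the paths that a cooperating robot would stay away from."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


pages = [
    "/index.html",
    "/directory/file.html",
    "/directory/other.html",
    "/images/logo.png",
]
print(blocked_paths(RULES, pages))
# ['/directory/file.html', '/images/logo.png']
```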

3) In this way, you can write more specific robots.txt documents...

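In the example below, I want to keep the Googlebot out of my /images/ directory, and I also want to keep Yahoo!’s bot (which identifies itself as Slurp) out of the /videos/ directory. In addition I want to keep ALL cooperating bots out of my /cgi/ and /tmp/ directories. As a final stipulation, I also want VodaBot (okay, I made this one up) to stay away from an image file called pointless.jpg which is in my /images/ directory. One thing to watch: a bot that finds a section naming it follows only that section and ignores the “*” section, so the /cgi/ and /tmp/ lines have to be repeated for every bot that has a section of its own.

------
User-agent: Googlebot
Disallow: /images/
Disallow: /cgi/
Disallow: /tmp/

User-agent: Slurp
Disallow: /videos/
Disallow: /cgi/
Disallow: /tmp/

User-agent: *
Disallow: /cgi/
Disallow: /tmp/

User-agent: VodaBot
Disallow: /images/pointless.jpg
Disallow: /cgi/
Disallow: /tmp/
------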
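How a robot picks its section is easy to get wrong, so here is a small check, assuming Python 3's standard urllib.robotparser module. In this sketch the Googlebot section deliberately lists only /images/, to show that a bot with its own section does not also pick up the rules written for “*”. The file paths are made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot
Disallow: /images/

User-agent: *
Disallow: /cgi/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Googlebot has its own section, so only that section applies to it.
print(parser.can_fetch("Googlebot", "/images/photo.jpg"))  # False
print(parser.can_fetch("Googlebot", "/tmp/cache.html"))    # True

# VodaBot is not named here, so it falls back to the "*" section.
print(parser.can_fetch("VodaBot", "/tmp/cache.html"))      # False
print(parser.can_fetch("VodaBot", "/images/photo.jpg"))    # True
```

If Googlebot should stay out of /tmp/ as well, the /tmp/ line has to appear inside the Googlebot section too.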

Finally, you will note that while the fictitious VodaBot cannot access the file pointless.jpg it can access the rest of my /images/ directory ... but what if I wanted it the other way round? What if I wanted the excellently named VodaBot to NOT be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”? Then I would use an Allow statement in my robots.txt file.

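------
User-agent: VodaBot
Allow: /images/meaning-of-life.jpg
Disallow: /images/
------

Note that crawlers following the current standard, Googlebot among them, apply the most specific (longest) matching rule, so for them the order of Allow and Disallow does not matter. Some older parsers stop at the first rule that matches, so listing the Allow line before the broader Disallow line is the safer order.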
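How Allow and Disallow interact depends on the robot reading the file. Under the current standard (RFC 9309), which Googlebot follows, the longest matching path wins and Allow wins a tie. The function below is my own minimal sketch of that one rule in Python 3; it is not any search engine's actual code, it ignores wildcards, and the name is_allowed is made up.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply longest-match precedence to ("allow" | "disallow", prefix) pairs."""
    best_len = -1
    allowed = True  # a path that matches no rule is allowed
    for kind, prefix in rules:
        # An empty prefix restricts nothing; non-matching rules are skipped.
        if not prefix or not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_goes_to_allow = len(prefix) == best_len and kind == "allow"
        if longer or tie_goes_to_allow:
            best_len = len(prefix)
            allowed = kind == "allow"
    return allowed


rules = [
    ("disallow", "/images/"),
    ("allow", "/images/meaning-of-life.jpg"),
]
print(is_allowed("/images/meaning-of-life.jpg", rules))  # True
print(is_allowed("/images/pointless.jpg", rules))        # False
print(is_allowed("/index.html", rules))                  # True
```

Because the decision depends on the length of the matching path and not on position, reversing the two rules gives the same answers. Parsers that stop at the first matching line behave differently, so it is worth testing against the robots you care about.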

You should also be careful with the trailing “/”, because rules are matched as prefixes of the URL path. “Disallow: /images/” covers only what is inside the /images/ directory, while “Disallow: /images” (without the “/” at the end) covers every path that begins with “/images”: the /images/ directory itself, but also /images.html or /images-old/. Leaving off the trailing “/” therefore blocks more than you probably intended.
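Rules are matched as prefixes of the URL path, and the effect of the trailing “/” is easiest to see side by side. This is a minimal sketch assuming Python 3's standard urllib.robotparser module; the three paths and the helper name blocked are invented for the comparison.

```python
from urllib.robotparser import RobotFileParser

PATHS = ["/images/photo.jpg", "/images.html", "/images-old/a.png"]


def blocked(disallow_value: str) -> list[str]:
    """List which of PATHS a single Disallow rule keeps robots away from."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return [p for p in PATHS if not parser.can_fetch("*", p)]


# With the trailing slash only the directory's contents are covered.
print(blocked("/images/"))
# ['/images/photo.jpg']

# Without it, everything whose path starts with "/images" is covered.
print(blocked("/images"))
# ['/images/photo.jpg', '/images.html', '/images-old/a.png']
```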

Have a look at Wikipedia’s robots.txt file ( http://en.wikipedia.org/robots.txt ) as an example. It uses comments (lines starting with the # symbol) to explain how the file works. This is a great resource if you’re writing your first robots.txt file.

4) Save and upload...
Save your document in plain text format as robots.txt (all lowercase), making sure that the extension of the text document is .txt. The file can then be uploaded straight to the root (home) directory of the website it applies to; robots only look for it there, so a copy in a sub-directory is ignored.
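If you would rather generate the file than type it by hand, the saving step can be sketched in a few lines of Python 3. Everything here is an assumption made for illustration: the rules, the helper name write_robots, and the temporary folder standing in for your site's root directory. Uploading is still done with your usual FTP or hosting tools.

```python
import tempfile
from pathlib import Path

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""


def write_robots(site_root: Path, rules: str = RULES) -> Path:
    """Write the rules as plain text to <site_root>/robots.txt."""
    target = site_root / "robots.txt"  # the name must be exactly this
    with open(target, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(rules)
    return target


# A temporary folder plays the part of the local copy of the site root.
with tempfile.TemporaryDirectory() as tmp:
    saved = write_robots(Path(tmp))
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```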

BAD BOTS
The robots.txt file is a double-edged sword, however. You will notice that I keep referring to “cooperating” spiders. Many people assume that the robots.txt file can be used to hide parts of their website from the search engines. I cannot stress enough how wrong this is.

For most of its history the robots.txt protocol had no official standards body behind it (it was only published by the IETF as RFC 9309 in 2022), and following it has always been voluntary. There are very, very many search engines out there on the Internet and each has its own crawler, spider or robot... Each of these must be programmed to follow the instructions laid out in your robots.txt document. Imagine if a crawler or spider was programmed to visit ONLY the links that the robots.txt told it not to visit. There is nothing to stop it doing this.
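To make the point concrete: the robots.txt file is an ordinary public text file, so the paths you list in it can be read back by anyone, cooperating or not. The snippet below is a deliberately simple sketch in Python 3 of how little effort that takes; the function name disallowed_paths and the sample rules are made up.

```python
def disallowed_paths(robots_text: str) -> list[str]:
    """Collect every path named on a Disallow line, whatever the section."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


SAMPLE = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/   # not really private at all
"""

print(disallowed_paths(SAMPLE))
# ['/cgi-bin/', '/images/', '/tmp/', '/private/']
```

Nothing in the file enforces anything; a list like this is a request to well-behaved robots and a signpost for everyone else.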

Any parts of your website that you do not want to be visible to anybody should:
(a) Not be uploaded to your website at all
(b) Be password protected

Of these two options, (a) is by far the most effective.

The robots.txt file is not there for security in any way. It is there to improve the Search Engine Optimization of your site, to make sure all the hard work that you have done SEOing your website is put to the best possible use. It is there to stop Googlebot finding things that would hurt the SEO of your website or that are pointless as far as the theme or content of your website goes.



Comments

Pierre Fricker Senior   Web Illiterate
Salim, Thanks for placing this article. I didn't know about the robots.txt file. Don't know where I could use right now, but in due time I might use it.
Jul 2nd 2011 07:17   
Barry Reeves Committed   Software Developer
Salim,
Excellent article on the robots.txt file. I've been using this file to block certain folders on my sites for years and you still pointed out a couple of things I was not familiar with. Thanks for sharing.
Barry
Jul 2nd 2011 07:36   
Salim Benhouhou Professional   IT Support
you are welcome
Jul 2nd 2011 09:39   
Mohann Krish Advanced   
In the 3rd point, I think Allow is used for GoogleBot only (and perhaps for the fictitious VodaBot too!). If the bots are NOT to be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”, we can use: Disallow: /-images/meaning-of-life.jpg i.e., "images" directory with a 'minus' sign.
Jul 2nd 2011 09:54   
Philippe Moisan Magnate II   Tutorial videos, writing
Salim, please top Mohann's comment. When a comment adds info to an article, it deserves the "top it" treatment.
Jul 2nd 2011 10:03   
Philippe Moisan Magnate II   Tutorial videos, writing
I will repost the article on my blog, with a link back to here, and of course your name as the person who posted it first
Jul 2nd 2011 10:05   
Fred Mugone Magnate I   Health, Wellness, e-Business
Salim, a very enlightening article. Your articles always teach me tech stuff I was not aware of. With the kind of knowledge that you have, you may just end up being one of the admins over here at apsense in the near or not so far off future.
Jul 2nd 2011 10:46   
Salim Benhouhou Professional   IT Support
@philippe thank you
@ fred i don't think so there a lot of apsense members that are very good like philippe i think he deserves to be an apsense admin . but i still have a lot of work to do .
Jul 2nd 2011 10:50   
Philippe Moisan Magnate II   Tutorial videos, writing
LOL @salim if Wincer ever decided to make me admin, it would be WW III at APSense :)
Jul 2nd 2011 11:21   
Salim Benhouhou Professional   IT Support
i think you deserve to be an apsense admin
Jul 2nd 2011 11:23   
Philippe Moisan Magnate II   Tutorial videos, writing
Being known to hunt for scammers and spammers is not my cup of tea, Salim
Jul 2nd 2011 12:52   
Mohann Krish Advanced   
@Phil: That's not the only job of admins, I am sure! I think what Salim means is that an admin should do his/her job as well as be polite, friendly and helpful (like you). Goes without saying that decent language can be a great virtue. Without digressing further, I would say that Salim should come up with an article on Sitemaps too which would help all understand its importance.
Jul 2nd 2011 13:10   
Salim Benhouhou Professional   IT Support
@mohann yes that is right that is what i meant . and for sitemaps i will try
Jul 2nd 2011 14:26   
MONEY | Betting Gambling Trading Professional   FOLLOW4FOLLOW
Thanks, the necessary information. Has added in bookmarks)
Jul 4th 2011 20:17   
Philippe Moisan Magnate II   Tutorial videos, writing
Uh, $$$$, it's your choice, of course, but you're not branding yourself at all, cause there tens of thousands of members at APSense, and we all want to make $$$$$$$$$$$$$$
Jul 4th 2011 20:29   