Your HTML

Posted by Tammy Muir
Jun 25, 2007
Right the First Time:
The robots.txt file:

One often overlooked file that all Web sites should have is robots.txt. While not a requirement, it is a recommended file because ALL robots that index the Web are supposed to request this file before they index any document found on your Web site. If this file is not found on your server when a robot visits, your Web server returns a 404 (Page Not Found) error or even a redirect to some other page. Robots are not supposed to follow a redirect or read a 404 (Page Not Found) error page when requesting robots.txt, but that doesn't mean they won't, which can cause problems. Plus it makes little sense to have your server turn out a 404 (Page Not Found) error every single time a robot visits the site. In one day Google requested this file from our server 18 times!

To prevent possible problems with robots the best thing you can do is provide this simple robots.txt file.

The Standard for Robots Exclusion (SRE) was first proposed in 1994, as a mechanism for keeping robots out of unwanted areas of your Web site. These areas may include the following:

sensitive "for your eyes only" material
infinite URL spaces in which robots could get trapped ("black holes")
resource intensive URL spaces, e.g. dynamically generated pages.
documents which would attract unmanageable traffic, e.g. erotic material.
documents which could represent a site unfavorably, e.g. bug archives.
documents which aren't useful for world-wide indexing, e.g. local information.

Even if you do not wish to restrict robots from specific areas or files on your server, it is still recommended that you provide this file and simply instruct robots that it is OK to index whatever they find.

WARNING UPDATE!!! (December 11, 2003)

There is a major flaw in Google's robot when it comes to the robots.txt file so you MUST follow these instructions or your home page could be dropped from Google's index.

The construction of the robots.txt file is actually very simple. To create a robots.txt file, all you need is a simple ASCII text editor like Notepad (NOT WordPad). Create a text file called robots.txt. Here are a couple of examples of what you might place in this file:

To allow all robots complete access
User-agent: *
Disallow: /azr94v2hh2lg/

Some say you can create an empty robots.txt file instead of the above, or leave the "Disallow:" area empty, but if you do you risk being dropped from Google. We have confirmed at least one member who has fallen victim to this, so please follow these instructions.

So why did we enter "/azr94v2hh2lg/" for the Disallow: area? This tells Google, as well as any other robot type search engine, to stay out of your "/azr94v2hh2lg/" directory. Of course you do not have this directory, and that is our point. You can replace "/azr94v2hh2lg/" with any other directory name you wish, just as long as that directory is not and never will be part of your Web site. You have been warned....
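
If you would rather generate the file than type it, the idea above can be sketched in a few lines of Python. This is only a sketch under our own assumptions: the function names are hypothetical, and the placeholder directory is simply the one used in this guide.

```python
from pathlib import Path

# Placeholder from this guide; any directory name that is not and
# never will be part of your site works equally well.
PLACEHOLDER_DIR = "/azr94v2hh2lg/"


def build_robots_txt(disallowed_dirs):
    """Return robots.txt text applying the given Disallow rules to all robots."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in disallowed_dirs]
    return "\n".join(lines) + "\n"


def write_robots_txt(folder, disallowed_dirs):
    """Save the rules as a lowercase, plain ASCII robots.txt inside folder."""
    target = Path(folder) / "robots.txt"
    target.write_text(build_robots_txt(disallowed_dirs), encoding="ascii")
    return target
```

Calling build_robots_txt([PLACEHOLDER_DIR]) produces the same two lines shown in the "allow all robots" example. The saved file still has to be uploaded to your Web server by hand.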

If you would like a copy of the robots.txt above click here and save it to your local computer. Then upload this file to your Web server as instructed below (see Where does the robots.txt file go?: for more information).

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
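
To see how a well-behaved robot would read these rules, you can test them locally with the robots.txt parser in Python's standard library. This is a rough check only: it shows what a parser that follows the exclusion standard decides, not what any particular search engine actually does. The helper name and the sample URLs are our own.

```python
from urllib.robotparser import RobotFileParser

# The "exclude all robots from part of the server" example from this guide.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""


def allowed(rules_text, url, agent="*"):
    """Return True if a standards-following robot may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())  # parse the text, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/cgi-bin/search.cgi", "/private/notes.html"):
        url = "http://www.somesite.com" + path
        print(path, "allowed" if allowed(RULES, url) else "blocked")
```

Rules match by path prefix, so everything under a listed directory is blocked and everything else stays open to robots.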

Where does the robots.txt file go?:
The robots.txt file must be placed at the root ("/") of your HTML document tree. Example:

Correct:
http://www.somesite.com/robots.txt

NOT Correct:
http://www.somesite.com/some_folder/robots.txt

PLEASE NOTE!!! The filename robots.txt is case sensitive and MUST be in lowercase. If you name this file RObOTS.txt or anything other than robots.txt it will be incorrect and will NOT WORK.
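
The location rule is easy to express in code. The short Python sketch below (function names are hypothetical) works out where a robot will look for the file given any page on your site, and checks whether a given address is a valid place for it.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt address for the site serving page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_valid_location(candidate_url):
    """True only for a lowercase robots.txt at the root of the site."""
    return urlsplit(candidate_url).path == "/robots.txt"
```

For example, is_valid_location accepts http://www.somesite.com/robots.txt but rejects both the copy in some_folder and the mixed-case RObOTS.txt.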

Uploading the robots.txt to your Web server:

Unfortunately there are hundreds if not thousands of different programs you can use to upload files to your Web server. The easiest and best way, in our opinion, is to use a specialized program called an FTP (File Transfer Protocol) client. You might want to ask your hosting provider to assist you in selecting an FTP program. FTP programs allow you to select the file(s) you wish to upload and then just drag them to the location on your Web server where you want them to be.

We realize however that many of you are using HTML editors that try to do all this for you (which usually only confuses things for you). Since there are so many programs out there and different operating systems, the only thing we can do is provide a few links to some of the most popular editors and Web hosting sites:


Help with FrontPage. Or how about step-by-step instructions we posted in the Discussion Forum.
Help with Yahoo's Easy Upload PageBuilder and SiteBuilder. Or you can try this link.
Help with Macromedia's Dreamweaver. Download their PDF documentation.
Help with Macromedia's HomeSite. Download their documentation.
Help with Fookes Software's NoteTab.
FTP help for several programs, including the IE and Netscape browsers.

For help with other software or Web hosting sites, we suggest you review their help files or contact the vendor directly and ask for assistance.

For more information on the Standard for Robots Exclusion (SRE) visit The Web Robots Pages (be careful of the Google flaw and do NOT follow their instructions on the empty Disallow: option in the robots.txt file).

Robot Friendly HTML:

All Web pages are made up of HTML (HyperText Markup Language) code. Make sure the HTML code is "robot friendly" and the proper Tags are in the proper places. Most robot type search engines do not tolerate invalid HTML code. That is, if your code contains invalid HTML Tags, most search engines will not index your Web page or, maybe even worse, will index it incorrectly. This guide and our Web Page Analyzer will help you create "robot friendly" Web pages, but first you should understand the basics.

Every Web page MUST, at a minimum, contain the following HTML Tags:

<HTML>
<HEAD>
<TITLE>Your Web Page Title</TITLE>
</HEAD>
<BODY>
The viewable content of your Web page.
</BODY>
</HTML>

The text between your <TITLE> and </TITLE> Tags should NOT contain more than 60 characters. More on this later.
The above represents the absolute minimum HTML Tags for any and all Web pages. If your Web pages do not contain these Tags, you had better go back to the drawing board before submitting to the robot type search engines. While the above is the minimum, it doesn't mean the robot type search engines will index you properly. The only way to do this is by the use of META Tags.
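
A quick way to check a page against this minimum is a small script. The Python sketch below is not our Web Page Analyzer, just a rough stand-in with hypothetical names: it looks for the HTML, HEAD, TITLE and BODY Tags and applies the 60 character title limit given in this guide.

```python
from html.parser import HTMLParser

REQUIRED_TAGS = ("html", "head", "title", "body")
MAX_TITLE_LENGTH = 60  # limit stated in this guide


class MinimalPageChecker(HTMLParser):
    """Record which tags appear and collect the text inside the title."""

    def __init__(self):
        super().__init__()
        self.seen = set()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        self.seen.add(tag)  # HTMLParser reports tag names in lowercase
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def check_page(html_text):
    """Return the missing required tags, the title, and whether its length is OK."""
    checker = MinimalPageChecker()
    checker.feed(html_text)
    checker.close()
    title = checker.title.strip()
    return {
        "missing_tags": [t for t in REQUIRED_TAGS if t not in checker.seen],
        "title": title,
        "title_ok": 0 < len(title) <= MAX_TITLE_LENGTH,
    }
```

This only checks that the Tags are present. It does not prove the page is valid HTML, so treat a clean result as a first pass rather than a guarantee.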

Besides your HTML you have to consider the URL of the page you are submitting.

Dynamic pages may block Web crawlers!

While it's great to give visitors unique experiences, tailored to their needs, the techniques you use to do that could stop search engines from indexing your content and hence could greatly reduce your potential traffic. Dynamically generated pages are created on the fly from a variety of elements held in databases. Typically such pages have a question mark (?) in the URL. When a search engine crawler arrives at a URL that contains the question mark, it could halt immediately and not follow the links, because it believes an infinite number of pages are ahead -- a black hole that would bring it to a crash.
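
To find out which of your own pages fall into this category, you can sort a list of URLs by the question mark test described above. The Python sketch below uses hypothetical function names and treats any URL with a "?" before the fragment as dynamic. It is a heuristic only: it says nothing about how a particular crawler will treat the page.

```python
def is_dynamic_url(url):
    """Heuristic from this guide: a "?" in the URL marks a dynamic page."""
    without_fragment = url.split("#", 1)[0]  # ignore anything after "#"
    return "?" in without_fragment


def split_static_dynamic(urls):
    """Separate URLs into (static, dynamic) lists, keeping the original order."""
    static, dynamic = [], []
    for url in urls:
        (dynamic if is_dynamic_url(url) else static).append(url)
    return static, dynamic
```

The static list is the one to link from your navigation hub and submit to the robot type search engines.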

Make sure you have a home page, or the equivalent of one. You need a page that serves as a navigation hub for the static pages at your site, to which other pages can point. As a rule of thumb, it shouldn't take more than three clicks to go from anywhere at a Web site to any other page at that same site.
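
The three click rule of thumb can be checked with a breadth-first search over your site's links. The Python sketch below simplifies it in one respect: it measures clicks from the home page only, not between every pair of pages. The link map format (each page mapped to the pages it links to) and the function names are our own assumptions.

```python
from collections import deque


def click_depths(links, home):
    """Breadth-first search: fewest clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest route
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_beyond(links, home, max_clicks=3):
    """Return pages that are unreachable or more than max_clicks from home."""
    depths = click_depths(links, home)
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return sorted(p for p in all_pages if depths.get(p, max_clicks + 1) > max_clicks)
```

Any page this reports is either buried too deep or not linked from the hub at all, so a robot starting at your home page may never reach it.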

By the way, this is one reason why nobody can say how many pages there are on the Web, total. Every dynamic site has potentially an infinite number of pages. And how many millions of dynamic sites are out there?
