Protect Your Site's Content by Keeping the Scraping Bots Away

Posted by Carly Jonesla
Sep 13, 2013

Every website on the internet today is regularly visited, or crawled, by software known as spiders, robots, or bots. While these crawlers serve useful purposes, such as discovering new sites and determining search engine rankings, they also have a downside: some of them exist to steal your content. Most webmasters have experienced stolen content at some point, but the good news is that there is plenty you can do to prevent it. If you have a website or are planning to create one, it is important to block bad bots from gaining access to your content.

The Danger of Bad Bots

Many website owners don't know how to read their site's analytics data. Most analytics tools track every hit a site receives, including hits from spiders and bots. A typical website receives hundreds to thousands of such hits per day, which often gives inexperienced owners a false impression of how popular their site is. These same owners also don't realize that some of these bots are malicious and that their content needs protection from scraping. Bad bots need to be blocked in order to protect the site's content.

Many of these crawlers are designed for the sole purpose of scraping websites for content that may be of interest to another site owner. These bots go through a site, copy its content, and that content then appears somewhere else. Content theft has many downsides. If the same content appears on two sites, search engines will recognize it as duplicate content, which can result in a penalty that ultimately reduces your web traffic. Duplicate content can also damage your site's credibility: if visitors see the same images and text on your site that they saw elsewhere, they won't grant you as much authority as they would if your content were unique. It is therefore important to block bad bots and make sure they don't take your content.

How to Detect Bad Bots

 

In order to prevent bad bots from scraping your site, you first need to know how to detect them. Some analytics tools will tell you whether a hit came from a crawler or a spider, but they won't necessarily tell you which crawlers are good or bad. The three main factors to consider when detecting bad bots are the IP address, the requested URL, and the User Agent. If you analyze these three factors, you can figure out which bots are bad and find a way to stop them. Here are three steps you can take using these factors to determine whether a bot is bad (a rough code sketch follows the list):


·     Check to see if a client/spider is following a blank URL.

·     If the URL is blank, get that IP address and perform a reverse DNS lookup.

·     If the DNS lookup points to an untrusted domain, chances are this is a bad bot.
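The same checks can be scripted against your server logs. Below is a minimal Python sketch of the idea; the function name looks_like_bad_bot, the trusted domain suffixes, the user-agent keywords, and the choice to treat a missing reverse DNS record as suspicious are all illustrative assumptions, not part of any standard tool.

    import socket

    def looks_like_bad_bot(ip_address, referrer, user_agent):
        # Hypothetical helper: flag a hit with a blank referring URL whose
        # reverse DNS lookup does not point to a domain we trust.
        trusted_suffixes = (".googlebot.com", ".search.msn.com")  # illustrative whitelist
        bad_agent_keywords = ("scraper", "httrack")               # illustrative examples

        # Factor 1: the User Agent. An agent that openly advertises a
        # scraping tool can be flagged straight away.
        if any(keyword in user_agent.lower() for keyword in bad_agent_keywords):
            return True

        # Factor 2: the referring URL. Only dig deeper when it is blank.
        if referrer:
            return False

        # Factor 3: the IP address. Do a reverse DNS lookup and see whether
        # it resolves to a domain on the whitelist.
        try:
            hostname, _, _ = socket.gethostbyaddr(ip_address)
        except socket.herror:
            return True  # no reverse record at all is treated as suspicious here
        return not hostname.endswith(trusted_suffixes)

For example, looks_like_bad_bot("203.0.113.45", "", "SomeScraper/1.0") would flag the hit on the User Agent alone, while a blank referrer from an address whose reverse DNS ends in .googlebot.com would pass.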

 

Once you have analyzed the bots and identified the bad ones, the next step is to prevent them from scraping your site.

 

Keeping Them Away

 

There are a variety of ways to protect your site from scraping by bad bots. Some tools can help you detect when your content has been duplicated, and other methods make your content harder to steal. The best solution, however, is to prevent the bots from visiting your site at all. To do this, you will need access to your website's code, and you should have some familiarity with coding. If you don't, ask a webmaster for help.

 

In order to block these bad bots, you will need to access the .htaccess file. This file can override your site's configuration settings and, as its name indicates, determine who has access to portions of your site. In your .htaccess file, you will create a "ban" list using the IP addresses of the bots you have determined to be untrusted. This prevents those bots from seeing your site, and ultimately from stealing your content. The process is relatively simple but tedious, and it goes a long way toward protecting your online content and keeping the bad bots away. It is one very reliable method of keeping the bots at bay, but there are other ways to do it as well; the important thing is to find an approach that works for you as a website owner.
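As a rough illustration, a ban list in .htaccess might look like the snippet below. The IP addresses and the "BadBot" user-agent string are placeholders standing in for whatever your own log analysis turned up, and the directives assume an Apache server that allows .htaccess overrides.

    # Block specific IP addresses identified as bad bots (placeholder addresses)
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.45
    Deny from 198.51.100.0/24

    # Optionally refuse requests whose User-Agent matches a known scraper
    # (requires mod_rewrite; "BadBot" is a hypothetical example)
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
    RewriteRule .* - [F,L]

Note that newer Apache releases replace the Order/Allow/Deny directives with Require rules, so check which syntax your server expects before applying a list like this.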

 

Every website is visited by hundreds of crawlers per day. While many of these spiders and bots are necessary for important things like determining search engine rankings, there are also many bad bots out there trying to steal your content. It's important to block bad bots from accessing your site, because content theft can lead to a loss of web traffic. By blocking bad bots, you can protect your content and keep your site a credible and popular resource for your intended audience.
