Block bad bots from your site
Jul 10th, 2008 by Richard
Have you ever wondered how that email address listed on your web site gets onto those spam lists, or how copies of your web site content end up in places where you don’t want them?
Well, this often happens because bots and crawlers are spidering your web site data for nefarious reasons.
But you can stop a lot of it by simply adding a list of the bad bots and crawlers to a robots.txt file on your web site.
I’ve managed to block most of the bad bots, as you can see from the chart shown in this post. Only the bots you want to visit, like Google, Yahoo and Bloglines, are spidering my site.
When a well-behaved bot or crawler comes to your site, it normally checks first to see whether you have a robots.txt file and then whether it is listed in that file. If it is listed with a Disallow rule, the bot will go away and not spider your site.
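To make that check concrete, here is a minimal sketch of what a polite crawler does, using Python's standard urllib.robotparser module. The bot name "BadBot", the sample rules and the example.com URL are made up for illustration; they are not taken from my own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: one bot is shut out, everyone else is allowed.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/page.html"  # placeholder URL
    print(is_allowed(SAMPLE_ROBOTS_TXT, "BadBot", page))     # False
    print(is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", page))  # True
```

Keep in mind that this check runs inside the bot. The file only states your wishes, and it is up to each crawler to honour them.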
I block about 100 bots from this site. Have a look at my robots.txt for the full list and feel free to copy my list of bad bots and use it for yourself.
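If you keep your own list of bot names, a small script can turn it into robots.txt rules for you. This is only a sketch: the names "ExampleBadBot" and "AnotherScraper" are placeholders, not entries from my list, and build_robots_txt is a helper invented for this example.

```python
def build_robots_txt(bad_bots):
    """Build robots.txt text that disallows the whole site for each named bot."""
    blocks = []
    for name in bad_bots:
        # One record per bot: a User-agent line followed by a site-wide Disallow.
        blocks.append(f"User-agent: {name}\nDisallow: /\n")
    # Records are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Placeholder names; substitute the bots you actually want to block.
    print(build_robots_txt(["ExampleBadBot", "AnotherScraper"]))
```

Save the output as robots.txt and upload it to your site.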
To find out how to use your robots.txt file to do all sorts of other things, please check out the official Web Robots Page for more suggestions.
NOTE: Be sure to put your robots.txt file in the root of your web space, because that is the only place the bots look for it. If you place the file anywhere else, they will neither read it nor follow it.
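To show why the location matters, here is a rough sketch of how a crawler works out where to look: it keeps the scheme and host of whatever page it is visiting and replaces everything else with /robots.txt. The function name and the example.com URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location a crawler would check for page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; drop the page's path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://example.com/blog/2008/07/post.html?x=1"))
    # http://example.com/robots.txt
```

However deep the page is, the answer is always the same file at the top of the site, so a robots.txt sitting in a subfolder is never requested.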
Great advice, Richard, and great work on the list of bots in your robots.txt.
Since “bad bots” or even “evil bots” that steal content will ignore your robots.txt file and often pretend to be a normal web browser, you may also want to block the bad bots directly in your Apache configuration, either by host name, IP address or User-Agent.
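For sites that run a Python web application instead of configuring Apache directly, the same idea of refusing requests by User-Agent can be sketched as a small WSGI middleware. This swaps in an application-level check and is not Apache configuration. The blocked names are placeholders, and block_bad_bots and is_blocked are helpers made up for this sketch.

```python
# Hypothetical lowercase substrings to look for in the User-Agent header.
BLOCKED_AGENT_SUBSTRINGS = ("examplebadbot", "anotherscraper")


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked substring."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_AGENT_SUBSTRINGS)


def block_bad_bots(app):
    """Wrap a WSGI app so that blocked user agents get a 403 response."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else is passed through to the real application.
        return app(environ, start_response)
    return middleware
```

A bot that fakes a browser User-Agent will still get through a check like this, which is why blocking by host name or IP address is also worth considering.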