Search engine conundrum
Search engines are a confusing breed of web application. They exist in limbo somewhere between fully automated computer land and the human-driven application. When you create a site you have to add it to the search engines to get them to ‘crawl’ it, and from then on you have lost control of what they find, when they find it, or even IF they find your site at all…
How to make your site invisible by using robots!
Some time in the dim and distant past of the web (about 1997 :p) a draft specification was written that lets a site owner ask web crawlers not to crawl a site. This allowed the owner to hide pages, and it now allows you to hide entire sites. Bear in mind that it is only a request: well-behaved crawlers honour it, but it does not protect a page from anyone who already knows the address.
This special file is very simple and is called robots.txt. It is a plain text file that lives at the root of your website, for example www.theinevitabletruth.co.uk/robots.txt
A file that told all the web crawlers to go away would look like this:
User-agent: *
Disallow: /
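The * after User-agent: means "every crawler"; it could be replaced by the name of a specific crawler, such as Googlebot (Google) or msnbot (MSN).
The Disallow: could be followed by a specific file or directory that you want to hide! If it is followed by nothing, then nothing is hidden and the whole site may be crawled.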
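If you want to check what a set of rules actually does before uploading it, Python's standard library includes a robots.txt parser. The following is only a rough sketch: the helper name is_allowed and the /private/ directory are made up for illustration, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url.

    is_allowed is a hypothetical helper for this post, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so nothing is downloaded.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The "go away" file from above: every crawler, whole site.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Hide a single (hypothetical) directory from every crawler.
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

# An empty Disallow hides nothing.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# Only tell Google's crawler to go away; everyone else is unaffected.
BLOCK_GOOGLE = "User-agent: Googlebot\nDisallow: /\n"

if __name__ == "__main__":
    page = "http://www.theinevitabletruth.co.uk/whatsmyip"
    for name, rules in [
        ("block all", BLOCK_ALL),
        ("block /private/", BLOCK_PRIVATE),
        ("allow all", ALLOW_ALL),
        ("block Googlebot only", BLOCK_GOOGLE),
    ]:
        print(name, "->", is_allowed(rules, "Googlebot", page))
```

Running it prints, for each set of rules, whether Googlebot would be allowed to fetch the whatsmyip page.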
How to make it more agent friendly
Computers are not very good at reading natural languages; they prefer more structure, so to help your crawly friends you have to write something especially for them.
The best example is a Sitemap (the capital S is deliberate: it denotes the XML-based Sitemap protocol).
The simplest example of a Sitemap file is a plain text file containing a list of web addresses, one per line. For example, I could put this in a plain text file called sitemap.txt and place it at http://www.theinevitabletruth.co.uk/sitemap.txt
http://www.theinevitabletruth.co.uk/
http://www.theinevitabletruth.co.uk/whatsmyip
And then add a line to my robots.txt that tells crawlers where to find my Sitemap:
Sitemap: http://www.theinevitabletruth.co.uk/sitemap.txt
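The plain text list is the simplest form, but the protocol proper is XML. As a rough sketch of what the XML version of the same two addresses could look like, here is a small Python script that builds one. The function name build_sitemap_xml is made up for this post, the script needs Python 3.8 or later, and it only fills in the required loc element for each address.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_xml(urls):
    """Build a minimal XML Sitemap (as UTF-8 bytes) from a list of addresses.

    build_sitemap_xml is a hypothetical helper, not part of any library.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        # Each page gets a <url> entry with its address in <loc>.
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    # xml_declaration=True adds the <?xml ...?> header (Python 3.8+).
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    pages = [
        "http://www.theinevitabletruth.co.uk/",
        "http://www.theinevitabletruth.co.uk/whatsmyip",
    ]
    print(build_sitemap_xml(pages).decode("utf-8"))
```

If you saved the output as, say, sitemap.xml at the root of the site (that file name is just my assumption here), the Sitemap: line in robots.txt would point at that file instead of sitemap.txt.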
If you think of any more hints and tips then please comment on them.