Love them, hate them or fear them, the major search engines are some of the most powerful tools at our disposal. When it stopped publishing the figure in 2005, Google had over 8 billion pages indexed in its database. It would simply not be possible to navigate an internet of that size without a comprehensive and accurate search tool.

How do they work? Every search engine works slightly differently, but they have a basic structure in common. They maintain a large store of information on all the websites that they find. They store copies of some of the pages themselves, and they classify each site by the type of information it contains, along with a measure of how useful the page would be to someone searching for information on a certain topic. This store of information is called the index.

The index is fed information from the web by a software tool that goes by several names. You will see references to a web crawler, spider, or webbot, and they all mean essentially the same thing: a tool that moves across the web, visiting web pages and reading their content in order to populate the search engine index.

If you have a site that is not in a search engine index, there are several ways of getting the webbot to visit. You can go to the search engine, manually enter your web address and request a crawl, although this is not always an effective way of getting a visit. The other way is to get links from sites that are already indexed. When the webbot next visits one of those sites, it will follow the link to your site and discover it.

There are in fact several parts to a webbot. The Googlebot has two main components. Freshbot is the part of the crawler that finds new sites. Another component, Deepbot, will then revisit the new site and carry out a more in-depth analysis of it. Don't be surprised if only a small part of your site is initially listed in the index. A deeper crawl will find the rest of your pages, as long as you have structured your site correctly.
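To make the crawl-and-index idea concrete, here is a minimal sketch in Python. It is not how any real search engine is built. The PAGES dictionary is a made-up miniature "web" standing in for real HTTP fetches, and the names PageParser and crawl are illustrative only. The sketch shows the two points made above: the crawler discovers pages by following links, and a page that nothing links to is never found.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical miniature "web": URL -> HTML source.
# A real crawler would download each page over HTTP instead.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">B</a> search engines',
    "http://b.example/": '<a href="http://c.example/">C</a> web crawler',
    "http://c.example/": "index of pages",
    "http://orphan.example/": "nobody links here",
}


class PageParser(HTMLParser):
    """Collects the link targets and the visible words of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed):
    """Visit every page reachable from seed; return word -> set of URLs."""
    index = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # link points outside our toy web
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:  # queue each page only once
                seen.add(link)
                queue.append(link)
    return index


index = crawl("http://a.example/")
print(sorted(index["crawler"]))
print("nobody" in index)  # the orphan page was never discovered
```

Starting from the first page, the crawler reaches the second and third by following links, but the orphan page stays out of the index because no crawled page links to it.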
If you have information on your site that you do not wish the search engine to index, it is relatively straightforward to tell the webbot where you do not want it to look. Your webmaster can add a directive to your HTML, or to a robots.txt file on your server, to keep it away from sensitive data. Of course, search engines do not guess passwords, so password protection is another way of keeping them out.

When you are building your site, you must consider how your pages look to webbots in the same way that you do for your human visitors. Bots cannot read images, for example, so any data in an image will be missed by them. Many sites use Java applets or scripts to generate menus. The bot will miss all the links in such a menu because they are drawn by a program rather than written as text links in the page. Similarly, drop-down menus are very useful for humans, but are invisible to bots.

The crawler will navigate your site using the links that it finds. Having a clear structure is an important part of web design for this reason. If you really want to avoid any linking problems, you can present the search engine with an XML file called a sitemap, which will guide it around your site.

Crawlers have none of the human ability to read a page and quickly determine what it is about and whether it is useful. Search Engine Optimisation (SEO) is concerned with presenting pages so that it is clear to the crawler what the page is about and that it is worth ranking well in the index. This requires a thorough understanding of how the webbots read pages, and also an up-to-date picture of what specific search engines are looking for when they analyse the quality of the content on a site.

As with any new and rapidly changing technology where there are fortunes to be made and lost, SEO attracts some of the best and some of the worst people involved in the internet. Be wary of anyone who cannot show you a clear track record, and avoid anyone who offers you a guarantee of a top ranking in a short time.
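As a rough illustration of the two practical tools mentioned above, keeping bots away from certain pages and handing them a sitemap, here is a short Python sketch. It assumes the common robots.txt convention for exclusion, which is one mechanism among several. The domain www.example.com, the /private/ path, the bot name ExampleBot and the helper names allowed and build_sitemap are all made up for the example.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt content asking all bots to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SITE = "http://www.example.com"  # placeholder domain


def allowed(path, agent="ExampleBot"):
    """Return True if a well-behaved bot may fetch the given path."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, SITE + path)


def build_sitemap(urls):
    """Build a minimal XML sitemap listing the given page URLs."""
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


print(allowed("/index.html"))
print(allowed("/private/accounts.html"))
print(build_sitemap([SITE + "/", SITE + "/products.html"]))
```

Note that robots.txt is only a request: reputable crawlers honour it, but it does not secure anything, which is why truly sensitive data belongs behind a password.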