A search engine is a software application that receives requests and returns results based on a simple, human-readable text phrase or query. When a query is received, the search engine evaluates the request and attempts to find the most relevant results from its database. The relevance of those results is determined by complex rules and algorithms that rank each unique resource in the database, and the results of a search request are typically sorted in descending order of relevance to the query.
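To make that query, rank and sort flow concrete, below is a minimal, hypothetical Python sketch. The term-overlap scoring and the tiny in-memory "database" are purely illustrative assumptions; real search engines use far more sophisticated ranking signals.

```python
# Hypothetical sketch: score documents against a query and return them in
# descending order of relevance. The scoring is a toy term-overlap count.

def score(query: str, document: str) -> int:
    """Count how many query terms appear in the document (a toy relevance score)."""
    doc_terms = document.lower().split()
    return sum(term in doc_terms for term in query.lower().split())

def search(query: str, index: dict[str, str]) -> list[tuple[str, int]]:
    """Rank every resource in the 'database' by its score, highest first."""
    ranked = [(url, score(query, text)) for url, text in index.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

index = {
    "example.com/shoes": "buy running shoes and trainers online",
    "example.com/news":  "daily news and weather updates",
}
print(search("running shoes", index))
# [('example.com/shoes', 2), ('example.com/news', 0)]
```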
Search Engine Types
There are three main types of search engines:
- human generated
- automated or algorithm based
- a hybrid of the two previous options
A human generated search engine is what is now generally considered a directory. Users submit their sites, and the administrators of the directory review and include them at their discretion. If a particular web site is accepted, it is evaluated, categorised and placed within the directory. The most widely known human generated search engine in existence today is the Open Directory (dmoz.org).
An automated or algorithm based search engine does not rely on humans to provide the information that searches run against. Instead, it relies on other computer programs, known as web crawlers or spiders, to gather the data. Once the crawlers have collected the data, separate computer programs evaluate and categorise the web sites into the search index.
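As a rough illustration of this two-stage split, here is a hypothetical Python sketch: a crawler that downloads pages and follows links, and a separate step that categorises each page. The seed URL, the regex-based link extraction and the word-frequency "categories" are simplifying assumptions, not how any real engine works.

```python
# Hypothetical sketch of the crawl-then-index split: a "spider" fetches pages
# and follows links, and a separate step categorises each page (crudely, by
# its most frequent words).

import re
import urllib.request
from collections import Counter, deque

def crawl(seed_url: str, max_pages: int = 10) -> dict[str, str]:
    """Fetch pages starting from seed_url, following links breadth-first."""
    pages, queue, seen = {}, deque([seed_url]), {seed_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue  # skip pages that fail to download
        pages[url] = html
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def categorise(pages: dict[str, str]) -> dict[str, list[str]]:
    """The separate program's job: derive crude 'categories' from word frequency."""
    index = {}
    for url, html in pages.items():
        words = re.findall(r"[a-z]{4,}", html.lower())
        index[url] = [word for word, _ in Counter(words).most_common(5)]
    return index

index = categorise(crawl("https://example.com"))
```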
Hybrid search engines combine the human generated and algorithm based approaches to improve the quality of the search data. In these systems, the internet is crawled and indexed as in the automated approach; however, the information is reviewed and updated by humans as this process takes place.
Search Engine Strengths & Weaknesses
Each technique described above has its own strengths and weaknesses. In a directory style search engine, the quality of the results is often very high because a person reviews the content of each web site and takes the appropriate action. Unfortunately, given the ever increasing number of web sites and the volume of content on the internet, requiring human intervention to rank and categorise every web site simply doesn’t scale.
In a purely automated approach, the search engines rely on the speed of software to index the internet. While the human based approach might allow tens or possibly hundreds of pages to be categorised at a time, a search engine spider or crawler is capable of processing thousands or even millions of pages simultaneously. The obvious problem with this approach is that, because the search engines rely on algorithms, the algorithms can be exploited. In years gone by, “webmasters” cottoned on to how these types of search engines worked and started abusing the system by stuffing their sites with keywords that had nothing to do with the primary focus of the page or domain. The search engine would spider the site and suddenly an online shoe shop would come up in searches for porn, drugs, gambling and more.
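To see why that abuse worked, consider a toy occurrence-count ranker (a hypothetical sketch, not any real engine's algorithm): stuffing a page with unrelated keywords trivially inflates its score for those terms.

```python
# Hypothetical illustration of why naive keyword-based ranking is easy to game:
# a page stuffed with unrelated keywords outranks a genuinely relevant page.

def keyword_score(query: str, page_text: str) -> int:
    """Toy ranking: count raw occurrences of each query term on the page."""
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

honest_page  = "reviews of popular casino resorts and gambling destinations"
stuffed_page = "buy shoes online " + "casino gambling " * 50  # keyword stuffing

print(keyword_score("casino gambling", honest_page))   # low score (2)
print(keyword_score("casino gambling", stuffed_page))  # far higher score (100)
```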
The hybrid based approach attempts to resolve both of the aforementioned issues by crawling the internet with software and then reviewing the results. The algorithms which rate and categorise a particular web site are tuned over time, and the results they produce are monitored closely to ensure the accuracy of the search results. Companies which implement a hybrid based approach have teams of people whose sole purpose is to review the validity of various search results. Results that appear out of place are marked for investigation; if the results they expect do not come up, that is also noted and sites can be manually added to the search index.
Now that you know what a search engine is, keep your eyes peeled for a follow-up on how search engines work.