Category Archives: Internet

Google, Yahoo! & Microsoft Collaborate For The Greater Good

Google, Yahoo! and Microsoft have collaborated for the greater good and are all going to support a standardised XML sitemap protocol.

Google was the first to implement XML sitemaps, releasing them as a beta product in June 2005. After a few months of public testing, the beta tag was removed and the service was ready for general consumption. Since then, Google Sitemaps has gained a significant amount of momentum.

It’s wonderful to see that Yahoo! and Microsoft didn’t implement another format specific to their own search engines and have instead collaborated with Google. With the standardised XML format, it’s now possible for content publishers to feed the Google, Yahoo! and Microsoft search engines from the same physical file. That point alone is a huge bonus for the publishing community, though I think it’s even more significant that the standardised format will carry the same semantic meaning for all search engines as well.
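For the curious, the shared protocol is a plain XML file. A minimal sitemap listing a single URL might look something like this (the URL and values here are purely illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Location of the page, plus optional hints for the crawler -->
    <loc>http://www.example.com/</loc>
    <lastmod>2006-11-18</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

One file in this format can be submitted to all three search engines, which is exactly the point of the collaboration.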

Google Malware Warning

I was recently searching for information using Google and was surprised to see that they have integrated a malware blocking service into their search results.

As Google index the internet, they are always taking into consideration what content is on your site. They are now using that information to warn their users when a web site might contain suspect or malicious content.

Personally, I think it is an excellent service to provide to the Google user base. A lot of people who simply ‘use’ computers aren’t aware how easy it is for their machine to become infected with all sorts of nasty stuff. At least if they are confronted with an intermediary warning page, it will make them think twice about viewing the site or using any of the content or services it may provide.

Wired Purchases Reddit

It seems that in recent times, acquiring cool community-driven sites has become the new black. In the latest round of bringing black back into vogue, Wired have stepped up to the plate and successfully struck a deal with Reddit.

For those that aren’t aware, Reddit is a social news site. If you’re asking yourself what that means, it’s simple: instead of the site owners deciding what is newsworthy, the users of the site submit the news items. To keep the news relevant and fresh, other users can then vote a news item up or down in importance. The more votes a news item receives in a given period, the higher it will float until it ultimately becomes the #1 story on the site.
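The voting model described above is simple enough to sketch in a few lines of Python. This is a toy illustration of the up/down ranking idea, not Reddit’s actual code; the `Story` class and the net-vote scoring rule are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    upvotes: int = 0
    downvotes: int = 0

    @property
    def score(self) -> int:
        # Net votes decide how high a story floats.
        return self.upvotes - self.downvotes

def front_page(stories):
    # Sort descending by score; the top entry is the #1 story.
    return sorted(stories, key=lambda s: s.score, reverse=True)

stories = [Story("A", 10, 2), Story("B", 30, 5), Story("C", 7, 7)]
print([s.title for s in front_page(stories)])  # → ['B', 'A', 'C']
```

Real sites weight the votes by age so fresh stories can overtake stale ones, but the sort-by-score core is the same idea.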

Reddit was initially funded through Y Combinator and, to the delight of Paul Graham, it was written in Lisp. Not long after the site was launched, its owners controversially rewrote Reddit in Python, to the horror of the Lisp fanatics. It raised such a noise online that soon others were posting articles on how to rewrite Reddit in 100 lines of Lisp.

The question now is: who is next? Rumours have been flying rampant about Digg.com, and it was confirmed that certain parties were interested but that they couldn’t agree on a reasonable (read: less than USD$150M) dollar value.

What Is A Search Engine?

A search engine is typically a software application which receives a simple human-readable text phrase or query and returns results. The search engine evaluates the query and attempts to find the most relevant results in its database. The relevancy of the results is based on complex rules and algorithms which rank each unique resource in the database, and the results are typically sorted in descending order of relevance to the search query.
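That receive-evaluate-rank cycle can be sketched in a few lines of Python. The ranking rule here, counting how often the query terms appear in each document, is a deliberately naive stand-in for the complex algorithms real engines use:

```python
def search(index, query):
    """Return document ids sorted by descending relevance.

    index maps doc_id -> text. Relevance here is just the number of
    query-term occurrences in the document -- a toy scoring rule.
    """
    terms = query.lower().split()
    scores = {}
    for doc_id, text in index.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score:  # only keep documents that match at all
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    1: "search engines rank pages by relevance",
    2: "a directory is curated by people",
    3: "relevance ranking uses search algorithms and search data",
}
print(search(docs, "search relevance"))  # → [3, 1]
```

Document 3 mentions the query terms three times and document 1 twice, so they come back in that order; document 2 never matches and is omitted entirely.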

Search Engine Types

There are three main types of search engines:

  1. human generated
  2. automated or algorithm based
  3. a hybrid of the two previous options

A human generated search engine is what is now generally considered a directory. Users submit their sites and the administrators of the directory review and include sites at their discretion. If a particular web site is accepted, it is evaluated, categorised and subsequently placed within the directory. The most widely known human generated search engine in existence today is the Open Directory (dmoz.org).

An automated or algorithm based search engine does not rely on humans to provide the data that searches run against. Instead, it relies on other computer programs, known as web crawlers or spiders, to gather the data. Once the crawlers have fetched the data, separate computer programs evaluate and categorise the web sites into the search index.
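The crawl-then-index pipeline can be illustrated with a short Python sketch. To keep it self-contained, the `web` dictionary below stands in for real HTTP fetching; an actual spider would download pages over the network and extract links from the HTML instead:

```python
def crawl(web, start):
    """Breadth-first crawl over an in-memory 'web'.

    web maps url -> (text, [linked urls]). Returns the pages visited
    in crawl order, ready to hand to a separate indexing program.
    """
    seen, queue, fetched = set(), [start], {}
    while queue:
        url = queue.pop(0)
        if url in seen or url not in web:
            continue  # skip already-visited or dead links
        seen.add(url)
        text, links = web[url]
        fetched[url] = text
        queue.extend(links)  # follow every outbound link
    return fetched

web = {
    "a": ("home page", ["b", "c"]),
    "b": ("about page", ["a"]),   # links back, but "a" is not re-fetched
    "c": ("news page", []),
}
pages = crawl(web, "a")
print(list(pages))  # → ['a', 'b', 'c']
```

The important structural point from the article survives even in this toy: fetching (the `crawl` function) and evaluating/categorising (whatever consumes `pages`) are separate programs with the fetched data as the interface between them.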

Hybrid search engines combine both the human generated and the algorithm based approaches to increase the quality of the search data. In these systems, the internet is crawled and indexed as in the automated approach; however, the information is reviewed and updated by people as this process takes place.

Search Engine Strengths & Weaknesses

Each technique described above has its own strengths and weaknesses. In a directory style search engine, the quality of the results is often very high because a person reviews the content on each web site and takes the appropriate actions. Unfortunately, due to the ever increasing number of web sites and amount of content on the internet, requiring human intervention to rank and categorise web sites doesn’t scale.

In a purely automated approach, the search engines rely on the speed of software to index the internet. While the human based approach might allow tens or possibly hundreds of pages to be categorised at a time, a search engine spider or crawler is capable of doing thousands or millions. The obvious problem with this approach is that, since the search engines rely on algorithms, the algorithms can be exploited. In years gone by, “webmasters” cottoned on to how this type of search engine worked and started abusing the system by stuffing their sites with keywords which had nothing to do with the primary focus of the page or domain. The search engine would spider the site and suddenly an online shoe shop was coming up in searches for porn, drugs, gambling and more.

The hybrid approach attempts to resolve the two aforementioned issues by crawling the internet using software and then reviewing the results. The algorithms which rate and categorise a particular web site are tuned over time, and the results they produce are monitored very closely to ensure the accuracy of the search results. Companies which implement a hybrid approach have teams of people whose sole purpose is to review the validity of various search results. If they find results which they consider out of place, those results are marked for investigation. If the results they expect do not come up, that is also noted, and sites can be manually included in the search index.

Now that you know what a search engine is, keep your eyes peeled for a follow up on how search engines work.

Search Engine Optimisation (SEO): Demystifying The Black Art

Search Engine Optimisation (SEO) is a black art to a lot of people. Anyone publishing content online who is looking for a lot of exposure for their product, service or general announcement really needs to know about it. The unfortunate reality is that most people don’t know about SEO, and those that do know they should be doing it but don’t really know what it is.

In the coming weeks, I’m going to release a series of short, simple to understand posts about what search engine optimisation is and how it roughly works. Through these simple posts, I’m hoping to demystify the black art of search engine optimisation a little so that the less savvy content publishers can understand it and start taking advantage of it.

Below is a short list of some of the items that are going to be covered:

  • HTML <title> element
  • HTML heading (<hX>) elements
  • Content
  • URL nomenclature
  • Keywords
  • HTML <meta> data
  • HTML markup
  • Images
  • Inbound links
  • Outbound links
  • Intersite links
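As a small taste of what’s to come, several of those items are visible in a single stripped-down HTML page (all of the values below are purely illustrative):

```html
<html>
<head>
  <!-- The <title> and <meta> items from the list live in the head -->
  <title>Descriptive, keyword-rich page title</title>
  <meta name="description" content="A short summary of the page content.">
</head>
<body>
  <!-- Heading elements signal the structure and topic of the content -->
  <h1>Primary heading matching the page topic</h1>
  <h2>Supporting sub-heading</h2>
  <p>Body content with a
     <a href="/related-article">descriptive intersite link</a> and an image:
     <img src="photo.jpg" alt="Meaningful alternative text"></p>
</body>
</html>
```

The upcoming posts will cover why each of these elements matters and how search engines interpret them.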