Apparently a lot of people don’t understand what duplicate content is, whether they’re publishing content online or applying it to search engine optimisation. The fine folk over at the Google Webmasters Blog have clarified some points, but I’ll recap in case you can’t be bothered to make the jump.
Google define duplicate content as blocks of text replicated across multiple pages, either within the same domain or across multiple domains. The content does not need to be a direct copy; a similar but slightly different copy would also be considered duplicate content.
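To get a feel for what "similar but slightly different" could mean in practice, here is a rough sketch that scores how much two pieces of text overlap. Google don't publish how they detect duplicates, so this is not their method; it uses word shingles and Jaccard similarity, a common textbook approach I've swapped in for illustration. The function names and the shingle size of three words are my own assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Score two texts from 0.0 (no shared shingles) to 1.0 (same shingles)."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        # Both texts are shorter than one shingle, so there is nothing to compare.
        return 0.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "The quick brown fox jumps over the lazy dog near the river bank"
    near_copy = "The quick brown fox leaps over the lazy dog near the river bank"
    print(jaccard_similarity(original, near_copy))
```

A score close to 1.0 suggests the two texts are near copies of each other. Where you would draw the line for "duplicate" is a judgment call, and certainly not a threshold Google have stated.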
Google report that most duplicate content they encounter is served unintentionally. The most common culprits are alternate versions of the same page, such as a printer-friendly version or one for mobile devices. The other example they cited is an online store that lists the same item more than once on distinct URLs, all of which are actually linked to.
Fortunately, Google also advised what isn’t duplicate content. If you were to translate your page using the Google Translate service, then that wouldn’t be considered duplicate content. That does raise the question, though: what if I personally translate an article into German, French and Spanish and it doesn’t match what Google Translate produced; would that still be considered duplicate content?
What will Google do if they find duplicate content? In the vast majority of cases they prefer to filter the content, although they can and do adjust their algorithms from time to time as well. If you’ve got two copies (normal, printer) available on your site and you aren’t blocking either one, Google will decide which one they think should be listed and leave out the other. This is sub-optimal because you could find that Google chooses your printer version over your normal version, eek!
The Google Webmasters have also provided some helpful hints:
- Blocking is the most effective mechanism for a webmaster. If you know you’ve got duplicate content and you don’t want to have the wrong versions listed, then block certain copies using a robots.txt file or a noindex meta tag.
- 301 redirects should be used if you’re restructuring your site. This item isn’t all that pertinent if yours is a traditional static HTML site, where you would physically move files around on the server. It is, however, a good point if you’re using a content management system of some sort, where restructuring your site might actually leave the content on the old URL.
- Link consistently: don’t link to /mypage, /mypage/ and /mypage/index.html; pick one method and stick to it.
- Use top level domains to indicate regional content if you can. Google will have a better chance of knowing that .de represents a German version of your site than they would with a de subdomain or /de/ in the URL. I wonder how Google handle it if you specify using meta tags that the page content is in German?
- Syndicate carefully by making sure that the syndicated content provides a link back to your original content.
- Don’t repeat yourself on common things like copyright or legal fluff; just provide a link to a page describing it in depth.
- The preferred domain feature in Google Webmaster Tools is useful for telling Google whether you want the www or non-www version of your site listed.
- Don’t publish stubs such as pages on topic X with no actual content on them.
- Understand your content management product, so you’re aware of what it does and how it does it.
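On the blocking hint, it's worth checking that your robots.txt rules really do block the copies you think they do. The sketch below uses Python's standard urllib.robotparser module to test a couple of URLs against a set of rules. The rules themselves, the /print/ path, the example.com URLs and the is_blocked helper are all hypothetical; substitute whatever your own site uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of the printer-friendly copies.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""


def is_blocked(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt rules disallow user_agent from url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/my-article",
                "https://example.com/print/my-article"):
        status = "blocked" if is_blocked(ROBOTS_TXT, url) else "allowed"
        print(url, status)
```

This only tells you how Python's parser reads the rules, which should match a well-behaved crawler for simple prefix rules like these. It says nothing about a noindex meta tag, which lives in the page itself rather than in robots.txt.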
I’d like to know a little more about using Google Translate and translating your content in general. I also think there has to be a nicer way to handle regional sites than simply getting the regional domain as well; for a lot of people that simply isn’t possible.