Are you curious about exactly what the guidelines are for duplicate content, and how you can keep your sites from being unfairly penalized by Google? While some sites attempt to “fool the search engines” with duplicated content, plenty of legitimate businesses have substantial blocks of similar content across URLs or across domains. This article will help you build a strategy for managing similar content and pages so that you are not penalized for duplicate content.
Most legitimate duplicate content occurs when a company uses the same content to describe a product that is accessible through several different URLs. For example:
Other legitimate sources of duplicate content include printer-only or PDF versions of web pages, which contain the same content as the regular site but are formatted for printing. All of these can cause the Google spider to index the pages as duplicate content, which can hurt the site’s page rank or even cause the pages to be removed from the Google search results.
Thankfully, changes in the site configuration, known as “canonicalization,” can be used to tell the Google spiders which URLs are more important, and can keep your site from being penalized for duplicate content. This also lets you tell Google how you would like your web pages to be indexed.
Parameter handling is a method that tells Google which URL parameters to ignore, which keeps its spiders from treating parts of your site as duplicate content. For example, using Google Webmaster Tools you can suggest up to 15 “parameters” on your site that you would like the Google spiders to ignore. If you suggest the parameter “products” as one of these ignored parameters, then:
Would simply be recognized by Google as:
This way, if you had another URL with content similar enough to the above URL to be considered duplicate, but which you considered more important, the Google spider would give that page priority in the search results and ignore the one that contained the duplicate content.
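The effect of ignoring a parameter can be sketched in a few lines of Python. This is only an illustration of the idea, not how Google implements it: the function name `strip_ignored_params`, the `/shop` path and the `page` parameter are assumptions made up for the example, while “products” is the ignored parameter from the text above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: the site owner has asked for the "products" parameter to be ignored.
IGNORED_PARAMS = {"products"}

def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return the URL with every ignored query parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

# Two hypothetical URLs that differ only in the ignored parameter.
with_param = strip_ignored_params("http://www.demo.com/shop?products=shoes&page=2")
without_param = strip_ignored_params("http://www.demo.com/shop?page=2")
print(with_param == without_param)
```

Once the ignored parameter is dropped, both addresses reduce to the same URL, which is why they would no longer be treated as two separate pages with the same content.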
Many companies use different versions of their URL when creating links back to their site. For example, if your main URL is http://www.demo.com, you might also use a non-www version of the URL such as http://demo.com. Using Google Webmaster Tools you can suggest which of these you would like to set as your preferred domain, and Google will then crawl your site and index its information under that domain.
Of course, it's important to remember that if you hadn’t originally chosen a preferred domain, it will take some time before Google begins to index your site the way you have suggested. It might also be a good idea to use a 301 redirect to send traffic from your non-preferred domain to your preferred one.
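The redirect rule itself is simple, and a rough Python sketch of it follows. In practice a 301 redirect is usually set up in the web server configuration, so treat this as an illustration of the logic only. It assumes the www version from the example above is the preferred domain, and the function name `redirect_for` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.demo.com"   # assumption: the www version is preferred
ALTERNATE_HOSTS = {"demo.com"}    # non-preferred versions of the same site

def redirect_for(url):
    """Return (301, target_url) for a non-preferred host, otherwise None."""
    parts = urlsplit(url)
    if parts.netloc.lower() in ALTERNATE_HOSTS:
        # Keep the path, query and fragment; only the host changes.
        target = urlunsplit(
            (parts.scheme, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
        )
        return 301, target
    return None

print(redirect_for("http://demo.com/about?x=1"))
print(redirect_for("http://www.demo.com/about"))
```

The status code 301 marks the move as permanent, which is what tells a search engine to treat the preferred domain as the lasting address.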
If you have several versions of one page, all of which contain what might be considered duplicate content, you can tell the search spiders which of these pages is your primary page. For example, if you have the following two URLs…
…and you wish for the first URL to be considered the primary page, you can include a piece of code in the head section of the second page:
This code basically tells the search engine spiders that the contents of the second page refer back to the contents of the first page. This method can be repeated for every additional page which contains the same information as the primary page:
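The tag usually used for this is a `<link rel="canonical">` element placed in the page's head section. The sketch below generates one in Python, assuming a made-up primary URL (`http://www.demo.com/product`) and a hypothetical helper name; it does not reproduce the exact snippet from the original example.

```python
from html import escape

def canonical_link_tag(primary_url):
    """Build the tag that goes in the head section of each duplicate page."""
    # escape() keeps characters such as & and " valid inside the attribute.
    return '<link rel="canonical" href="{}" />'.format(escape(primary_url, quote=True))

# Hypothetical primary page; every duplicate page would carry this same tag.
print(canonical_link_tag("http://www.demo.com/product"))
```

Every duplicate page carries the same tag pointing to the one primary URL, so only the `href` value needs to be decided for each group of similar pages.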
Including your preferred pages as part of your sitemap indicates to Google which pages of your site should be considered the most important. While this isn't as guaranteed as the above three methods, it's important to set up a sitemap in order to make your site more crawlable and indexable for the Google search engine spiders.
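A sitemap is an XML file that lists the URLs you want crawled. As a minimal sketch, the Python below builds one that contains only the preferred pages. The two page URLs and the function name `build_sitemap` are assumptions for illustration, and a real sitemap file would normally also start with an XML declaration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML listing only the given (preferred) URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical preferred pages; duplicate versions are simply left out.
preferred_pages = ["http://www.demo.com/", "http://www.demo.com/product"]
print(build_sitemap(preferred_pages))
```

Leaving the duplicate URLs out of the list is the whole technique here: the sitemap only ever names the version of each page you want indexed.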
If you have a lot of places on your site where there is duplicate content that absolutely cannot be rewritten, you can always block Google spiders from accessing these areas of your website using a robots.txt file or a noindex meta tag. This encourages the search engine spiders to index only the most important pages of your site and keeps you from having to compete with yourself for space in the Google search engine results. However, Google still suggests that the best method for dealing with pages that include duplicate content is to use the four canonicalization methods above.
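To make the blocking approach concrete, the sketch below checks a sample robots.txt rule with Python's standard `urllib.robotparser` module. The `/print/` directory is an assumption standing in for wherever your printer-only pages live, and the meta tag string shows the usual form of a noindex tag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from the printer-only pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

# Usual form of the noindex tag, placed in the head section of a page.
NOINDEX_TAG = '<meta name="robots" content="noindex">'

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def is_crawlable(url, agent="Googlebot"):
    """Return True if the sample robots.txt allows this agent to fetch url."""
    return parser.can_fetch(agent, url)

print(is_crawlable("http://www.demo.com/print/product"))
print(is_crawlable("http://www.demo.com/product"))
```

Checking a rule this way before publishing it helps confirm that it blocks only the duplicate area and not the primary pages.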