Duplicate content existing on a site is not so much the problem, because it can easily be kept out of the index with robots.txt or meta tags. The problem, as I see it, is how far to go. A good example is this WordPress blog.

When I was setting it up a couple of weeks ago, I went looking for good SEO tips and came across the duplicate content issue. The home page, category pages, archives, tag pages and search pages could all be seen as duplicates of the primary content contained within the posts. After some more reading, I found many people saying that it is best to stop these duplicate pages from being indexed at all.
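As a rough illustration of the robots.txt route, here is a minimal Python sketch that checks which paths a set of rules would block. The paths `/tag/` and `/archives/` are hypothetical and depend on the permalink setup; they are not taken from this blog. Bear in mind that a Disallow rule asks crawlers not to fetch a page, whereas the robots meta tag is what asks them not to index it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the paths are assumptions for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
Disallow: /archives/
"""


def is_crawlable(path: str, agent: str = "*") -> bool:
    """Return True if the rules above let `agent` fetch `path`."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ["/some-post/", "/category/seo/", "/tag/seo/", "/archives/2008/"]:
        print(path, "allowed" if is_crawlable(path) else "blocked")
```

The standard-library parser only does simple prefix matching, so this is a quick sanity check rather than a model of how any particular search engine reads the file.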

Certainly I see no point in indexing archive pages, since their keyword density is unlikely to be very high. I could even make the same case against indexing tag pages, since categories already provide an equally keyword-rich source of content. The front page should of course be indexed, but its paged content (page 2 onwards) probably should not be; again, the keyword density would probably be no higher than for categories. So that leaves me with indexing the front page, the article pages themselves, WordPress pages (as opposed to paged content) … and category pages.
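The list above boils down to a small rule, sketched here in Python. The page-type labels and the `robots_meta` function are hypothetical names made up for this example, not WordPress identifiers, and using `noindex,follow` (so that links on excluded pages are still followed) is my own assumption rather than something settled above.

```python
# Page types kept in the index, per the list above. These labels are
# hypothetical and chosen for illustration only.
INDEXABLE = {"front", "post", "page", "category"}


def robots_meta(page_type: str, page_number: int = 1) -> str:
    """Return the robots meta tag for one rendered page."""
    # Only the first page of an indexable type is kept in the index;
    # paged content (page 2 and beyond) and every other type is excluded.
    indexable = page_type in INDEXABLE and page_number == 1
    # Assumption: excluded pages still let crawlers follow their links.
    content = "index,follow" if indexable else "noindex,follow"
    return f'<meta name="robots" content="{content}">'


if __name__ == "__main__":
    for page_type, number in [("front", 1), ("front", 2), ("tag", 1), ("category", 1)]:
        print(page_type, number, robots_meta(page_type, number))
```

Anything not named in the set, such as archive, tag or search pages, falls through to `noindex`, which matches the "that leaves me with" reasoning.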

Should I index the category pages? I asked myself. On the one hand, the category pages are duplicates in that they are replicas (albeit excerpts in my case) of the article content. I also sometimes put an article into more than one category, which means more potential duplicate pages. On the other hand, a category page may well have a higher density of relevant keywords than any single article in that category, which could mean it does better in the search engines.

Initially, I chose to stop category pages from being indexed as well, thinking that one page per article (if we exclude the front page excerpts) would mean the least possible chance of being penalised for having duplicate content on the site. Earlier today, I was reading around some more, trying to see if there was any more guidance on this issue, but nothing really stood out. Some articles still proclaim that all duplicate content should be prevented from being indexed, whilst others say that the duplicate content penalty is more of a scare story than anything else.

It seems to me that Google is not out to penalise the average user for having some duplicate content on their site. They say as much in their help section. But from the articles I have read, adding meta tags or robots.txt rules has helped some people, which suggests that Google probably doesn’t always get it right. Until they find a more accurate way of telling malicious content apart from the average CMS setup, it would seem to require a balancing act on the part of the webmaster.

So, maybe I will allow indexing of my category pages (the first page at any rate) after all, and we will see how things go. I shouldn’t forget that, in my opinion, the biggest factor in all this is to always have links coming from relevant pages, pointing back to your own pages, so that will be the next step!