Duplicate Pages and Search Engines
Publisher: Martin Lukac
Date: 2007-02-27
Most of us have been there when it comes to duplicate pages. You may carry news and articles supplied by news companies through a news feed. Or your website may have been designed by a third party that uses the same tools on all its clients' websites.
But did you know that this duplicate content can lower your search engine ranking? In today's SEO-driven website world, it is important to avoid duplicate content as much as possible.
What are duplicate pages for search engines?
It is obvious that duplicate means the same as something else. For search engines it means the exact same page, the exact same title, keywords, etc. Search engines send their spiders (Google's spider is called Googlebot and Yahoo has Yahoo Slurp, for example) to crawl your website for content or information. These spiders look for certain information: What is this page about? Is it relevant to your website at all?
If a search engine finds the same content on a different website, it starts to compare the content. First it looks at when the content was indexed and from which website it was first submitted to search engines. Then it starts to look for links. Are there any links on these pages that point to the same source? This way the search engine can determine which website has priority and assign a higher rank to that site.
But what about the rest of the websites? What will happen to them? If you have articles or news that are duplicate content, your website will drop lower in search engines. It may take some time until a search engine actually visits your website and applies a duplicate content filter. Search engines will then simply filter your website out of their search results.
What if you pay a licensing fee for these articles?
The only way for search engines such as Google to understand this is for you to e-mail them and explain your content. You will need to submit a re-inclusion request.
What if you like free articles published on different article sites?
No matter what you do, the search engine will hit you with the duplicate content filter; however, you can teach the search engine about your content.
First, reduce the number of articles that you have copied. Get some fresh, well-written articles. Make sure that the majority of your content is your own and the rest is articles.
Second, articles must be related to your field. If you write about real estate, make sure you stick with real estate and finance articles only. Show search engines that you have categories and that your readers benefit from these articles.
Third, submit a re-inclusion request and wait. In a few weeks you should see some movement in your site ranking. If it is not improving, reduce the number of articles again.
With free articles it is a cat-and-mouse game all over. It is hard to figure out how Google's algorithm is set up for duplicate content, or how Google deals with reproduced articles that you can freely post anywhere.
What if you have a website that was designed by a third-party company?
Most website companies use the same or similar templates. That is, they give you and the rest of their clients the exact same FAQ pages, About Us pages, etc.
You want to fix this as soon as possible. Always rewrite these pages yourself. This way you will create unique pages for your website.
Duplicate content is one of the biggest mistakes many webmasters make when creating websites. Always try to create your own posts, and if you really want to expand your article base, make sure you provide valuable content to your users, not just another article website.
Related Article: Duplicate Pages and Search Engines
Fritz Dorado
2007-07-09
Title: Use A Duplicate Content Checker To Boost Your Traffic
The buzz on duplicate content penalties is almost deafening. Some people think they are a myth, while others strongly believe that search engines are out to hunt down these so-called posers and give them the worst punishment possible. However they are defined, duplicate content penalties do occur. The bottom line is that search engines aren't big fans of duplicate content at all, so why even have it on your website?
The last thing any search engine would want is to give its users an unsatisfying search experience. Search engines are doing everything in their power to provide optimum search results. By constantly improving their algorithms and filtering duplicate content, they present their users with the most relevant and unique listings for search results. This is the main reason you use search engines in the first place. For them to work to your advantage as a website owner or blogger, you will need high-quality content that is both unique and informative. This way, search engine results related to your niche pull up your page as a primary valid listing.
How do search engines deal with duplicate content exactly, you ask? Google, for instance, uses a supplemental index found within its database that acts as a filtering mechanism. Basically, it weeds out websites and blogs that have duplicate content. Google uses a spider called Googlebot to collect and analyze similar content found in different web pages. It selects a few of these web pages and presents them in related searches. Meanwhile, those that are disregarded are placed in Google's supplemental index. This doesn't mean your site is thrown into the void, never to be found again; it is merely positioned at the end of search listings, which makes it almost impossible for search engine users to stumble upon your site. Duplicate content doesn't do you or your site any good at all. You want significant traffic to pour into your site.
The best way to boost traffic to your site with SEO is to create original content. Writing unique content for your readers is like coming up with a remedy for a particular disease. People are always looking for something that will satisfy their curiosity, but if you give them information that they've already heard a thousand times over, then you are not really bringing anything new to the table. A good website or blog thrives on well-written and original content; that is a fact. By providing original content, you are giving search engine users a pretty good reason to visit your site.
It isn't easy to come up with purely original content all the time. You do your best to write original content, but sometimes it still isn't enough. The good news is that there are tools available to help you maximize your original text output. The best of the lot, I would say, is a duplicate content checker. This tried-and-tested tool analyzes and checks your articles for duplicate text. A duplicate content checker basically goes over your own material, checks it against other available web content, and hits you with a red flag if matching text is detected.
All in all, without original content, your site could be as good as invisible. Be seen and be a valuable source of online content. Write unique copy and use a duplicate content checker every chance you get. By doing so, you're sure to get some Google-love and, ultimately, a decent amount of traffic to your site.
About the author: Fritz Dorado is a consultant with Webmasterlabor.com. She highly recommends using the duplicate content checker for better website content.
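To make the idea of a duplicate content checker concrete, here is a minimal sketch in Python. It is not the tool the author recommends, whose internals are not described; it simply compares two texts by their overlapping three-word sequences ("shingles") and reports the share they have in common (Jaccard similarity). The function names, the shingle size, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text: str, size: int = 3) -> set:
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < size:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str, size: int = 3) -> float:
    """Jaccard similarity of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


def is_duplicate(a: str, b: str, threshold: float = 0.5) -> bool:
    """Raise a red flag when the overlap reaches the assumed threshold."""
    return similarity(a, b) >= threshold
```

A real checker would also have to fetch candidate pages from the web, which this sketch leaves out; it only shows the comparison step.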
Vinay Choubey
2007-02-15
Title: The Most Common Reason for Dropped Ranking
A duplicate website is a website that has many, if not all, of the same pages as another live website. Duplication is the most common reason for a drop in website ranking. The major search engines are constantly trying to improve the quality of their search results in an effort to provide the best quality content for users. When duplicate content is indexed by search engine spiders, valuable time and processing power are wasted. As a result, search engines have blocked sites that used duplicate content from their databases, ultimately favouring the site that either had the content first or, I believe, the site that has the longer online history.
In addition, the major search engines have been left with a bad taste after dealing with so much duplicate content created by spammers over the past several years. As a result, posting a duplicate website is an offense that can quite literally blacklist a domain; there are few things the search engine properties dislike more than being gamed by spammers. Deleting the site is the only option unless you want to create an entirely new website with unique content and a unique purpose.
That said, when deleting the website you can still ensure that the effort you put into promoting the old site does not go to waste by pointing its domain to your new website's domain using a 301 redirect. A 301 is an HTTP status code that Google and other search engines will 'see' when they visit the old site. The response essentially says that your content from the old site can be found on the new site and that this is a permanent forwarding of all traffic. 301 redirects are by far the best way to minimize your losses from shutting down a website that just might have traffic or inbound links. It is very important that you keep the website that has the most backlinks and has been online the longest.
Switching a website to a new domain is a dangerous step. This is because of Google's famed 'sandbox'. The 'sandbox' is really only an overused turn of phrase that represents a portion of the Google algorithm which considers the age of the domain as a signifier of trust. Generally, new websites will require 6 months to a year before substantial rankings are evident; this is a kind of rite of passage that Google appears to be enforcing on the average website. Sites that are obviously popular and quickly gain a load of legitimate link popularity will easily avoid the sandbox (because Google cannot afford to miss a 'great' website), but this is not the common scenario.
How to avoid duplication: In most cases the amount of duplicate content used within a template in a content management system (CMS) is negligible. If, however, you have a large number of pages created using a page where 90% of the text is duplicated and only 10% is unique, you do have a reason to make some changes. In my opinion it is crucial that every page within a website be composed mostly of unique content, with the exception of catalogues and shopping carts, where text simply has to be reused over and over.
For more details on Website Ranking visit www.halfvalue.com and www.halfvalue.co.uk. For more Books information visit www.lookbookstores.com
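The 301 redirect described in this article can be sketched with Python's standard http.server module. This is only an illustration of what the old site's server would answer; in practice a redirect is usually configured in the web server itself. The new domain www.new-example.com, the port, and the names redirect_target, PermanentRedirect, and serve are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_SITE = "https://www.new-example.com"  # hypothetical new domain


def redirect_target(path: str, new_site: str = NEW_SITE) -> str:
    """Map a path on the old site to the same path on the new site."""
    if not path.startswith("/"):
        path = "/" + path
    return new_site.rstrip("/") + path


class PermanentRedirect(BaseHTTPRequestHandler):
    """Answer every request with a 301 pointing at the new site."""

    def do_GET(self):
        # 301 tells browsers and crawlers that the move is permanent.
        self.send_response(301)
        self.send_header("Location", redirect_target(self.path))
        self.end_headers()

    do_HEAD = do_GET  # crawlers may also send HEAD requests


def serve(port: int = 8080) -> None:
    """Run the redirecting server until interrupted (port is arbitrary)."""
    HTTPServer(("", port), PermanentRedirect).serve_forever()
```

Calling serve() on the old domain would forward a request for /about.html to https://www.new-example.com/about.html, so each old page keeps pointing at its counterpart rather than at the new home page.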
Monica Corral - Lorica
2007-02-08
Title: Guide To Search Engine Paid Inclusion
Of the hundreds of search engines today, some offer paid URL inclusion. What is paid URL inclusion, and how does it differ from the regular free listing on the search engines? Paid URL inclusion means you pay a particular amount annually for some search engines to index your web pages. Since all the search engines have “crawlers” and “spiders” to index all web pages online, why is there still a web marketing technique like paid URL inclusion? It is true that the search engines will eventually find your web pages and index them; however, it could take at least a couple of months for them to finally index the entire website. This is where paid URL inclusion has the advantage. It offers faster indexing of your web pages.
Paid URL Inclusion Process
Search engines that offer a paid URL inclusion service promise faster spidering and indexing of the website’s pages. Faster indexing means faster visibility of your website on the search engine results pages. This is done through an extra “spider” that indexes paid inclusions. This extra spider indexes paid inclusions faster than the regular “spider” that indexes web pages for free. The primary difference between free URL inclusion and paid URL inclusion in the search engines is the speed at which the pages of your website are indexed. You know how speed matters in terms of indexing your website, right? This is the reason why paid URL inclusion is so tempting.
Paid Inclusion Annual Fee And Renewal Fee?
Paid URL inclusion asks the website owner to pay a specific amount in exchange for faster spidering and indexing of the web pages. Some webmasters ask why there is still a renewal fee when the website and all its pages are already fully indexed. The renewal fee is actually for the purpose of keeping your website in the search engine’s index. Be informed that search engines offering paid URL inclusion remove your website and its web pages from the search engines’ index after the pay period. This means you have to wait for your site to be spidered again. Your website will be lost for a particular duration of time until it is indexed again.
Is Paid URL Inclusion For You?
The fact remains that the search engines will still spider your web pages and index your website whether you pay an annual fee for paid URL inclusion or not. Thinking of the long wait ahead for free search engine URL submission, you may opt to use search engine paid URL inclusion instead. If you want immediate spidering and indexing of your web pages and your entire website, paid URL inclusion is the web marketing service for you!
This article is written by nPresence, an online web marketing agency that specializes in Search Engine Optimization, Pay Per Click advertising, Content Management Systems, Web Design, Conversion Tracking and Analysis. For all your pay per click and web marketing needs, please see Pay Per Click Services.
Danny Wirken
2006-09-10
Title: How And Where Search Engines See Duplicate Content
Introduction
Search engines have become the gateway to information on the Internet. Search engines are so important that websites find they need to rank well in search engine results pages (SERPs) in order to get noticed. With numerous websites vying for the coveted top 30 results listed in SERPs, more and more website owners are using search engine optimization (SEO) techniques to improve their rankings. People who use SEO know that there are certain factors that can affect your ranking positively and, of course, negatively. Of the negative factors, one of the most well-known is duplicate content. Search engines are biased against duplicate content. As a matter of fact, some sites do not get listed in SERPs because of this factor. This happens when crawlers do not index sites which they have previously determined to be duplicates of another site. The crawlers skip the duplicate site to be more efficient and save time. Crawlers also do this for another reason: to avoid listing duplicate pages in SERPs and thus pointing users to different sites containing the same information. Search engines do not like that to happen because it would be irritating for users who expect to see different sites behind the different links they click. For similar sites, search engines also usually list just one of the sites and relegate the others under a link that says See related pages. For those that manage to be listed in the SERPs, the page rank is still usually affected, which in turn affects the site's standing.
Where Search Engines See Duplicate Content
So where do crawlers see this duplicate content? And what content might they interpret as duplicate? According to an article by William Slawski on Duplicate Content Issues and Search Engines, search engines see duplicate content in the following kinds of web pages:
1. Product descriptions from manufacturers, publishers, and producers reproduced by a number of different distributors in large ecommerce sites.
2. Alternative print pages. This happens when user-friendly website owners offer copies of the same documents in different formats for varied printing options. Although helpful to users, these might actually be indexed by crawlers as duplicate pages.
3. Pages that reproduce syndicated RSS feeds through a server side script.
4. Canonicalization issues, where a search engine may see the same page as different pages with different URLs.
5. Pages that serve session IDs to search engines, so that they try to crawl and index the same page under different URLs.
6. Pages that serve multiple data variables through URLs, so that they crawl and index the same page under different URLs.
7. Pages that share too many common elements, or where those are very similar from one page to another, including title, meta descriptions, headings, navigation, and text that is shared globally. This is common for company websites that insist on having their logo, description, etc. put on every page of their website.
8. Copyright infringement. Plagiarism is of course a good reason for not being indexed. The problem is that crawlers cannot distinguish the original from the duplicate and might mistakenly filter out the original instead.
9. Use of the same or very similar pages on different subdomains or different country top level domains (TLDs).
10. Article syndication. Some writers allow their articles to be published on other websites as long as they are given credit for their work. The problem arises when the crawler sees the original article as the duplicate and opts to index the duplicate page or at least give it a higher rating.
11. Mirrored sites. Mirrored sites are used to handle the traffic of a very popular site. Mirror sites have a good chance of being ignored by web crawlers and so won't be indexed.
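Items 4 to 6 in the list above all come down to one page being reachable under several URLs. As a rough illustration, the Python sketch below reduces such URLs to a single canonical form so that they compare equal. The rules it applies (dropping a leading "www.", removing session parameters, sorting the remaining parameters) and the parameter names in SESSION_PARAMS are assumptions made for this example, not the behavior of any particular search engine.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; real sites use their own.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}


def canonical_url(url: str) -> str:
    """Normalize a URL so that equivalent addresses compare equal."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumed rule: treat www and non-www as one host
    # Drop session IDs, then sort so parameter order no longer matters.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    )
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is discarded.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(query), ""))
```

With these rules, HTTP://WWW.Example.com/page?b=2&a=1&sid=XYZ#top and http://example.com/page?a=1&b=2 both reduce to the same string, so a site owner could link to, or redirect to, that one form consistently.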
How Search Engines See Duplicate Content
There are many methods employed by different search engines to determine pages with duplicate content. The methods differ in many ways, from the concept, to the algorithms, and of course their effectiveness. Search engines are, however, all finding new ways to improve their methods for detecting duplicate content, as seen in the patents filed by different search engine companies like AltaVista, Microsoft Corporation, and Google, and by other bodies like Digital Equipment Corporation and even the Regents of the University of California. The different patents include methods for detecting query-specific duplicate documents, detecting duplicate and near-duplicate files, clustering closely resembling data objects, identifying near-duplicate pages in a hyperlinked database, indexing duplicate database records using a full-record fingerprint, indexing duplicate records of information of a database, utilizing information redundancy to improve text searches, methods and apparatus for detecting and summarizing document similarity within large document sets, and finding mirrored hosts by analyzing URLs. Each method is unique and interesting in its approach. The methods vary greatly, from generating fingerprints for records to using query-relevant information to limit the portion of the documents to be compared. Discussing each method would be interesting and would shed light on how different search engines approach the problem. The new methods are all innovative, and if some of them were used in concert with each other, it would surely improve the search engines' ability to detect duplicate documents. However, since the patent holders are competing companies, it is unlikely that there will be collaboration between them.
Conclusion
As search engines further refine their methods for detecting duplicate content, it will be harder for plagiarists to get away with what they do. However, web pages containing duplicate content for a good reason could suffer as well. Furthermore, since none of the published patents tackled the issue of differentiating the original content from the duplicates, refinement of the search engines' methods might mean further trouble for the website owners of original content. Because of this, search engines ought to find ways and invent new methods for distinguishing original content from duplicates, as well as for recognizing valid duplicate content.
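The fingerprinting idea mentioned in this article can be illustrated with a short Python sketch. It does not reproduce any of the patented methods, whose details are not given here; it only shows the general principle of hashing a normalized copy of each page and grouping pages that share a hash. The function names and the choice of SHA-256 are assumptions.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after collapsing case, punctuation, and whitespace."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


def group_duplicates(pages: dict) -> list:
    """Group URLs whose page text produces the same fingerprint."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(fingerprint(text), []).append(url)
    # Only groups with more than one URL contain duplicates.
    return [urls for urls in groups.values() if len(urls) > 1]
```

Note that a fingerprint of this kind only says that two pages match. As the conclusion points out, it cannot say which of them is the original.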
Andrew Shiveley
2008-01-30
Title: How a Duplicate Content Penalty Can Hurt Your Online Business
Creating good, unique content for your blog or website can be intensive and time consuming, and you are the person who should reap the full rewards of your efforts. There is, however, a growing trend in internet marketing referred to as 'splogging' (a combination of spam and blogging) that can sometimes inadvertently punish honest webmasters, even if they create unique content from scratch.
Splogging is a practice that the search engines look down upon, and it consists of creating a free blog and using automated software programs to "scrape" together content from various sources. These automated programs are web robots that scour the internet looking for information and articles related to a certain topic, and they will copy various pieces of content from different sources to create a new page.
This scraped content is put into the blog and surrounded with ads, and the owner will then use spam or other dishonest promotional practices to try to drive traffic to his blog. None of the content on this blog or "splog" is original; it is simply articles from different websites that are copied and posted.
When it comes to the merchants of information such as Google and Yahoo, they are constantly monitoring the status of the internet and working to improve their search engine algorithms so that they can deliver relevant and useful information to people. One of the latest developments in search algorithms has been the addition of a "duplicate content filter" which will penalize the ranking of certain webpages if they contain information that is identical to some other webpage.
The search engines look down on this type of behavior because their goal is to provide people with information that is relevant to their search, and they do not want these people going to a page that is poorly designed, stuffed with ads, and filled with unoriginal and useless content. So in order to combat this, the search engines implemented a filter that will block all webpages containing the same content except for one of them.
Even if you are the most honest and genuine webmaster around and the thought of doing something like this would make it hard for you to sleep at night, there is still a chance that you and your online business could be penalized by the duplicate content filter. This happens when a web robot or content scraper comes across your webpage and copies its content before it is indexed by the search engines. Your content (which may be original and painstakingly and laboriously crafted) is then posted on the "splog" and gets indexed before your own page does.
Then when the search engines finally discover your page, they will already have a saved copy of a separate page that already contains this exact content and will penalize the rankings of your webpage instead of the unoriginal blog which should be penalized.
Another way that your business could be negatively affected by the duplicate content penalty is if you turn to submitting articles as a method to garner traffic and publicity. Many webmasters will choose to create articles for their website or blog and then submit those articles to different article directories.
While this practice in and of itself is not a bad idea, one of the consequences to submitting the same articles that you put on your own website is that the search engines might index the version on the article directory first before indexing the version on your website, and they will slam you with the penalty instead of the directory. This is especially likely if you submit your articles to big, popular content directories such as Ezinearticles and ArticlesBase.
The way to avoid this is to write your articles and publish them on your own website, and then wait about a week or so before you submit them to article directories. Ideally you would want to wait until you can find your article in Google so that you know that it is indexed and that it counts as the original version (meaning all other versions will be penalized).
sourav
2007-09-25
SEO Deadly Sins
Mistakes that reduce web page ranking
The following is a list of SEO mistakes that can ensure your site maintains a low ranking with the search engines. Avoid them at all costs.
* Specifying no title for your page
* Excessive use of images or Flash animation on a page
* Complicated menu systems
Specifying no title for your page:
I cannot stress enough how important the title of a web page is. Failing to specify a descriptive, keyword-optimized title will do untold damage to your ranking with the search engines. It is the equivalent of owning a shop and boarding up its windows. Ideally each page on your site should have a unique, content-specific title following the guidelines specified in the Title Tag Optimization article.
Excessive use of images or Flash animation on a page:
If your web page has plenty of nice-looking graphics and eye-popping Flash animation and not a lot of textual content, it may indeed look nice, but have you ever considered how the search engines might see it? Search engines thrive on textual content, scavenging as much text as they can, but unfortunately they cannot understand images or Flash animations like we can and so will find nothing of real value on your page. Try to balance your page so that the textual content is given priority and any images or animations are used only when needed. It is also a good idea to attach some text to an image by using its ALT attribute, as search engines use this text when determining rank.
Complicated menu systems:
Search engine spiders that crawl through our pages are a relatively primitive bunch. They find it hard to navigate complicated menu systems implemented, for example, in JavaScript or as a Java applet. Just because it is easy for a human to navigate through the site, never assume it will be as easy for a search engine spider. A menu system using simple textual links will be easier for a spider to understand, and it will be able to successfully navigate your site. A lot of the time complicated menu systems can be replicated using textual links and CSS. If you must use a complicated menu system, be sure to provide a site map that is clearly accessible from the homepage of your site and contains only textual links to your pages. This ensures that even if the spider cannot understand your menu system, it will be able to find the pages on your site....
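The first two mistakes above, a missing title and images without ALT text, can be checked mechanically. The Python sketch below uses the standard html.parser module to do so; the class name PageAudit and the function audit are made up for this illustration, and treating an empty ALT value as missing is an assumption (an empty value can be a deliberate choice for purely decorative images).

```python
from html.parser import HTMLParser


class PageAudit(HTMLParser):
    """Collect the page title and count images lacking ALT text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.images_missing_alt = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "img":
            alt = dict(attrs).get("alt")
            # Assumption: an absent or blank alt value counts as missing.
            if not alt or not alt.strip():
                self.images_missing_alt += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def audit(html: str) -> dict:
    """Return a small report on the title and image ALT text of a page."""
    parser = PageAudit()
    parser.feed(html)
    parser.close()
    return {
        "has_title": bool(parser.title.strip()),
        "images_missing_alt": parser.images_missing_alt,
    }
```

Running audit over each page of a site gives a quick list of pages that need a title or image descriptions before a spider visits them.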
Link Popularity - Search Engines Page Rank Technology
What is Link Popularity?
Link popularity literally means the popularity of links to your site on other web sites. The more popular your site is, especially with well-ranked sites, the higher your site will be ranked. For this reason we prefer to call link popularity link quality. Most search engines determine the popularity of your individual site pages, so you need to build up links to all your major pages. Page ranking will therefore vary across your site's pages and keywords.
Search Engine Optimization
To optimize your site for search engines, your pages should have relevant content with all related keywords appearing as valid reading matter. If your site doesn't have quality content, or you have just repeated your keywords without any sense, your page rank will be affected or, worse, you could be blacklisted by the search engines.
Oleg Ishenko
2006-12-06
Title: Duplicate Content: What You Ought to Know About
Take a look at your website. How much of your content might be considered as duplicate by a search engine algorithm? Even though you never copy anyone you can't answer 'none' because someone can be copying you. Duplicate content is one of the biggest issues both for search engines trying to keep their results' relevancy high, and webmasters trying to avoid search engine penalties.
Penalties for having duplicate content can be really harmful. The result is not just a downgrade in rankings but a move to the supplemental results, which are hardly visible to most web users. Normally it is expected that Google would select one URL over another to display in SERPs, while duplicates could be found in supplemental results. Unfortunately this is not always so. In the thread "Duplicate content observation" in the WebmasterWorld.com forum you can read about a case where an original, high-quality, and authoritative page was removed from Google's index together with its duplicates. Considering that this can happen even to the most honest webmaster, one can imagine the amount of attention this issue gets on any SEO forum.
Types of Duplicate Content
Duplicate content has a wider definition than 'copy-paste' plagiarism; it is not just content scraped from a competitor's site, a SERP or an RSS feed. Apart from this, there are a few more aspects that are generally referred to as duplicate content.
Circular Navigation
Jake Baille from TrueLocal vaguely defines circular navigation as having multiple paths across a website. This can be understood as the same content being accessible via different URLs. An example of circular navigation is an article that can be retrieved by links like example.com/articles/1/, mysite.com/article1/ and mysite.com/articles.php?id=1
Another legitimate use of multiple URLs is forum threads. Each thread can be accessible by a link like myforum.com/index.php/topic.1201.html , and each message within the thread has a URL like myforum.com/index.php/topic.1201.msg.01.html . In the eyes of a search engine all these links lead to different pages with identical content. The solution? Think of a consistent way of linking, or apply robots.txt exclusion rules.
The same problem arises when other people link to you using different-looking URLs. Since these external links are out of your control, you should create a 301 redirect from each variant to the canonical URL you want displayed.
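The redirect idea can be sketched in a few lines of Python. This is only an illustration: the URL patterns, the choice of /articles/N/ as the canonical form and the function name redirect_for are all assumptions made for the example, not part of any particular server or framework.

```python
import re

# Hypothetical alternate URL shapes for the same article, each mapped to the
# canonical path /articles/<id>/ (the canonical form is an assumption).
CANONICAL_PATTERNS = [
    (re.compile(r"^/article(\d+)/?$"), r"/articles/\1/"),
    (re.compile(r"^/articles\.php\?id=(\d+)$"), r"/articles/\1/"),
]

def redirect_for(path_and_query):
    """Return (301, canonical_path) for a known alternate URL, else None."""
    for pattern, template in CANONICAL_PATTERNS:
        match = pattern.match(path_and_query)
        if match:
            # expand() substitutes the captured article id into the template
            return 301, match.expand(template)
    return None  # already canonical, or not a URL this table knows about
```

A request handler would call redirect_for with the requested path and, when it returns a pair, answer with that status code and a Location header, so that every variant collapses onto one indexed URL.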
Printer-Friendly Versions
Making a printer-friendly version is a common practice and it adds value for visitors. But a printer-friendly version is also a prominent example of duplicate content! Fortunately, a simple solution like adding a 'noindex' meta tag to your print pages solves the issue.
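As a rough way to check that print pages really carry the tag, the following Python sketch parses a page and reports whether a robots meta tag contains noindex. The class and function names and the sample markup are made up for illustration; only the standard library's html.parser module is used.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # content looks like "noindex, follow": split it into directives
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )

def is_noindex(html):
    """True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Hypothetical print page used only as sample input
print_page = (
    '<html><head><meta name="robots" content="noindex, follow"></head>'
    "<body>Print view</body></html>"
)
```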
Product-Only Pages
Similar-looking product pages are common among online stores. Typically they are created from a single template. Often two different product pages share a description that varies in just a few words or numbers, which causes them to be filtered out as duplicate content. This issue has no easy solution. Either you rewrite robots.txt to allow only one product description to be crawled and lose search engine traffic to the rest of them, or you roll up your sleeves and add something different to each product page, like testimonials, which is time consuming or nearly impossible depending on the number of product types in your stock.
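Before deploying such exclusion rules it helps to test them. Here is a minimal sketch using Python's standard urllib.robotparser; the product paths and the rules themselves are hypothetical and stand in for whatever near-duplicate pages a store decides to hide.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: hide two near-duplicate product pages from all crawlers
# and leave the third variant crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /products/widget-blue.html
Disallow: /products/widget-green.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def crawlable(path, agent="*"):
    """True if the given user agent may fetch the path under these rules."""
    return parser.can_fetch(agent, path)
```

Calling crawlable on each product URL before publishing the file shows exactly which pages would drop out of the crawl, which is the traffic trade-off described above.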
How Do Duplicate Content Filters Work?
There are several algorithms in data mining that aim to detect similar text passages. The one claimed to be used by search engines is w-shingling. Each document is represented by its set of shingles: the contiguous subsequences of w tokens (blocks of text) that occur in it. The ratio of the size of the intersection of two documents' shingle sets to the size of their union can be used to determine their resemblance. Another algorithm that can be used for duplicate detection is Levenshtein distance, which counts the single-character edits needed to turn one text into another.
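Both measures are easy to sketch in Python. The code below is a textbook illustration, not the implementation of any search engine: splitting on whitespace, lowercasing and the default shingle width w=4 are assumptions chosen for the example.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-token subsequences of the text."""
    tokens = text.lower().split()
    if len(tokens) < w:
        # Short texts yield a single shingle holding all of their tokens
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=4):
    """Size of the shingle-set intersection divided by the size of the union."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)

def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

A resemblance near 1.0 indicates near-duplicate documents, while a small Levenshtein distance relative to the text length indicates the same thing at the character level. Shingling compares sets and so scales better to whole pages.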
It is natural to expect a duplicate content filter to be able to discover the origin and rank it higher. The simplest way to detect the origin would be to compare dates of indexing, on the assumption that the original source is uploaded and crawled earlier than its copies. But with the advent of RSS feeds new content can be distributed instantaneously, and this approach is no longer valid.
As for the origin's right to be ranked higher, this is not always implemented. J.S. Cassidy, in her article 'Duplicate Content Penalties Problems with Googles Filter' published at SEOChat.com, describes an article distribution experiment. An article was syndicated twice, reaching as many as 19000 copies. After some time Google, Yahoo and MSN purged their indices, leaving just a few of the duplicates. MSN's filter managed not only to discover the origin but also to put it at the top of the search results. Yahoo also discovered the origin, but in the results page for a search on the title of the article the origin's position fluctuated, evidently reflecting the way Yahoo counts relevancy and authority.
To the tester's amusement, Google's refined index did not include the original at all! Evidently Google featured only those copies of the article that it considered relevant and authoritative, with no regard to the original source of the content. I have already mentioned a thread where a similar problem is discussed. Both stories took place in 2005 and early 2006, and so far I have found no evidence that this issue has been resolved.
Danny Wirken
2006-09-09
Title: How And Where Search Engines See Duplicate Content
Introduction
Search engines have become the gateway to information on the Internet. They are so important that websites need to rank well in search engine results pages (SERPs) in order to get noticed. With numerous websites vying for a coveted position among the top 30 results listed in SERPs, more and more website owners are using search engine optimization (SEO) techniques to improve their rankings. People who use SEO know that certain factors can affect your ranking positively and, of course, negatively. Of the negative factors, one of the most well-known is duplicate content.
Search engines are biased against duplicate content. As a matter of fact, some sites do not get listed in SERPs because of this factor. This happens when crawlers do not index sites that they have previously determined to be duplicates of another site. The crawlers skip the duplicate site to be more efficient and save time. Crawlers also do this for another reason: to avoid listing duplicate pages in SERPs and thus pointing users to different sites containing just the same information. Search engines do not want that to happen because it would irritate users, who expect to see different sites behind the different links they click. For similar sites, search engines also usually list just one of the sites and relegate the others to a link that says See related pages. For those that do manage to be listed in the SERPs, the page rank is still usually affected, which in turn affects the site's standing.
Where Search Engines See Duplicate Content
So where do crawlers see this duplicate content? And what kinds of content would they interpret as duplicate? According to an article by William Slawski on Duplicate Content Issues and Search Engines, search engines see duplicate content in the following kinds of web pages:
1. Product descriptions from manufacturers, publishers, and producers reproduced by a number of different distributors in large ecommerce sites.
2. Alternative print pages - This happens when user-friendly website owners offer copies of the same documents in different formats for varied printing options. Although helpful to users, these copies might actually be indexed by crawlers as duplicate pages.
3. Pages that reproduce syndicated RSS feeds through a server side script.
4. Canonicalization issues, where a search engine may see the same page as different pages with different URLs.
5. Pages that serve session IDs to search engines, so that they try to crawl and index the same page under different URLs.
6. Pages that serve multiple data variables through URLs, so that they crawl and index the same page under different URLs.
7. Pages that share too many common elements, or where those are very similar from one page to another, including title, meta descriptions, headings, navigation, and text that is shared globally. - This is common for company websites that insist on having their logo, description, etc. on every page of their website.
8. Copyright infringement - Plagiarism is of course a good reason for not being indexed. The problem is that crawlers cannot distinguish the original from the duplicate and might mistakenly filter out the original instead.
9. Use of the same or very similar pages on different subdomains or different country top level domains (TLDs).
10. Article syndication - Some writers allow their articles to be published on other websites as long as they are given credit for their work. The problem arises when the crawler sees the original article as the duplicate and opts to index the duplicate page instead, or at least gives it a higher rating.
11. Mirrored sites - Mirrored sites are used to handle the traffic of a very popular site. Mirror sites have a good chance of being ignored by web crawlers and so won't be indexed.
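Items 5 and 6 in the list above, session IDs and reordered URL variables, can often be neutralized by normalizing URLs before they are linked or compared. The following Python sketch shows one way to do it. The session parameter names are assumptions (real sites use many different ones), and the function is an illustration rather than a description of what any crawler does.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameter names that carry session state rather than content
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}

def normalize(url):
    """Drop session parameters, sort the rest, and lowercase the host."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    query.sort()  # a=1&b=2 and b=2&a=1 become the same string
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(query), "")
    )
```

Two URLs that normalize to the same string point at the same page, so a site can emit only the normalized form in its own links.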
How Search Engines See Duplicate Content
There are many methods employed by different search engines to determine pages with duplicate content. The methods differ in many ways, from the concept to the algorithms and, of course, their effectiveness. Search engines are, however, all finding new ways to improve their methods for detecting duplicate content, as seen in the patents filed by search engine companies like AltaVista, Microsoft Corporation and Google, and by other bodies like Digital Equipment Corporation and even the Regents of the University of California.
The different patents include methods for detecting query-specific duplicate documents, detecting duplicate and near-duplicate files, clustering closely resembling data objects, identifying near-duplicate pages in a hyperlinked database, indexing duplicate database records using a full-record fingerprint, indexing duplicate records of information of a database, utilizing information redundancy to improve text searches, detecting and summarizing document similarity within large document sets, and finding mirrored hosts by analyzing URLs.
Each method is unique and interesting in its approach. The methods vary greatly, from generating fingerprints for records to using query-relevant information to limit the portion of the documents to be compared. Discussing each method would be interesting and would shed light on how different search engines approach the problem. The new methods are all innovative, and if some of them were used in concert with each other, that would surely improve a search engine's ability to detect duplicate documents. However, since the patent holders are competing companies, collaboration between them is unlikely.
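To give a feel for what a record fingerprint is, here is a generic Python sketch that hashes normalized text and groups records with identical digests. It does not reproduce any of the patented methods. The normalization steps, the choice of SHA-256 and the function names are all assumptions made for illustration, and the approach only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib
import re

def fingerprint(text):
    """SHA-256 digest of the text after whitespace and case normalization."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def group_duplicates(records):
    """Return lists of record ids whose texts share the same fingerprint."""
    groups = {}
    for record_id, text in records.items():
        groups.setdefault(fingerprint(text), []).append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```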
Conclusion
As search engines further refine their methods for detecting duplicate content, it will become harder for plagiarists to get away with what they do. However, web pages that contain duplicate content for a good reason could suffer as well. Furthermore, since none of the published patents tackles the issue of differentiating original content from its duplicates, refinement of the search engines' methods might mean further trouble for the owners of original content. Because of this, search engines ought to find new methods for distinguishing original content from duplicates, and for recognizing valid duplicate content.
Monica Corral-lorica
2006-08-11
Title: Legitimacy Issue of Duplicate Content in Detail
Many times, the so-called web marketing experts warn us about duplicate content. They stress that duplicate content will trigger a red flag from search engines like Google. Though it is true that duplicate content is one of the many things that search engines abhor, it is equally true that there are cases where duplicate content is acceptable.
When does duplicate content become acceptable? When is it legitimate to have duplicate content? Questions like these have caused several issues to arise. Here are some of the instances where duplicate content can be considered acceptable.
When Is Duplicate Content Acceptable?
• The same product listings on two different sites. If you want to include a product listing on two sites, both of which you own, the search engines may be able to tolerate it.
• If a site has reprinted or copied content from another site, this will be tolerated as long as the copying site has the right to do so and credits the author.
• Some webmasters and website owners like to create two pages for the same item: the standard site page and a printer-friendly page. The two pages have the same content, but this is acceptable.
• For reasons that you may not be able to explain, an odd duplicated page sometimes appears on your site. This happens to some sites and is an honest error.
• Duplicate content may be a product of errors on the site. By errors, I mean unintentional ones.
Reasons Why Search Engines Prevent Duplicate Content
Search engines like Google try to avoid sites with duplicate content - in fact, they don't just avoid them, they want to get rid of them. This is the reason why having sites with duplicate content is something to worry about. Why do the search engines detest sites with duplicate content?
• To prevent duplicate content sites on the internet. Online users will not benefit from browsing different sites with exactly the same content. Replicated sites have exactly the same content, including titles and even code.
• To avoid what is called scraping - a method used to duplicate a particular site. This raises the issue of copyright.
• To discourage the use of generic PLR (private label rights) articles on replicated sites.
Knowing that duplicate content will cause your site to be penalized by the search engines, you must strive to provide your site with unique data and content. This is important not only to avoid penalties but also to build your site's credibility and image.