Let me start off by saying there may be "the right way" to do this, but I have not found it yet. A site such as Digg likely has more advanced techniques for dealing with this.
In my given scenario, a user enters a URL and results are presented. Ideally, each unique URL shows up only once in the database, so that no matter how the URL is entered, the user ends up at the one landing page for that URL without duplicated entries in the database. Duplicate entries make the system less efficient, since time is spent on multiple instances of the same URL instead of just one.
For example, I may enter google.com while another person enters http://www.google.com/. These two inputs have the same intention, but they are not the same string. With many websites opting to remove the www. prefix through server-side scripting, this can start to get tricky. The variable front elements are typically http:// and www. Some websites use a subdomain and do not accept a www. prefix at all; http://ps3.ign.com is one such example.
Another example is appended modifiers. In a URL such as http://www.nytimes.com/2010/01/09/nyregion/09gis.html?partner=rss&emc=rss, the query string ?partner=rss&emc=rss is not necessary for a user to view the page, yet it can cause duplication in the database.
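To make the variation concrete, here is a minimal sketch (not part of my routing script, just an illustration) of how the query string and www. prefix could be stripped with PHP's parse_url(); the function name strip_extras is hypothetical.

```php
<?php
// Illustration only: expose the "core" of a URL by dropping the query
// string and the www. prefix. parse_url() splits a URL into components.
function strip_extras($url) {
    $parts = parse_url($url);
    $host  = isset($parts['host']) ? preg_replace('/^www\./i', '', $parts['host']) : '';
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    return $host . $path;
}

echo strip_extras('http://www.nytimes.com/2010/01/09/nyregion/09gis.html?partner=rss&emc=rss');
// nytimes.com/2010/01/09/nyregion/09gis.html
```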
Unfortunately, I came to accept that duplicate entries are inevitable and a fact of life. As such, my goal is to prevent and fix duplicates where I can, not to eliminate them entirely.
The way that I addressed this was to do a lookup of variants of user input. Extra energy spent? Yes. Duplicates reduced? Hopefully.
The preventative measure:
So for a given input, I would concatenate several strings and check for matches. I had the script build a mix of variants: http://www. in front, http:// in front, and / in back. These were all run through exact matches against the relevant table column. If any of them returned a positive, exact match, I would route the input request accordingly.
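Below is a minimal sketch of that lookup, assuming a PDO connection and a urls table with id and url columns; the table and column names, and the function name find_existing_url, are placeholders for illustration, not my exact script.

```php
<?php
// Sketch of the preventative lookup. Assumes a PDO connection ($db) and a
// table named `urls` with `id` and `url` columns (placeholder names).
function find_existing_url(PDO $db, $input) {
    $input = trim($input);

    // Build the mix of candidate strings: http://www. in front,
    // http:// in front, and a trailing / in back.
    $bases = array($input);
    if (!preg_match('#^https?://#i', $input)) {
        $bases[] = 'http://' . $input;
        $bases[] = 'http://www.' . $input;
    }
    $variants = array();
    foreach ($bases as $base) {
        $variants[] = rtrim($base, '/');
        $variants[] = rtrim($base, '/') . '/';
    }
    $variants = array_unique($variants);

    // Run each variant through an exact match on the url column.
    $stmt = $db->prepare('SELECT id, url FROM urls WHERE url = ? LIMIT 1');
    foreach ($variants as $candidate) {
        $stmt->execute(array($candidate));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        if ($row) {
            return $row; // positive exact match: route the request here
        }
    }
    return null; // no match: treat the input as a new URL
}
// Usage: $existing = find_existing_url($db, $userInput);
```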
The remedial measure:
To deal with a duplicate entry in the table, I created an additional column in the database. By default, this redirect value is null. If the value is set, the routing script redirects any request for the duplicated page to the entry it points to.
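Here is a minimal sketch of that routing check, again assuming PDO and a urls table, with the extra column called redirect_id (a placeholder name):

```php
<?php
// Sketch of the remedial redirect. Assumes a `urls` table with an added
// `redirect_id` column that is NULL by default (placeholder names).
function resolve_entry(PDO $db, $id) {
    $stmt = $db->prepare('SELECT id, url, redirect_id FROM urls WHERE id = ?');
    $stmt->execute(array($id));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);

    // If the redirect value is set, this entry is a known duplicate:
    // hand back the entry it points to instead.
    if ($row && $row['redirect_id'] !== null) {
        $stmt->execute(array($row['redirect_id']));
        return $stmt->fetch(PDO::FETCH_ASSOC);
    }
    return $row;
}
// Usage in the routing script: $entry = resolve_entry($db, $requested_id);
```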
With any given URL, duplicates are highly likely. Any URL may have several variants that are all valid for use (google.com vs. http://www.google.com). Also, many pages have appended $_GET values (such as ?partner=rss&emc=rss). Then, the recent mass resurgence of URL shortening services (bit.ly) adds another layer of URLs that all redirect to the same page.
It seems to me that duplicated URLs in a data set where each URL is intended to represent a unique page are inevitable given a large enough collection.