What is the best way to handle a URL for storage and indexing in SQL Server 2005?
I have a WebPage table that stores metadata and content about web pages. I also have many other tables related to the WebPage table. They all use the URL as a key.
The problem is that URLs can be very large, and using them as a key makes the indexes larger and slower. I don't know by how much, but I have read many times that indexing on large fields should be avoided. Assuming the URL is nvarchar(400), these are huge fields to use as a primary key.
What are the alternatives?
How much pain would it be to keep using the URL as the key rather than a smaller field?
I have looked at giving the WebPage table an identity column and using that as its primary key. This would make all the related indexes smaller and more efficient, but it makes importing data a bit of a pain: every import into a related table must first look up the identity value for the URL before it can insert rows.
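To make the import cost concrete, here is a rough sketch of the identity-column approach as I understand it. It uses Python with SQLite purely as a runnable stand-in for SQL Server 2005, and the names (web_page, page_link, staging, import_links) are made up for illustration, not my real schema. The idea is to load each batch into a staging table keyed by URL, register any URLs that are new, and then resolve all the URL-to-id lookups in one join instead of row by row.

```python
import sqlite3

def create_schema(conn):
    # web_page.id plays the role of the identity column; the URL stays
    # unique but is no longer the key that other tables reference.
    conn.executescript("""
        CREATE TABLE web_page (
            id  INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT NOT NULL UNIQUE
        );
        CREATE TABLE page_link (
            web_page_id INTEGER NOT NULL REFERENCES web_page(id),
            link_text   TEXT
        );
    """)

def import_links(conn, rows):
    """Import (url, link_text) pairs, resolving URLs to ids in bulk."""
    cur = conn.cursor()
    # Hypothetical staging table: holds the raw batch, still keyed by URL.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(url TEXT NOT NULL, link_text TEXT)"
    )
    cur.execute("DELETE FROM staging")
    cur.executemany(
        "INSERT INTO staging (url, link_text) VALUES (?, ?)", rows
    )
    # Register URLs that are not in web_page yet.
    cur.execute("""
        INSERT INTO web_page (url)
        SELECT DISTINCT s.url
        FROM staging s
        WHERE NOT EXISTS (SELECT 1 FROM web_page w WHERE w.url = s.url)
    """)
    # One set-based join replaces the per-row "what is the id?" check.
    cur.execute("""
        INSERT INTO page_link (web_page_id, link_text)
        SELECT w.id, s.link_text
        FROM staging s
        JOIN web_page w ON w.url = s.url
    """)
    conn.commit()
```

With this shape the URL is compared only once per batch, during the join against the staging table, and every related table carries just the small integer key.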
I have also played with using a hash of the URL to create a smaller index, but I am still not sure this is the best approach. It would not be a unique index and would be subject to a small number of collisions, so I'm not sure whether a foreign key could be used in this case...
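For reference, this is roughly what I mean by the hash idea, again sketched in Python with SQLite as a stand-in rather than real SQL Server code. The names (web_page_hashed, url_hash, find_page_id) and the choice of CRC32 are assumptions for illustration only. The hash column has a non-unique index, and every lookup also compares the full URL so that a collision cannot return the wrong row.

```python
import sqlite3
import zlib

def url_hash(url):
    # 32-bit CRC of the UTF-8 bytes: small to index, but collisions are possible.
    return zlib.crc32(url.encode("utf-8"))

def create_hashed_schema(conn):
    conn.executescript("""
        CREATE TABLE web_page_hashed (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            url_hash INTEGER NOT NULL,
            url      TEXT NOT NULL
        );
        -- Non-unique index on the small hash instead of the long URL.
        CREATE INDEX ix_web_page_hashed_url_hash
            ON web_page_hashed (url_hash);
    """)

def find_page_id(conn, url):
    # The hash narrows the search via the index; the URL comparison
    # resolves any collisions among the few rows that share the hash.
    row = conn.execute(
        "SELECT id FROM web_page_hashed WHERE url_hash = ? AND url = ?",
        (url_hash(url), url),
    ).fetchone()
    return row[0] if row else None

def add_page(conn, url):
    # Return the existing id if the URL is already stored, else insert it.
    existing = find_page_id(conn, url)
    if existing is not None:
        return existing
    cur = conn.execute(
        "INSERT INTO web_page_hashed (url_hash, url) VALUES (?, ?)",
        (url_hash(url), url),
    )
    return cur.lastrowid
```

In this sketch the hash only speeds up finding a row by URL. Because it is not unique, related tables would still have to reference something else, such as the id column, which is exactly the part I am unsure about.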
There will be millions of web page entries stored in the database, and there will be many batch updates. There will also be quite a few read and aggregation operations on the data.
Any thoughts?