We have an application that stores data in a SQL server database. (We currently support SQL Server 2005 and higher). Our database has over 400 tables. Database structure is not perfect. The biggest problem is that we have many tables with GUIDs (NEWID ()) as the main CLUSTERED Keys. When I asked our chief database architect โwhy?โ, He said, โthis is because of replication.โ Our database must support transactional replication. Initially, all primary keys were INT IDENTITY (1,1) CLUSTERED. But later, when it came to supporting replication, these fields were replaced with UNIQUEIDENTIFIER DEFAULT NEWID (). He said: "Otherwise, it was a replication nightmare." At that time, NEWSEQUENTIALID () was not supported in SQL 7/2000. So now we have tables with the following structure:
CREATE TABLE Table1( Table1_PID uniqueidentifier DEFAULT NEWID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID) ) GO CREATE TABLE Table2( Table2_PID uniqueidentifier DEFAULT NEWID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2 PRIMARY KEY CLUSTERED (Table2_PID), CONSTRAINT FK_Table2_Table1 FOREIGN KEY (Table1_PID) REFERENCES Table1 (Table1_PID) ) GO
All tables actually have many fields (up to 35) and up to 15 nonclustered indexes.
I know that a GUID that is not sequential - like the one that has values โโgenerated on the client (using .NET) OR generated by the NEWID () SQL function (as in our case), is a terribly poor choice of a clustered index of two reasons:
I also know that a GOOD clustering key is as follows:
- unique,
- narrow
- static
- constantly growing
- nonzero
- and fixed width
For more on the reasons for this, watch the following great video: http://technet.microsoft.com/en-us/sqlserver/gg508879.aspx .
So, INT IDENTITY is really the best choice. BIGINT IDENTITY is also good, but as a general rule, an INT with 2 + billion rows should be sufficient for the vast majority of tables.
When our customers began to suffer from fragmentation, it was decided to make primary keys non-group. As a result, these tables were left without a clustered index. In other words, these tables have been converted to HEAPS. I personally don't like this solution because I'm sure heap tables are not part of a good database design. Please check this SQL Server Best Practices article: http://technet.microsoft.com/en-us/library/cc917672.aspx .
We are currently considering two options for improving the database structure:
The first option is to replace DEFAULT NEWID () with DEFAULT NEWSEQUENTIALID () for the main clustered key:
CREATE TABLE Table1_GUID ( Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID) ) GO
The second option is to add an INT IDENTITY column to each table and make it a CLUSTERED UNIQUE index, leaving the primary key NOT grouped. So table 1 will look like this:
CREATE TABLE Table1_INT ( Table1_ID int IDENTITY(1,1) NOT NULL, Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY NONCLUSTERED (Table1_PID), CONSTRAINT UK_Table1 UNIQUE CLUSTERED (Table1_ID) ) GO
Table1_PID will be used for replication (so we left it as PK), while Table1_ID will not be replicated at all.
In short, after we run the tests to figure out which approach is better, we find that both solutions are not very good:
The first approach (Table1_GUID) revealed the following disadvantages: although the sequential GUID is definitely much better than regular random GUIDs, they are still four times as large as INT (16 vs 4 bytes) and this is a factor in our case, because we have many lines in our tables (up to 60 million) and many non-clustered indexes on these tables (up to 15). A clustering key is added to each nonclustered index, so it significantly increases the negative effect from 16 to 4 bytes. More bytes means more pages on disk and in SQL Server RAM, and therefore more disk I / O and more work for SQL Server.
To be more precise, after I entered 25 million rows of real data into each table and then created 15 non-clustered indexes for each table, I saw a big difference in the space used by the tables:
EXEC sp_spaceused 'Table1_GUID' -- 14.85 GB EXEC sp_spaceused 'Table1_INT' -- 11.68 GB
In addition, the test showed that the INSERTs in Table1_GUID were slightly slower than in Table1_INT.
The second approach (Table1_INT) showed that in most queries (SELECT) joining two tables in Table1_INT.Table1_PID = Table2_INT.Table1_PID, the execution plan got worse because an additional Key Lookup operator appeared.
Now the question is: I believe that there should be a better solution to our problem. If you could recommend me something or point to a good resource, I would really appreciate it. Thank you in advance.
Update
Let me give an example of a SELECT statement where an additional Key Lookup statement appears:
--Create 2 tables with int IDENTITY(1,1) as CLUSTERED KEY. --These tables have one-to-many relationship. CREATE TABLE Table1_INT ( Table1_ID int IDENTITY(1,1) NOT NULL, Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1_INT PRIMARY KEY NONCLUSTERED (Table1_PID), CONSTRAINT UK_Table1_INT UNIQUE CLUSTERED (Table1_ID) ) GO CREATE TABLE Table2_INT( Table2_ID int IDENTITY(1,1) NOT NULL, Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2_INT PRIMARY KEY NONCLUSTERED (Table2_PID), CONSTRAINT UK_Table2_INT UNIQUE CLUSTERED (Table2_ID), CONSTRAINT FK_Table2_Table1_INT FOREIGN KEY (Table1_PID) REFERENCES Table1_INT (Table1_PID) ) GO
And create the other two tables for comperison:
--Create the same 2 tables, BUT with uniqueidentifier NEWSEQUENTIALID() as CLUSTERED KEY. CREATE TABLE Table1_GUID ( Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1_GUID PRIMARY KEY CLUSTERED (Table1_PID), ) GO CREATE TABLE Table2_GUID( Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2_GUID PRIMARY KEY CLUSTERED (Table2_PID), CONSTRAINT FK_Table2_Table1_GUID FOREIGN KEY (Table1_PID) REFERENCES Table1_GUID (Table1_PID) ) GO
Now run the following select statements and look at the execution plan to compare:
SELECT T1.Field1, T2.FieldN FROM Table1_INT T1 INNER JOIN Table2_INT T2 ON T1.Table1_PID = T2.Table1_PID; SELECT T1.Field1, T2.FieldN FROM Table1_GUID T1 INNER JOIN Table2_GUID T2 ON T1.Table1_PID = T2.Table1_PID;
