What data type is optimal for a clustered index on a table published using transactional replication?

Question

What data type is optimal for a clustered index on a table published using transactional replication?

We have an application that stores data in a SQL server database. (We currently support SQL Server 2005 and higher). Our database has over 400 tables. Database structure is not perfect. The biggest problem is that we have many tables with GUIDs (NEWID ()) as the main CLUSTERED Keys. When I asked our chief database architect “why?”, He said, “this is because of replication.” Our database must support transactional replication. Initially, all primary keys were INT IDENTITY (1,1) CLUSTERED. But later, when it came to supporting replication, these fields were replaced with UNIQUEIDENTIFIER DEFAULT NEWID (). He said: "Otherwise, it was a replication nightmare." At that time, NEWSEQUENTIALID () was not supported in SQL 7/2000. So now we have tables with the following structure:

CREATE TABLE Table1( Table1_PID uniqueidentifier DEFAULT NEWID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID) ) GO CREATE TABLE Table2( Table2_PID uniqueidentifier DEFAULT NEWID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2 PRIMARY KEY CLUSTERED (Table2_PID), CONSTRAINT FK_Table2_Table1 FOREIGN KEY (Table1_PID) REFERENCES Table1 (Table1_PID) ) GO

All tables actually have many fields (up to 35) and up to 15 nonclustered indexes.

I know that a GUID that is not sequential - like the one that has values generated on the client (using .NET) OR generated by the NEWID () SQL function (as in our case), is a terribly poor choice of a clustered index of two reasons:

fragmentation
the size

I also know that a GOOD clustering key is as follows:

unique,
narrow
static
constantly growing
nonzero
and fixed width

For more on the reasons for this, watch the following great video: http://technet.microsoft.com/en-us/sqlserver/gg508879.aspx .

So, INT IDENTITY is really the best choice. BIGINT IDENTITY is also good, but as a general rule, an INT with 2 + billion rows should be sufficient for the vast majority of tables.

When our customers began to suffer from fragmentation, it was decided to make primary keys non-group. As a result, these tables were left without a clustered index. In other words, these tables have been converted to HEAPS. I personally don't like this solution because I'm sure heap tables are not part of a good database design. Please check this SQL Server Best Practices article: http://technet.microsoft.com/en-us/library/cc917672.aspx .

We are currently considering two options for improving the database structure:

The first option is to replace DEFAULT NEWID () with DEFAULT NEWSEQUENTIALID () for the main clustered key:

 CREATE TABLE Table1_GUID ( Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY CLUSTERED (Table1_PID) ) GO

The second option is to add an INT IDENTITY column to each table and make it a CLUSTERED UNIQUE index, leaving the primary key NOT grouped. So table 1 will look like this:

 CREATE TABLE Table1_INT ( Table1_ID int IDENTITY(1,1) NOT NULL, Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1 PRIMARY KEY NONCLUSTERED (Table1_PID), CONSTRAINT UK_Table1 UNIQUE CLUSTERED (Table1_ID) ) GO

Table1_PID will be used for replication (so we left it as PK), while Table1_ID will not be replicated at all.

In short, after we run the tests to figure out which approach is better, we find that both solutions are not very good:

The first approach (Table1_GUID) revealed the following disadvantages: although the sequential GUID is definitely much better than regular random GUIDs, they are still four times as large as INT (16 vs 4 bytes) and this is a factor in our case, because we have many lines in our tables (up to 60 million) and many non-clustered indexes on these tables (up to 15). A clustering key is added to each nonclustered index, so it significantly increases the negative effect from 16 to 4 bytes. More bytes means more pages on disk and in SQL Server RAM, and therefore more disk I / O and more work for SQL Server.

To be more precise, after I entered 25 million rows of real data into each table and then created 15 non-clustered indexes for each table, I saw a big difference in the space used by the tables:

 EXEC sp_spaceused 'Table1_GUID' -- 14.85 GB EXEC sp_spaceused 'Table1_INT' -- 11.68 GB

In addition, the test showed that the INSERTs in Table1_GUID were slightly slower than in Table1_INT.

The second approach (Table1_INT) showed that in most queries (SELECT) joining two tables in Table1_INT.Table1_PID = Table2_INT.Table1_PID, the execution plan got worse because an additional Key Lookup operator appeared.

Now the question is: I believe that there should be a better solution to our problem. If you could recommend me something or point to a good resource, I would really appreciate it. Thank you in advance.

Update

Let me give an example of a SELECT statement where an additional Key Lookup statement appears:

 --Create 2 tables with int IDENTITY(1,1) as CLUSTERED KEY. --These tables have one-to-many relationship. CREATE TABLE Table1_INT ( Table1_ID int IDENTITY(1,1) NOT NULL, Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1_INT PRIMARY KEY NONCLUSTERED (Table1_PID), CONSTRAINT UK_Table1_INT UNIQUE CLUSTERED (Table1_ID) ) GO CREATE TABLE Table2_INT( Table2_ID int IDENTITY(1,1) NOT NULL, Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2_INT PRIMARY KEY NONCLUSTERED (Table2_PID), CONSTRAINT UK_Table2_INT UNIQUE CLUSTERED (Table2_ID), CONSTRAINT FK_Table2_Table1_INT FOREIGN KEY (Table1_PID) REFERENCES Table1_INT (Table1_PID) ) GO

And create the other two tables for comperison:

 --Create the same 2 tables, BUT with uniqueidentifier NEWSEQUENTIALID() as CLUSTERED KEY. CREATE TABLE Table1_GUID ( Table1_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table1_GUID PRIMARY KEY CLUSTERED (Table1_PID), ) GO CREATE TABLE Table2_GUID( Table2_PID uniqueidentifier DEFAULT NEWSEQUENTIALID() NOT NULL, Table1_PID uniqueidentifier NULL, Field1 varchar(50) NULL, FieldN varchar(50) NULL, CONSTRAINT PK_Table2_GUID PRIMARY KEY CLUSTERED (Table2_PID), CONSTRAINT FK_Table2_Table1_GUID FOREIGN KEY (Table1_PID) REFERENCES Table1_GUID (Table1_PID) ) GO

Now run the following select statements and look at the execution plan to compare:

 SELECT T1.Field1, T2.FieldN FROM Table1_INT T1 INNER JOIN Table2_INT T2 ON T1.Table1_PID = T2.Table1_PID; SELECT T1.Field1, T2.FieldN FROM Table1_GUID T1 INNER JOIN Table2_GUID T2 ON T1.Table1_PID = T2.Table1_PID;

+4

sql sql-server transactional-replication

Alex May 18, '13 at 17:10

source share

1 answer

marc_s · Answer 1 · 2013-05-18T17:16:53+0000

I use INT IDENTITY for most of my core and clustering keys.

You need to split the primary key , which is a logical construct - it uniquely identifies your lines, it must be unique and stable and NOT NULL . A GUID works just as well for a primary key, as it is guaranteed to be unique. A GUID as your primary key is a good choice if you are using SQL Server replication, as in this case you need to uniquely identify the GUID column.

Key in SQL Server, it’s a physical construct that is used to physically organize the data and it’s much harder to get right. Typically, Queen of Indexing on SQL Server, Kimberly Tripp, also requires that a good clustering key be unique, stable, as narrow as possible, and ideally constantly increasing (which corresponds to INT IDENTITY ).

See her indexing articles here:

and also see Jimmy Nilsson Cost GUID as primary key

A GUID is a really bad choice for a clustering key because it is wide, completely random, and therefore leads to poor index fragmentation and poor performance. In addition, clustering key strings are also stored in each record of each non-clustered (optional) index, so you really want to keep it small - the GUID is 16 bytes and INT . 4 bytes, and with several non-clustered indexes and several million rows, this makes a HUGE difference.

In SQL Server, your default primary key is your clustering key, but this is optional. You can easily use the GUID as your primary key with a non-clustered key and INT IDENTITY as the clustering key - it just understands this a bit.

What data type is optimal for a clustered index on a table published using transactional replication?

More articles: