Efficiently finding unique values in a database table

I have a database table with a very large number of rows. The table holds messages that are logged by the system. Each message has a message type, which is stored in its own column of the table. I am writing a website for querying this message log. If I want to search by message type, then ideally I would like a drop-down list containing the message types that have appeared in the database. Message types can change over time, so I cannot hard-code them in the drop-down list; I will need some kind of lookup. Iterating over the entire contents of the table to find the unique message types is obviously a poor approach, but being a novice with databases, I am asking here for a better way. Perhaps a separate lookup table would be a better idea, one that the database occasionally updates with the list of unique message types and that I can use to populate the drop-down list.

Any suggestions would be highly appreciated.

I am using ASP.NET MVC and SQL Server 2005.

+4
9 answers

Use a separate lookup table, and store only an identifier for the message type in your log table. This will reduce the size of the log and make it more efficient. It would also normalize your data.

+9

Yes, I would definitely go with a separate lookup table. Then you can fill it using something like:

INSERT TypeLookup (Type) SELECT DISTINCT Type FROM BigMassiveTable 

Then you can periodically run a top-up job that copies into the lookup table any new types from your main table that are not in it yet.
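A minimal sketch of such a top-up job is below. It is written in Python against an in-memory SQLite database purely so that it is self-contained and runnable; the question is about SQL Server, where the same INSERT ... SELECT DISTINCT ... WHERE NOT EXISTS pattern should carry over. The table and column names are the ones from the statement above; the function name refresh_type_lookup and the sample rows are made up for illustration.

```python
import sqlite3


def refresh_type_lookup(conn):
    """Copy types missing from TypeLookup and return how many were added."""
    before = conn.total_changes
    conn.execute(
        """
        INSERT INTO TypeLookup (Type)
        SELECT DISTINCT b.Type
        FROM BigMassiveTable AS b
        WHERE NOT EXISTS (SELECT 1 FROM TypeLookup AS t WHERE t.Type = b.Type)
        """
    )
    conn.commit()
    return conn.total_changes - before


# Hypothetical schema and data, just enough to exercise the job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BigMassiveTable (Id INTEGER PRIMARY KEY, Type TEXT NOT NULL)")
conn.execute("CREATE TABLE TypeLookup (Type TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO BigMassiveTable (Type) VALUES (?)",
    [("Error",), ("Info",), ("Error",)],
)

first_run = refresh_type_lookup(conn)   # initial fill of the lookup table
conn.execute("INSERT INTO BigMassiveTable (Type) VALUES ('Warning')")
second_run = refresh_type_lookup(conn)  # later run picks up only the new type
types = [row[0] for row in conn.execute("SELECT Type FROM TypeLookup ORDER BY Type")]
```

Each run still has to read the main table, so this only pays off if the job runs much less often than the drop-down is displayed.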

+5

 SELECT DISTINCT message_type FROM message_log

is the simplest way, but not a very efficient one.

If you already have a table listing the types that may appear in the log (message_types here), use this:

 SELECT message_type FROM message_types mt WHERE message_type IN ( SELECT message_type FROM message_log ) 

This will be more efficient if message_log.message_type is indexed.

If you do not have such a table but want to create it, and message_log.message_type is indexed, use a recursive CTE to emulate a loose index scan:

 WITH rows (message_type) AS
 (
     SELECT MIN(message_type) AS mm
     FROM message_log
     UNION ALL
     SELECT message_type
     FROM (
         SELECT mn.message_type,
                ROW_NUMBER() OVER (ORDER BY mn.message_type) AS rn
         FROM rows r
         JOIN message_log mn
           ON mn.message_type > r.message_type
         WHERE r.message_type IS NOT NULL
     ) q
     WHERE rn = 1
 )
 SELECT message_type
 FROM rows r
 OPTION (MAXRECURSION 0)

+2

I just wanted to state the obvious: normalize the data.

 message_types
 message_type | message_type_name

 messages
 message_id | message_type | message_type_name

Then you can do without any cached DISTINCT.

For the drop-down menu:

 SELECT * FROM message_types 

For searching:

 SELECT * FROM messages WHERE message_type = ?

 SELECT m.*, mt.message_type_name
 FROM messages AS m
 JOIN message_types AS mt ON (m.message_type = mt.message_type)

I'm not sure why you would want a cached DISTINCT, which you would have to keep updating, when you can tweak the schema slightly and have one with referential integrity (RI).

+1

Create an index on the message type column:

 CREATE INDEX IX_Messages_MessageType ON Messages (MessageType) 

Then, to get a list of unique message types, you run:

 SELECT DISTINCT MessageType FROM Messages ORDER BY MessageType 

Because the index is physically sorted in MessageType order, SQL Server can very quickly and efficiently scan the index by selecting a list of unique message types.

This performs well; it is the kind of thing SQL Server is good at.


Admittedly, you can save some space by having a separate message type table. And if you only display a few messages at a time, then the bookmark lookup, as it joins back to the MessageTypes table, will not be a problem. But if you display hundreds or thousands of messages at a time, the join to MessageTypes can become quite expensive and unnecessary, and it will be faster to have the MessageType stored with the message.

But I would have no problem with creating an index on the MessageType column and selecting distinct. SQL Server loves that sort of thing. If you find that it puts a real load on your server once you are getting dozens of hits per second, then follow the other suggestion and cache the list in memory.

My personal solution would be:

  • create index
  • select distinct

and if I still had problems

  • a cache in memory that expires after 30 seconds.
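The last step could look roughly like the sketch below: a tiny in-memory cache that reloads the type list once the cached copy is 30 seconds old. It is in Python rather than ASP.NET only to keep it short and runnable, and the class name TtlCache, the loader function and the fake clock are all invented for the example. In a real site the loader would run the SELECT DISTINCT query, and the framework's own caching facility would most likely be the better home for this.

```python
import time


class TtlCache:
    """Holds one loader result and reloads it once it is ttl_seconds old."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


# Demonstration with a stand-in loader and a controllable clock.
calls = []


def load_message_types():
    calls.append(1)  # count how often the "database" is actually hit
    return ["Error", "Info"]


fake_now = [0.0]
cache = TtlCache(load_message_types, ttl_seconds=30.0, clock=lambda: fake_now[0])

cache.get()                      # first request loads the list
fake_now[0] = 29.0
cache.get()                      # still fresh, served from memory
loads_within_ttl = len(calls)
fake_now[0] = 30.0
cached_value = cache.get()       # expired, so the loader runs again
loads_after_expiry = len(calls)
```

This sketch is not thread-safe; under concurrent requests several of them may reload at once when the entry expires, which is harmless here but worth knowing.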

Regarding the normalized versus denormalized question: normalization saves space at the cost of CPU when joins are performed constantly. But the logical point of normalization is to avoid storing duplicate data, which can lead to inconsistent data.

Are you planning to change the text of a message type, such that, if it were stored with the messages, you would have to update every row?

Or is there something to be said for recording that, at the time of the message, the message type was “Client response was requested”?

+1

Have you considered an indexed view? Its result set is materialized and persisted in storage, so the overhead of the lookup is separated from the rest of what you are trying to do.

SQL Server takes care of automatically updating the view whenever there is a data change that it thinks would alter the contents of the view, so in this regard it is less flexible than Oracle's materialized views.

0

MessageType should be a foreign key in the main table, referencing a definition table that contains the message type codes and descriptions. This will greatly improve the performance of your lookup.

Something like:

 DECLARE @MessageTypes TABLE(
     MessageTypeCode VARCHAR(10),
     MessageTypeDesciption VARCHAR(100)
 )

 DECLARE @Messages TABLE(
     MessageTypeCode VARCHAR(10),
     MessageValue VARCHAR(MAX),
     MessageLogDate DATETIME,
     AdditionalNotes VARCHAR(MAX)
 )

With this design, your lookup only has to query MessageTypes.

0

As others have said, create a separate message type table. When you add a record to the message table, check whether its message type already exists in the message type table. If not, add it. In either case, store the identifier from the message type table in the message table. This gives you normalized data. Yes, it costs a little extra time when you add a record, but searching should be more efficient.
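A rough sketch of that insert path, assuming a hypothetical schema with message_type(message_type_id, name) and message(message_id, message_type_id, body). It uses Python with an in-memory SQLite database only so that it runs on its own; the function names are invented, and on SQL Server the same logic would more likely live in a stored procedure.

```python
import sqlite3


def get_or_create_type_id(conn, type_name):
    """Return the id of type_name, inserting it into message_type if absent."""
    row = conn.execute(
        "SELECT message_type_id FROM message_type WHERE name = ?", (type_name,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO message_type (name) VALUES (?)", (type_name,))
    return cur.lastrowid


def log_message(conn, type_name, body):
    """Store a message with a reference to its type rather than the type text."""
    type_id = get_or_create_type_id(conn, type_name)
    conn.execute(
        "INSERT INTO message (message_type_id, body) VALUES (?, ?)", (type_id, body)
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE message_type ("
    "message_type_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE message ("
    "message_id INTEGER PRIMARY KEY, "
    "message_type_id INTEGER NOT NULL REFERENCES message_type (message_type_id), "
    "body TEXT)"
)

log_message(conn, "Error", "disk full")
log_message(conn, "Info", "service started")
log_message(conn, "Error", "disk still full")

type_count = conn.execute("SELECT COUNT(*) FROM message_type").fetchone()[0]
message_count = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
```

The drop-down then only needs SELECT name FROM message_type. With several concurrent writers, two of them could try to insert the same new type at once; the UNIQUE constraint would reject the second, which the caller would need to handle.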

If there are many more inserts than queries, and if the message type is short, a completely different approach is to still create a separate message type table, but not to reference it when inserting, and to update it only lazily, on demand.

Namely: (a) Include a timestamp in each message entry. (b) Keep a list of message types found the last time you checked. (c) Each time you check, search for new message types added since the last time, for example:

 create table temp_new_types as
   (select distinct message_type
    from message
    where timestamp > last_type_check);

 insert into message_type_list (message_type)
   select message_type
   from temp_new_types
   where message_type not in (select message_type from message_type_list);

 drop table temp_new_types;

Then save the timestamp of this check somewhere so that you can use it next time.
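Put together, the lazy check might look something like the following sketch. It uses Python with an in-memory SQLite database so that it can be run as-is, folds the temporary table into a single INSERT ... SELECT, and uses plain integers as timestamps. The column name logged_at and the function name refresh_new_types are assumptions made for the example, not part of the scheme above.

```python
import sqlite3


def refresh_new_types(conn, last_type_check):
    """Add types logged after last_type_check; return the new checkpoint."""
    newest = conn.execute("SELECT MAX(logged_at) FROM message").fetchone()[0]
    conn.execute(
        """
        INSERT INTO message_type_list (message_type)
        SELECT DISTINCT message_type
        FROM message
        WHERE logged_at > ?
          AND message_type NOT IN (SELECT message_type FROM message_type_list)
        """,
        (last_type_check,),
    )
    conn.commit()
    # Keep the old checkpoint if the message table is still empty.
    return newest if newest is not None else last_type_check


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (logged_at INTEGER NOT NULL, message_type TEXT NOT NULL)")
conn.execute("CREATE TABLE message_type_list (message_type TEXT PRIMARY KEY)")

conn.executemany("INSERT INTO message VALUES (?, ?)", [(1, "Error"), (2, "Info")])
checkpoint = refresh_new_types(conn, 0)

conn.executemany("INSERT INTO message VALUES (?, ?)", [(3, "Info"), (4, "Warning")])
checkpoint = refresh_new_types(conn, checkpoint)

known_types = [
    row[0]
    for row in conn.execute("SELECT message_type FROM message_type_list ORDER BY message_type")
]
```

An index on the timestamp column is what keeps the check cheap. Messages that arrive later with exactly the checkpoint's timestamp would be missed by the strict comparison, so a real version would need to decide how to treat that boundary.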

0

The answer is to use DISTINCT, but the best solution differs depending on the size of the table. Thousands of rows? Millions? Billions? More? Those call for very different solutions.

0

Source: https://habr.com/ru/post/1301020/

