Working with a lot of text strings

When launched, my project will collect a large number of rows of text blocks (about 20 thousand, and the largest number I have seen is about 200 thousand) over a short period of time and store them in a relational database. Each row of text is relatively small, averaging about 15 short lines (about 300 characters). The current implementation is C# (VS2008) on .NET 3.5, and the DBMS is MS SQL Server 2005.

Performance and storage space are both important to the project, but performance is the priority. I am looking for answers to the following questions:

  • Should I compress the text before saving it to the database, or can I let SQL Server worry about keeping the storage compact?
  • Do you know what the best compression algorithm or library to use in this context would be, the one that gives me the best performance? I am currently just using the standard GZip in the .NET Framework (see the sketch after this list).
  • Do you have any other recommendations for handling this? I welcome out-of-the-box suggestions, as long as they can be implemented within the .NET Framework. (This is a large project, and this requirement is only a small part of it.)
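
For reference, here is a minimal sketch of the "standard GZip" approach I mean, using GZipStream from System.IO.Compression as it ships with .NET 3.5 (the helper class and method names are my own):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class GzipHelper
    {
        // Compress a string to a byte array (e.g. for a varbinary column).
        public static byte[] Compress(string text)
        {
            byte[] raw = Encoding.UTF8.GetBytes(text);
            using (MemoryStream buffer = new MemoryStream())
            {
                using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
                {
                    gzip.Write(raw, 0, raw.Length);
                }   // the GZipStream must be closed before the buffer is read back
                return buffer.ToArray();
            }
        }

        // Decompress back to the original string (no Stream.CopyTo in .NET 3.5,
        // hence the manual read loop).
        public static string Decompress(byte[] compressed)
        {
            using (MemoryStream input = new MemoryStream(compressed))
            using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
            using (MemoryStream output = new MemoryStream())
            {
                byte[] chunk = new byte[4096];
                int read;
                while ((read = gzip.Read(chunk, 0, chunk.Length)) > 0)
                    output.Write(chunk, 0, read);
                return Encoding.UTF8.GetString(output.ToArray());
            }
        }
    }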

EDIT: I will keep adding to this list to clarify the points raised.

  • I do not need to index or search this text. I just need to retrieve each block at a later stage, by its primary key, to display it.
  • I have a working solution implemented as described above, and SQL Server has no problem handling it. The program will run quite frequently and must work against a large data set, so you can imagine the size growing very quickly; every optimization I can make will help.
+4
7 answers

Your strings average 300 characters each. That is either 300 or 600 bytes, depending on whether you use Unicode. Say you use a varchar(4000) column at (on average) 300 bytes each.

Then you have up to 200,000 of them to store in the database.

That is less than 60 MB of storage (200,000 × 300 bytes = 60,000,000 bytes). In the land of databases that is, frankly, peanuts. 60 GB is what I would call a "medium" database.

Even thinking about compression at this point is premature optimization. SQL Server can handle this amount of text without breaking a sweat. Unless you have system constraints you have not mentioned, I would not do it until you actually start to see performance problems, and even then the cause will most likely be something else, such as a poor indexing strategy.

Also, compressing certain kinds of data, especially very small amounts of it (and 300 bytes is certainly small), can actually give worse results: the "compressed" data can end up larger than the original. My guess is that most of the time the compressed size will be very close to the original size anyway. (The sketch below demonstrates the effect.)
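
If you are curious, the effect is easy to demonstrate: compress a block of a few hundred characters with GZipStream and compare the byte counts (a rough sketch; the sample text is made up):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    class CompressionSizeDemo
    {
        static void Main()
        {
            // A small block of made-up, log-like text (a few hundred bytes).
            string sample = string.Join(Environment.NewLine, new string[]
            {
                "2009-03-02 14:07:11 service started on node A",
                "2009-03-02 14:07:12 connection pool initialised",
                "2009-03-02 14:07:12 listening on port 8080",
                "2009-03-02 14:07:13 first request received",
                "2009-03-02 14:07:13 request completed in 42 ms"
            });

            byte[] raw = Encoding.UTF8.GetBytes(sample);
            byte[] compressed;
            using (MemoryStream buffer = new MemoryStream())
            {
                using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
                {
                    gzip.Write(raw, 0, raw.Length);
                }
                compressed = buffer.ToArray();
            }

            // The gzip header and trailer alone cost about 18 bytes, and deflate
            // has little redundancy to exploit in a short string, so the output
            // is often close to (and can exceed) the original size.
            Console.WriteLine("raw: {0} bytes, gzip: {1} bytes",
                raw.Length, compressed.Length);
        }
    }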

SQL Server 2008 can perform page-level compression, which would be a slightly more useful optimization, but you are on SQL Server 2005. So no, definitely do not bother trying to compress individual values or rows; it will not be worth the effort and can actually make things worse.

+2

If you can upgrade to SQL Server 2008, I would recommend just turning on page compression, as described here: http://msdn.microsoft.com/en-us/library/cc280449.aspx

As an example, you can create a compressed table as follows:

    CREATE TABLE T1 (c1 int, c2 nvarchar(50)) WITH (DATA_COMPRESSION = PAGE);

If you cannot use compression in the database then, unfortunately, your strings (at no more than 300 characters each) are not going to be worth compressing with something like System.IO.Compression. I suppose you could try it, though.

+2

It is not clear what you are asking.

As far as performance is concerned: if you compress the strings in memory before storing them in the database, your program will be slower than if you just dump the data straight into a table and let SQL Server worry about it later. The trade-off is that the database will be larger, but a 1 TB hard drive is cheap, so is storage really that big a deal?

Going by your numbers (200K rows at 300 bytes each), you are talking about 60 MB. That is not a very large data set. Have you considered using the bulk copy feature of ADO.NET (http://msdn.microsoft.com/en-us/library/7ek5da1a.aspx)? If all the data is going into one table, this should be pretty fast.

This would be an alternative to having something like EF generate what is essentially 200K individual INSERT statements (see the sketch below).

UPDATE: Here is another example: http://weblogs.sqlteam.com/mladenp/archive/2006/08/26/11368.aspx
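
As a rough, untested sketch of what that could look like (the destination table name and batch size are placeholders):

    using System.Data;
    using System.Data.SqlClient;

    class BulkLoader
    {
        // Push all collected text blocks to the server in one bulk operation
        // instead of ~200K individual INSERTs.
        public static void Load(string connectionString, DataTable blocks)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                connection.Open();
                using (SqlBulkCopy bulk = new SqlBulkCopy(connection))
                {
                    bulk.DestinationTableName = "dbo.TextBlocks"; // placeholder
                    bulk.BatchSize = 5000; // commit in chunks rather than all at once
                    bulk.WriteToServer(blocks); // DataTable columns must match the table
                }
            }
        }
    }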

+1

Compression will consume resources and will, as a rule, degrade performance where a significant share of the time is just local communication and processing.

+1

It sounds like you would find it useful to use the large-value data types (varchar(max) / nvarchar(max) / varbinary(max)).

These data types can store up to 2^31 - 1 bytes of data.

If all your strings are small, though, the returns from compressing them diminish. And without native compression support in SQL Server, they will not be searchable once you compress them.
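
For what it is worth, writing to an nvarchar(max) column from ADO.NET looks something like the sketch below; passing -1 as the parameter size selects MAX (the table and column names are placeholders):

    using System.Data;
    using System.Data.SqlClient;

    class TextBlockWriter
    {
        public static void Insert(string connectionString, int id, string text)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand command = new SqlCommand(
                "INSERT INTO dbo.TextBlocks (Id, Body) VALUES (@id, @body)", connection))
            {
                command.Parameters.Add("@id", SqlDbType.Int).Value = id;
                // Size -1 maps the parameter to nvarchar(max).
                command.Parameters.Add("@body", SqlDbType.NVarChar, -1).Value = text;
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }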

0

I would not worry about compressing them. For strings of this size (300 characters or so), it will be more of a headache than it is worth. Compressing strings takes time (however little), and SQL Server 2005 has no native way of doing it, which means you would have to write something yourself. If you do it in the application, that will hurt your performance; you could write a CLR procedure to do it in the database, but it would still be an extra step before actually using the compressed string in your application (or in any other application that uses it, for that matter).

Database space is cheap, so you would not save much by squeezing all the strings. The bigger concern is holding a large number of strings in your application's memory. If you regularly go back to the database to load some of them, and do not try to cache them all at the same time, I would not worry about it unless you actually see problems.

0

It sounds like you are trying to solve an inherently non-relational problem with a relational database. Why exactly are you using a database? Certainly it can be done, but some problems just do not fit well. TFS shows that you can force an RDBMS to cope by throwing enough hardware at it, but that does not make it a good idea.

0

Source: https://habr.com/ru/post/1304207/

