Why is the MySQL MyISAM table the same size after stripping some data from the VARCHAR column?

I need to reduce the size of a MySQL database. I re-encoded the sources column to strip the ';' and ':' characters (roughly a 10% reduction in characters). Afterwards, however, the table is the same size as before. How is this possible? I use the MyISAM engine.

BTW: unfortunately, I cannot compress the tables using myisampack.

    mysql> INSERT INTO test SELECT protid1, protid2, CS,
           REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606;
    Query OK, 41917131 rows affected (4 min 11.30 sec)
    Records: 41917131  Duplicates: 0  Warnings: 0

    mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows',
           ROUND(DATA_LENGTH/power(2,30), 3) 'data GB',
           ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB'
           from information_schema.TABLES
           WHERE TABLE_NAME IN ('homologs_9606', 'test')
           ORDER BY TABLE_ROWS DESC LIMIT 10;
    +---------------+--------------+---------+----------+
    | name          | million rows | data GB | index GB |
    +---------------+--------------+---------+----------+
    | test          |       41.917 |   0.857 |    1.075 |
    | homologs_9606 |       41.917 |   0.887 |    1.075 |
    +---------------+--------------+---------+----------+
    2 rows in set (0.01 sec)

    mysql> select * from homologs_9606 limit 10;
    +---------+---------+-------+--------------------------------+
    | protid1 | protid2 | CS    | sources                        |
    +---------+---------+-------+--------------------------------+
    | 5635338 | 1028608 | 0.000 | 10:,1                          |
    | 5644385 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
    | 5652325 | 1028611 | 0.947 | 5:1,1;8:0.943,35;10:1,1;11:1,1 |
    | 5641128 | 1028612 | 1.000 | 8:1,10                         |
    | 5636414 | 1028616 | 0.038 | 8:0.038,104;10:,1              |
    | 5636557 | 1028616 | 0.000 | 8:,4                           |
    | 5637419 | 1028616 | 0.011 | 5:,1;8:0.011,91;10:,1          |
    | 5641196 | 1028616 | 0.080 | 5:1,1;8:0.074,94;10:,1;11:,4   |
    | 5642914 | 1028616 | 0.000 | 8:,3                           |
    | 5643778 | 1028616 | 0.056 | 8:0.057,70;10:,1               |
    +---------+---------+-------+--------------------------------+
    10 rows in set (4.55 sec)

    mysql> select * from test limit 10;
    +---------+---------+-------+-------------------------+
    | protid1 | protid2 | CS    | sources                 |
    +---------+---------+-------+-------------------------+
    | 5635338 | 1028608 | 0.000 | 10,1                    |
    | 5644385 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
    | 5652325 | 1028611 | 0.947 | 51,180.943,35101,1111,1 |
    | 5641128 | 1028612 | 1.000 | 81,10                   |
    | 5636414 | 1028616 | 0.038 | 80.038,10410,1          |
    | 5636557 | 1028616 | 0.000 | 8,4                     |
    | 5637419 | 1028616 | 0.011 | 5,180.011,9110,1        |
    | 5641196 | 1028616 | 0.080 | 51,180.074,9410,111,4   |
    | 5642914 | 1028616 | 0.000 | 8,3                     |
    | 5643778 | 1028616 | 0.056 | 80.057,7010,1           |
    +---------+---------+-------+-------------------------+
    10 rows in set (0.00 sec)

    mysql> describe test;
    +---------+------------------+------+-----+---------+-------+
    | Field   | Type             | Null | Key | Default | Extra |
    +---------+------------------+------+-----+---------+-------+
    | protid1 | int(10) unsigned | YES  | PRI | NULL    |       |
    | protid2 | int(10) unsigned | YES  | PRI | NULL    |       |
    | CS      | float(4,3)       | YES  |     | NULL    |       |
    | sources | varchar(100)     | YES  |     | NULL    |       |
    +---------+------------------+------+-----+---------+-------+
    4 rows in set (0.00 sec)

    mysql> describe homologs_9606;
    +---------+------------------+------+-----+---------+-------+
    | Field   | Type             | Null | Key | Default | Extra |
    +---------+------------------+------+-----+---------+-------+
    | protid1 | int(10) unsigned | NO   | PRI | 0       |       |
    | protid2 | int(10) unsigned | NO   | PRI | 0       |       |
    | CS      | float(4,3)       | YES  |     | NULL    |       |
    | sources | varchar(100)     | YES  |     | NULL    |       |
    +---------+------------------+------+-----+---------+-------+
    4 rows in set (0.00 sec)

EDIT1: Added average column length.

    mysql> select AVG(LENGTH(sources)) from test;
    +----------------------+
    | AVG(LENGTH(sources)) |
    +----------------------+
    |               5.2177 |
    +----------------------+
    1 row in set (10.04 sec)

    mysql> select AVG(LENGTH(sources)) from homologs_9606;
    +----------------------+
    | AVG(LENGTH(sources)) |
    +----------------------+
    |               6.8792 |
    +----------------------+
    1 row in set (9.95 sec)

EDIT2: I managed to shave off a few more MB by making all columns NOT NULL.

    mysql> drop table test;
    Query OK, 0 rows affected (0.42 sec)

    mysql> CREATE TABLE test (
        ->   protid1 INT UNSIGNED NOT NULL DEFAULT '0',
        ->   protid2 INT UNSIGNED NOT NULL DEFAULT '0',
        ->   CS FLOAT(4,3) NOT NULL DEFAULT '0',
        ->   sources VARCHAR(100) NOT NULL DEFAULT '0',
        ->   PRIMARY KEY (protid1, protid2),
        ->   KEY `idx_protid2` (protid2)
        -> ) ENGINE=MyISAM CHARSET=ascii;
    Query OK, 0 rows affected (0.06 sec)

    mysql> INSERT INTO test SELECT protid1, protid2, CS,
           REPLACE(REPLACE(sources, ':', ''), ';', '') FROM homologs_9606;
    Query OK, 41917131 rows affected (2 min 7.84 sec)
    Records: 41917131  Duplicates: 0  Warnings: 0

    mysql> select TABLE_NAME name, ROUND(TABLE_ROWS/1e6, 3) 'million rows',
           ROUND(DATA_LENGTH/power(2,30), 3) 'data GB',
           ROUND(INDEX_LENGTH/power(2,30), 3) 'index GB'
           from information_schema.TABLES
           WHERE TABLE_NAME IN ('homologs_9606', 'test');
    +---------------+--------------+---------+----------+
    | name          | million rows | data GB | index GB |
    +---------------+--------------+---------+----------+
    | homologs_9606 |       41.917 |   0.887 |    1.075 |
    | test          |       41.917 |   0.842 |    1.075 |
    +---------------+--------------+---------+----------+
    2 rows in set (0.02 sec)
2 answers

They are not exactly the same size. Your own query shows that test is about 30 MB smaller than homologs_9606:

    +---------------+--------------+---------+
    | name          | million rows | data GB |
    +---------------+--------------+---------+
    | test          |       41.917 |   0.857 | <-- 0.857 < 0.887
    | homologs_9606 |       41.917 |   0.887 |
    +---------------+--------------+---------+

How much storage should we expect for your table? Let's check MySQL's data type storage requirements:

    INTEGER(10):  4 bytes
    FLOAT(4,3):   4 bytes
    VARCHAR(100): L + 1 bytes

where L is the length of the stored string in bytes, which is usually one byte per character, but can be more if you use a Unicode character set.
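That byte-per-character point can be illustrated directly (Python here, purely to show the encoding side; this is not MySQL internals):

```python
# ASCII text, like the 'sources' values in this question, is one byte per
# character; many non-ASCII characters need more than one byte in UTF-8.
ascii_src = "5:1,1;8:0.943,35"
print(len(ascii_src.encode("utf-8")) == len(ascii_src))  # True for pure ASCII
print(len("é".encode("utf-8")))                          # 2 bytes for one character
```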

On average, each row will therefore need:

    INTEGER + INTEGER + FLOAT + VARCHAR = 4 + 4 + 4 + (L + 1) = L + 13 bytes

We can derive your original average L as (0.887*1024^3 / 41917131) - 13 = 9.72. You say you stripped about 10% of the characters from sources, which means your new L should be about 9.72*0.9 = 8.75. That gives an expected new data size of ((8.75 + 13) * 41917131) / 1024^3 = 0.849 GB.

I suspect that the difference (between 0.849 and 0.857) is because test has two columns defined as NULLable that homologs_9606 does not, but I do not know enough about the MyISAM engine internals to calculate this exactly. I can guess, though! At a minimum you need 1 bit per NULLable column per row to store the NULL state, which in your case means two bits per row, or 2*41917131 = 83834262 bits = 10479283 bytes = 0.010 GB. The total 0.849 + 0.010 = 0.859 slightly overshoots the observed size (about 2 MB too much), but I rounded several times and your 10% is also an estimate, so I'm sure the rest is lost in the rounding.
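The arithmetic above can be replayed in a few lines of plain Python (this just reproduces the estimates in this answer, nothing more):

```python
ROWS = 41_917_131
GB = 2 ** 30

# Per-row cost: INT + INT + FLOAT + VARCHAR = 4 + 4 + 4 + (L + 1) = L + 13 bytes.
orig_L = 0.887 * GB / ROWS - 13         # derive the original average L
new_L = orig_L * 0.9                    # assume ~10% of the characters were stripped
expected_gb = (new_L + 13) * ROWS / GB  # expected new data size

# NULL bookkeeping: at least 1 bit per NULLable column per row (2 columns here).
null_gb = 2 * ROWS / 8 / GB

print(round(orig_L, 2), round(expected_gb, 3), round(null_gb, 3))  # 9.72 0.849 0.01
```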

Another possible reason would be a Unicode character set on sources in test, in which case some characters would use more than one byte each. But since the NULLable columns already seem to account for everything, I don't think that applies to your table.

Summary

  • Your two tables are not the same size; they differ by about 30 MB.
  • The size of your new table is in line with what the storage formula predicts.
  • You can save a bit more space in your new table by making protid1 and protid2 NOT NULL columns.

The "table" is stored in the .MYD file. That file is never shrunk by UPDATEs or DELETEs. SHOW TABLE STATUS (or the equivalent query against information_schema) may show a smaller Data_length, but Data_free will grow instead.

You can shrink the .MYD file by running OPTIMIZE TABLE. But that copies the table, so it requires extra disk space while it runs. And it is very rarely worth doing.

Switching to NOT NULL may not free up space if you have a lot of empty strings: '' still takes 1 or 2 bytes in a VARCHAR because of the length prefix. (And your code may have to handle '' differently than NULL.)

The space taken by each row is actually 1 byte larger than computed above: that byte records whether the row exists or is the start of a hole.

For large text columns, I like the following trick to save space (it applies to both MyISAM and InnoDB): compress the text on the client and store it in a BLOB column (instead of TEXT). For most text this gives about 3:1 shrinkage. It requires a little extra code and CPU time on the client, but it saves a lot of I/O on the server, and the net result is often "faster". I would not bother with it for your column, though; I would only do this for columns longer than, say, 50 characters.
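A minimal sketch of that client-side scheme in Python, assuming zlib as the codec (the helper names are made up; any lossless compressor works):

```python
import zlib

def pack(text: str) -> bytes:
    """Compress a string for storage in a BLOB column."""
    return zlib.compress(text.encode("utf-8"))

def unpack(blob: bytes) -> str:
    """Restore the original string read back from the BLOB."""
    return zlib.decompress(blob).decode("utf-8")

# Repetitive text (like long 'sources'-style strings) compresses well.
sample = "5:1,1;8:0.943,35;10:1,1;11:1,1;" * 20
blob = pack(sample)
print(len(sample), len(blob))  # the compressed form is much smaller
```

The client would then INSERT the bytes into a BLOB column and call unpack() on whatever it reads back.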

Back to the original question: it appears the entire table contained only about 30 million colons and semicolons. Maybe the first 10 rows are not representative?
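A quick back-of-the-envelope check of that figure, using the sizes reported by information_schema and one byte per ASCII character:

```python
GB = 2 ** 30

# Data size dropped from 0.887 GB to 0.857 GB after the REPLACEs.
saved_bytes = (0.887 - 0.857) * GB

# At one byte per removed character, that is roughly how many ':' and ';'
# characters the table held.
print(round(saved_bytes / 1e6, 1))  # ~32.2 million
```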


Source: https://habr.com/ru/post/1238930/

