MySQL database structure: more columns or more rows

I collect how people tag topics with categories in a table, for example:

ID | topic_id | votes_Category_1 | votes_Category_2 |.......... | votes_Category_12 

Every hour this table is dumped into history tables and emptied. Suppose the table contains 2 million rows that get dumped to the history tables every hour.

This solution is not flexible if I want to add a Category_13 column, so I am considering this instead:

 ID | topic_id | Category_id | vote_count 

This solution creates 12 rows for each topic; it is better structured and more flexible, but I will have to dump 24 million rows every hour.
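To make the two layouts concrete, here is a runnable sketch of both schemas. sqlite3 stands in for MySQL, and the column types are my assumption (the question only names the columns); the point is how each schema absorbs a new Category_13.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema 1: one wide row per topic, one votes column per category.
cols = ", ".join(f"votes_Category_{i} INT DEFAULT 0" for i in range(1, 13))
conn.execute(
    f"CREATE TABLE topic_votes_wide (ID INTEGER PRIMARY KEY, topic_id INT, {cols})"
)

# Schema 2: one narrow row per (topic, category) pair.
conn.execute("""
    CREATE TABLE topic_votes_narrow (
        ID INTEGER PRIMARY KEY,
        topic_id INT,
        category_id INT,
        vote_count INT DEFAULT 0
    )
""")

# Adding a 13th category: schema 1 needs DDL on a live table,
# schema 2 just needs new rows.
conn.execute("ALTER TABLE topic_votes_wide ADD COLUMN votes_Category_13 INT DEFAULT 0")
conn.execute(
    "INSERT INTO topic_votes_narrow (topic_id, category_id, vote_count) VALUES (1, 13, 0)"
)
```

The asymmetry is the flexibility argument in a nutshell: the narrow schema turns a schema change into an ordinary insert.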

I need the top 10 topics in each category! Note that in case 2, using MAX on vote_count (WHERE category_id = x AND topic_id = y) would be slower than in case 1: ORDER BY votes_Category_x.
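For the "top 10 per category" pattern, the two query shapes might look like this (again sqlite3 as a stand-in for MySQL, with hypothetical sample data; the ranking comes out the same either way, the question is only which plan is cheaper):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wide (ID INTEGER PRIMARY KEY, topic_id INT,"
    " votes_Category_1 INT, votes_Category_2 INT)"
)
conn.execute(
    "CREATE TABLE narrow (ID INTEGER PRIMARY KEY, topic_id INT,"
    " category_id INT, vote_count INT)"
)
conn.executemany(
    "INSERT INTO wide (topic_id, votes_Category_1, votes_Category_2) VALUES (?, ?, ?)",
    [(1, 5, 9), (2, 7, 3), (3, 2, 8)],
)
conn.executemany(
    "INSERT INTO narrow (topic_id, category_id, vote_count) VALUES (?, ?, ?)",
    [(1, 1, 5), (1, 2, 9), (2, 1, 7), (2, 2, 3), (3, 1, 2), (3, 2, 8)],
)

# Case 1: the top topics for category 1 come straight off that category's column.
top_wide = conn.execute(
    "SELECT topic_id FROM wide ORDER BY votes_Category_1 DESC LIMIT 10"
).fetchall()

# Case 2: filter to the category's rows, then order by the shared vote column.
top_narrow = conn.execute(
    "SELECT topic_id FROM narrow WHERE category_id = 1"
    " ORDER BY vote_count DESC LIMIT 10"
).fetchall()
```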

Which one will be better, purely in terms of performance:

  • To have 2 million rows with 14 columns
  • To have 24 million rows with 4 columns

thanks

1 answer

I would look at the query patterns to decide on the approach.

  • If you query topics by category, I would go with the second approach: define an index on the category column so that all entries for a category are stored (roughly) adjacent on disk, resulting in fewer disk pages being loaded. The smaller record size, compared to a table with all categories as columns, also helps here. The advantage is the flexibility of adding more categories; the disadvantage is the repetition of the (ID, topic_id) columns, which increases the overall data size.

  • If you query by topic, then I would go with the first approach, with an index on the topic column. This avoids repeating the column values (ID, topic_id) for each category, thereby reducing the total data size, and since the number of rows is in the millions per hour, this size reduction should be significant. The disadvantage is the need to change the schema for new categories.
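The two bullets above map to two different indexes. A minimal sketch, with assumed table and index names (sqlite3 again standing in for MySQL; including vote_count in the category index is my addition, so the top-N ORDER BY can be served from the index):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE narrow (ID INTEGER PRIMARY KEY, topic_id INT,"
    " category_id INT, vote_count INT)"
)
conn.execute(
    "CREATE TABLE wide (ID INTEGER PRIMARY KEY, topic_id INT, votes_Category_1 INT)"
)

# Second approach: index leading on category_id keeps a category's rows
# together, so a per-category scan touches fewer pages.
conn.execute("CREATE INDEX idx_narrow_category ON narrow (category_id, vote_count)")

# First approach: index on topic_id for lookups by topic.
conn.execute("CREATE INDEX idx_wide_topic ON wide (topic_id)")
```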

Edit: reviewing the query patterns from your edit:

"I extract the top topics and their vote counts for each category, so I order by votes_Category_x in case 1."

I understand this as: find the top N topics with the largest number of votes in a given category.

"In case 2, I would like to find max(category) for each topic_id."

which is like SELECT topic_id, MAX(vote_count) FROM the table GROUP BY topic_id, category_id.

The record size differs between the 2-million-row and 24-million-row versions, and yes, the ID and topic_id are repeated, which adds 8 bytes to each record.

The first table stores 2 million records of 60 bytes (15 four-byte ints), and the second table stores 24 million records of 16 bytes (4 four-byte ints). The second table would add roughly 64K extra 4 KB pages every hour, which can add up over time. It will also suffer more fragmentation from inserts landing in the middle of the index, since the index is organized by category in the second approach.
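Redoing the size arithmetic from the paragraph above (using the per-row estimates stated there; real MySQL rows carry extra storage-engine overhead, so these are lower bounds):

```python
PAGE = 4 * 1024  # 4 KB page

wide_bytes = 2_000_000 * 60     # 2M rows x 60 bytes (15 four-byte ints)
narrow_bytes = 24_000_000 * 16  # 24M rows x 16 bytes (4 four-byte ints)

# Extra pages the narrow schema needs per hourly table versus the wide one.
extra_pages_per_hour = (narrow_bytes - wide_bytes) // PAGE
print(extra_pages_per_hour)  # about 64K extra 4 KB pages each hour
```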

It may be worthwhile to run some performance tests to understand this better, and to weigh how often categories are added, before committing to one of the table structures.


Source: https://habr.com/ru/post/1440954/

