Postgres: single table with many columns or multiple tables with fewer columns?

My question is about Postgres internals:

I have a table:

CREATE TABLE A (
    id              SERIAL,
    name            VARCHAR(32),
    type            VARCHAR(32) NOT NULL,
    priority        SMALLINT NOT NULL,
    x               SMALLINT NOT NULL,
    y               SMALLINT NOT NULL,
    start           timestamp with time zone,
    "end"           timestamp with time zone,  -- "end" is a reserved word and has to be quoted
    state           Astate NOT NULL,
    other_table_id1 bigint REFERENCES W,
    other_table_id2 bigint NOT NULL REFERENCES S,
    PRIMARY KEY (id)
);

with additional indices on other_table_id1, state and other_table_id2.
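
For completeness, the additional indexes are plain b-tree indexes, roughly like this (the index names here are made up):

    CREATE INDEX a_other_table_id1_idx ON A (other_table_id1);
    CREATE INDEX a_state_idx           ON A (state);
    CREATE INDEX a_other_table_id2_idx ON A (other_table_id2);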

The table is quite large, and the columns other_table_id1 and state are updated very frequently; start and "end" see a few updates, and the rest are effectively immutable. (Astate is an enumerated type representing the state of a row.)

I am wondering whether it makes sense to split the two most frequently updated columns out into a separate table. What I hope to gain is better performance when I only need that information, or cheaper updates, because (maybe?) reading and writing a shorter row is less expensive. But I need to weigh this against the cost of the joins that will (sometimes) be necessary when I want all the data for a particular item at once.
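
Concretely, the split I have in mind would look roughly like this; the table name and layout below are only a sketch, not a settled design:

    CREATE TABLE A_hot (
        a_id            integer PRIMARY KEY REFERENCES A (id),
        state           Astate NOT NULL,
        other_table_id1 bigint REFERENCES W
    );
    -- A itself would then drop state and other_table_id1, and queries that
    -- need everything would join A_hot back in on a_id.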

At some point I had the impression that each column is stored separately. But I changed my mind when I read somewhere that reducing the width of one column in a table has a positive effect on performance when searching on another column (because the row is stored as a unit, so the total row length is shorter). So my current impression is that all of a row's data is physically stored together on disk, which makes the proposed split sound useful: when I write 4 bytes to update state, am I currently also rewriting the 64 bytes of text (name, type) that never actually change?
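
One way I know of to get a feel for what a stored row currently weighs is pg_column_size() applied to whole row values; this only measures the existing table, nothing hypothetical:

    -- approximate on-disk size of the stored row values, in bytes
    SELECT avg(pg_column_size(A.*)) AS avg_row_bytes,
           count(*)                 AS rows_counted
    FROM A;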

I am not very well versed in "normalizing" tables and I am not familiar with Postgres internals, so I am looking for tips on how to evaluate the trade-off without first doing all the work and only then finding out whether it was worth it. The change would take considerable effort to rewrite queries that have already been optimized, so I would like to better understand what result to expect. Thank you, m.

+4
3 answers

There is a certain cost to updating a larger row.

A formula can help with this. If you don't split the table, your cost is:

Cost = xU + yS

Where:

U = cost of updating the whole row (unsplit table)

S = cost of a select

x, y = number of actions

Then, if you split it, you are looking at:

Cost = gU1 + hU2 + xS1 + yS2

Where

U1 = updating a smaller table (lower cost)

U2 = updating the larger table (also lower cost than U, because its rows are narrower than the original)

S1 = select from smaller table

S2 = select from a larger table

g, h, x, y = how often individual actions occur

So if g >> h, it pays to split them. Especially if x >> y, it really pays.
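
To make that concrete, here are the same formulas with made-up numbers (an illustration of the shape of the trade-off, not measurements). Unsplit, say U = 10 and S = 5, with 1000 updates and 200 selects:

    Cost = 1000*10 + 200*5 = 11000

Split, say U1 = 3, U2 = 8, S1 = 4, S2 = 5, almost all updates hit the small table (g = 950, h = 50) and every select still touches both tables (x = y = 200):

    Cost = 950*3 + 50*8 + 200*4 + 200*5 = 5050

With those numbers the split costs less than half as much, and the advantage grows the more lopsided g and h are.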

EDIT: In response to the comments, I would also note that these costs only really matter if the database is under a sustained transaction load. If instead the server is not under sustained load, and is basically idle with only 1 or 2 transactions per second and long stretches (where "long" = a few seconds) of inactivity, then if it were me I would not complicate my code, because the performance advantage would never be measurable.

+4

One PostgreSQL implementation detail that is relevant here is that it never updates rows in place on disk; it always writes new row versions. So there is no quick-update gain from putting fixed-width columns at the start of the row, as there is in Oracle, for example (iirc).
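
You can see this for yourself by watching ctid, the physical location of a row version, change across an update (the id value and the 'done' enum label below are just placeholders):

    SELECT ctid, state FROM A WHERE id = 1;    -- e.g. (0,1)
    UPDATE A SET state = 'done' WHERE id = 1;  -- any update will do
    SELECT ctid, state FROM A WHERE id = 1;    -- a different ctid: a new row version was written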

That said, grouping columns into different tables based on whether they tend to be updated together can mean less dead-row garbage to clean up. Experimenting and measuring are important here. If you have data that is updated frequently, you should also look at the "fillfactor" setting for the table, for example. It tells PostgreSQL to leave free space on table pages at insert time, so that updated row versions can be written to the same page as the previous version where possible; this can make updates cheaper, because it can mean that the indexes pointing at the row do not need to be updated, at the cost of the table taking up more disk space overall.
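
If you want to experiment with that, it is set per table; something like the following, where 70 is only a starting value to test with, not a recommendation:

    ALTER TABLE A SET (fillfactor = 70);
    -- this only affects pages written from now on; to repack existing pages
    -- you would have to rewrite the table (VACUUM FULL or CLUSTER), which
    -- takes an exclusive lock, so do it in a maintenance window.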

As Xaade mentioned, there is a lot that bears on this question. I would reinforce the point about measuring the impact of any change you make. Sometimes what looks like a big win turns out not to be one in practice.
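
One cheap way to measure is to run the hot statements under EXPLAIN (ANALYZE, BUFFERS) before and after any schema change and compare the timings and buffer counts. The statement below is only an example shape (the id, the enum label and the 42 are placeholders); note that EXPLAIN ANALYZE really executes the update, hence the rollback:

    BEGIN;
    EXPLAIN (ANALYZE, BUFFERS)
    UPDATE A SET state = 'done', other_table_id1 = 42 WHERE id = 1;
    ROLLBACK;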

+2

It is worth splitting it up, no matter how the columns are stored. You would have fewer concurrency problems, faster searches when you only need the partial data, faster index searches since the three indexes can be used without dragging the secondary keys along, and so on.

You can reduce the impact of the inner joins either by cheating, or simply by limiting how many rows are scanned at a time. You can cheat by putting an interface in front of the data instead of allowing direct queries: join in the extra data only for the rows currently visible (you can only fit so many rows on screen at a time), or only for the currently selected row, or only for the X rows reachable with the browse buttons. If you cheat like this, make sure you cache the results of the wider lookups.
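
As a sketch of the "only join what is on screen" idea, assuming the kind of split the question describes (A keeping the wide, rarely changing columns and a hypothetical A_hot table holding state and other_table_id1):

    -- fetch one screenful of items, joining the wide data only for those rows
    SELECT h.a_id, h.state, h.other_table_id1, a.name, a.type
    FROM A_hot AS h
    JOIN A     AS a ON a.id = h.a_id
    ORDER BY h.a_id
    LIMIT 50 OFFSET 0;  -- page size and offset are placeholders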

0

Source: https://habr.com/ru/post/1338141/

