The problem that prompts this question is the creation of ginormous inverted indices, similar to those used to build IR systems. A common mantra from the IR community is that a relational database is not suitable for building IR systems. In any case, if you look at Postgres, the overhead of a tuple is 23 bytes + padding (see "How much database disk space is required to store data from a regular text file?" in the Postgres Frequently Asked Questions). At this scale, that is prohibitive for my work.
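To see why 23 bytes plus padding hurts for narrow posting rows, here is a minimal back-of-the-envelope sketch. The posting payload (three 4-byte columns), the 8-byte alignment boundary, the 4-byte line pointer, and the billion-row count are all illustrative assumptions on my part, not measured figures.

```python
# Rough estimate of per-tuple overhead for a narrow inverted-index posting
# table in Postgres (term_id, doc_id, position). Constants are assumptions
# for illustration, not measurements.

TUPLE_HEADER = 23          # tuple header size quoted in the Postgres FAQ
ALIGN = 8                  # assumed alignment/padding boundary (64-bit build)
ITEM_POINTER = 4           # assumed per-tuple line pointer in the page

def row_on_disk(payload_bytes: int) -> int:
    """Payload plus header, rounded up to the alignment boundary, plus line pointer."""
    unaligned = TUPLE_HEADER + payload_bytes
    aligned = (unaligned + ALIGN - 1) // ALIGN * ALIGN
    return aligned + ITEM_POINTER

payload = 4 + 4 + 4                      # three int4 columns
postings = 1_000_000_000                 # hypothetical posting count
total = row_on_disk(payload) * postings
print(f"{row_on_disk(payload)} bytes/row -> ~{total / 2**30:.1f} GiB "
      f"for {postings:,} postings")
```

With those assumptions the 12 bytes of actual data balloon to roughly 44 bytes on disk per posting, i.e. most of the file is overhead.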
By the way, my data set is 17 lines of text, requiring 4-5 tables depending on how the problem is cut. I remember trying a schema in SQLite, and the db file broke 100 GB.
I am very interested in the per-row overhead for SQL Server, MySQL, SQLite, Berkeley DB (all of its access methods), Berkeley DB's SQLite3 interface, Kyoto Cabinet, Tokyo Cabinet, and Firebird. No single person is likely to answer the whole question, but I figure someone as curious as I am has looked into at least part of it.
Edit: what I have found so far.
- Postgres: 23-byte (OMG!) header + padding.
- BDB hash: 26-byte page overhead, 6-byte key/data overhead (combined).
- BDB btree: 26-byte page overhead, 10-byte key/data overhead (combined).
- MySQL InnoDB: analyzed here (5-byte header + 6-byte transaction ID + 7-byte roll pointer = 18 bytes per row, AFAIK). Note to self: why does the transaction ID appear on disk? What are roll pointers?
- SQL Server: from here. It stores the lengths of variable-length columns, so rows with only fixed-length data types have very modest overhead. Overhead estimates are highly dependent on the nature of the schema and the data, and grow with the number of variable-length columns. (A rough comparison of the figures above follows this list.)
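Pulling the fixed figures above into one place, here is a rough comparison sketch. The 12-byte payload, the billion-row count, and amortizing BDB's page overhead over 100 entries per page are all assumptions chosen for illustration; SQL Server and SQLite are omitted because their overhead depends on the schema.

```python
# Rough per-row storage estimate using the overhead figures collected above.
# Payload is assumed fixed-width; page-level overheads are amortized over an
# assumed number of entries per page (a guess; tune for your page size).

PER_ROW_OVERHEAD = {
    "postgres": 23,     # tuple header, before alignment padding
    "bdb-hash": 6,      # key/data overhead, combined
    "bdb-btree": 10,    # key/data overhead, combined
    "innodb": 18,       # 5-byte header + 6-byte trx id + 7-byte roll pointer
}
PAGE_OVERHEAD = {"bdb-hash": 26, "bdb-btree": 26}   # amortized per entry below
ENTRIES_PER_PAGE = 100                              # assumption

def estimate_gib(engine: str, payload: int, rows: int) -> float:
    """Total size in GiB for `rows` rows of `payload` bytes each."""
    per_row = payload + PER_ROW_OVERHEAD[engine]
    per_row += PAGE_OVERHEAD.get(engine, 0) / ENTRIES_PER_PAGE
    return per_row * rows / 2**30

for engine in PER_ROW_OVERHEAD:
    print(f"{engine:10s} ~{estimate_gib(engine, payload=12, rows=1_000_000_000):.1f} GiB")
```

This ignores indexes, free space, and alignment, so treat the output as a lower bound on file size rather than a prediction.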