How can I approximate the size of a data structure in Scala?

I have a query that returns about 6 million rows, which is too many to process in memory all at once.

Each query returns a Tuple3[String, Int, java.sql.Timestamp]. I know that the string is never more than about 20 characters, UTF8.

How can I determine the maximum size of one of these tuples and, more generally, how can I approximate the size of a Scala data structure like this?

I have 6 GB on the machine I'm using. However, the data is read from the database using scala-query into Scala Lists.

+6
2 answers

Scala objects follow approximately the same rules as Java objects, so any information about Java object sizes applies. Here is one source that seems at least mostly right for 32-bit JVMs. (64-bit JVMs use 8 bytes per pointer, which generally works out to 4 bytes of extra overhead plus 4 more bytes per pointer, but it may be less if the JVM uses compressed pointers, which I think it now does by default.)

I'll assume a 64-bit machine without compressed pointers (the worst case). Then a Tuple3 has two pointers (16 bytes) plus an Int (4 bytes) plus object overhead (~12 bytes), rounded up to a multiple of 8, which gives 32 bytes, plus an extra object (8 bytes) as a stub for the non-specialized version of Int. (Sadly, if you use primitives in tuples, they take even more space than the wrapped versions.) A String is 32 bytes, IIRC, plus the array for the data, which is 16 bytes plus 2 per character. java.sql.Timestamp needs to store a couple of Longs (I think), so 32 bytes. All told, that is on the order of 120 bytes plus two per character, which is ~160 bytes at ~20 characters.
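The arithmetic above can be written out as a small sketch. The object and field names here (TupleSizeEstimate, estimate, and so on) are made up for illustration, and the constants are just the approximate figures quoted in this answer for a 64-bit JVM without compressed pointers, not values guaranteed by any JVM.

```scala
// Back-of-the-envelope size of one (String, Int, java.sql.Timestamp) tuple.
// All constants are the rough figures from the answer, in bytes.
object TupleSizeEstimate {
  val tuple3Bytes     = 32 // 2 pointers (16) + Int (4) + ~12 overhead, rounded up to a multiple of 8
  val intStubBytes    = 8  // extra object for the non-specialized Int
  val stringBytes     = 32 // the String object itself
  val charArrayHeader = 16 // fixed overhead of the backing array
  val bytesPerChar    = 2  // per character in the backing array
  val timestampBytes  = 32 // java.sql.Timestamp

  // Estimated bytes for one tuple whose string has `chars` characters.
  def estimate(chars: Int): Int =
    tuple3Bytes + intStubBytes + stringBytes +
      charArrayHeader + bytesPerChar * chars + timestampBytes
}
```

With these figures, TupleSizeEstimate.estimate(0) is 120 and TupleSizeEstimate.estimate(20) is 160.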

Alternatively, see this answer for a way to measure the size of your objects directly. When I measure it this way, I get 160 bytes (and my estimate above was corrected using this data so that it matches; I had a few small errors before).
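The linked answer is not reproduced here. As a rough stand-in, one commonly used approach is to compare the used heap before and after allocating a large number of tuples. The sketch below assumes that approach; MeasureTuple, usedHeap and bytesPerTuple are hypothetical names, System.gc() is only a hint to the JVM, and the result will vary with JVM version and flags (for example, compressed pointers and compact strings on newer JVMs will likely give a figure below 160 bytes).

```scala
import java.sql.Timestamp

// Rough heap-delta measurement; an approximation, not an exact object size.
object MeasureTuple {
  private val rt = Runtime.getRuntime

  // Used heap after asking for a GC (System.gc() is only a request).
  def usedHeap(): Long = {
    System.gc()
    Thread.sleep(100)
    rt.totalMemory() - rt.freeMemory()
  }

  // Average heap growth per tuple when allocating n tuples.
  def bytesPerTuple(n: Int): Double = {
    // Allocate the holder first so it is part of the baseline.
    val holder = new Array[(String, Int, Timestamp)](n)
    val before = usedHeap()
    var i = 0
    while (i < n) {
      // A fresh 20-character string per tuple, so nothing is shared.
      holder(i) = (f"$i%020d", i, new Timestamp(i.toLong))
      i += 1
    }
    val after = usedHeap()
    // Touch the holder so it stays reachable until after the measurement.
    require(holder(n - 1) != null)
    (after - before).toDouble / n
  }
}
```

Calling MeasureTuple.bytesPerTuple(200000) a few times and taking a typical value gives a more stable figure than a single run.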

+6

How much memory do you have at your disposal? 6 million instances of a triple is really not very much!

Each reference has an overhead of either 4 or 8 bytes, depending on whether you are running 32-bit or 64-bit (without compressed "oops", although compression is the default in JDK7 for heaps under 32 GB).

So your triple has 3 references (there may be extra ones due to specialization, so you might get 4 refs), and your Timestamp is a wrapper (a reference) around a long (8 bytes). Your Int will be specialized (i.e. an underlying int), so that makes another 4 bytes. The String is 20 x 2 bytes. So you basically have a worst case of under 100 bytes per row; that is 10 rows per KB, or 10,000 rows per MB. Hence you can comfortably process your 6 million rows in under 1 GB of heap.
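To make the heap budget explicit, here is a minimal sketch of the same arithmetic. HeapBudget and its methods are illustrative names only, megabytes are taken as decimal (10^6 bytes), and the calculation ignores the List cells and any other per-collection overhead.

```scala
// Total heap needed for a given row count and per-row size estimate.
object HeapBudget {
  // Total bytes for `rows` rows at `bytesPerRow` bytes each.
  def totalBytes(rows: Long, bytesPerRow: Long): Long = rows * bytesPerRow

  // Convert bytes to decimal megabytes.
  def toMB(bytes: Long): Double = bytes / 1e6
}
```

At 100 bytes per row, HeapBudget.toMB(HeapBudget.totalBytes(6000000L, 100L)) is 600.0. Even with the 160-byte estimate from the other answer, the total is 960.0 MB, still under 1 GB.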

Frankly, I think I have made a mistake here, because every day we process several million rows of about twenty fields each (including decimals, strings, etc.) in this much space.

+2

Source: https://habr.com/ru/post/919038/

