Java: tips for handling large amounts of data. (Part Deux)

OK, so I have a very large amount of binary data (say, 10 GB) spread across a bunch of files (say, 5,000) of various lengths.

I am writing a Java application to process this data, and I want a good design for data access. In general, the following holds:

  • One way or another, all of the data will be read during processing.
  • Each file is (usually) read sequentially, a few kilobytes at a time. However, it is often necessary to have, say, the first few kilobytes of every file at once, or a few-kilobyte chunk of every file at once, and so on.
  • There are times when the application will want random access to a byte or two here and there.

I am currently using the RandomAccessFile class to read into byte arrays (and ByteBuffers). My ultimate goal is to encapsulate data access in a class so that it is fast and I no longer have to worry about it. The core functionality is that I will ask it to read frames of data from specified files, and I want it to minimize I/O operations, given the considerations above.
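For concreteness, what I do today looks roughly like this (a simplified sketch; the class and method names are mine):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class FrameReader {
    // Seek to the frame's offset and read it fully into a byte array.
    byte[] readFrame(File file, long offset, int length) throws IOException {
        byte[] frame = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);
            raf.readFully(frame);
        }
        return frame;
    }
}
```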

Typical Access Examples:

  • Give me the first 10 kilobytes of all my files!
  • Give me bytes 0 through 999 of file F, then bytes 1 through 1000, then bytes 2 through 1001, and so on.
  • Give me a megabyte of data from file F, starting at such-and-such a byte!
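
In other words, I imagine wrapping all of this behind something like the following interface (a purely hypothetical sketch; the names are made up):

```java
import java.io.IOException;
import java.util.List;

// Hypothetical facade over the whole file set; each method corresponds
// to one of the access patterns above.
interface FrameStore {
    // "Give me the first N bytes of all my files!"
    List<byte[]> headOfEveryFile(int bytes) throws IOException;

    // "Give me bytes [offset, offset + length) of file F."
    byte[] read(String fileId, long offset, int length) throws IOException;
}
```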

Any suggestions for a good design?

+4
9 answers

Use Java NIO and MappedByteBuffers, and treat your files as a list of byte arrays. Then let the OS worry about the details of caching, read-ahead, eviction, and so on.
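A minimal sketch of the approach (assuming each file fits in a single mapping; one MappedByteBuffer is limited to 2 GB, so larger files would need several mapped regions):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MappedRead {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file read-only; the OS pages data in on demand,
            // so reads become plain memory accesses.
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

            byte[] frame = new byte[1024];
            buf.position(0);   // random access is just a position change
            buf.get(frame);    // copy one frame out of the mapping
        }
    }
}
```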

+9

@Will

Pretty good results. Reading a large binary file:

  • Test 1 - basic sequential read with RandomAccessFile: 2656 ms

  • Test 2 - basic sequential read with buffering: 47 ms

  • Test 3 - basic sequential read with MappedByteBuffers plus further frame-buffering optimization: 16 ms
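
The buffered and mapped variants were along the following lines (a reconstruction of the idea, not the exact benchmark code):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class ReadVariants {
    // Test 2 style: sequential read through a buffered stream.
    static long sumBuffered(String path) throws IOException {
        long sum = 0;
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream(path), 64 * 1024)) {
            int b;
            while ((b = in.read()) != -1) {
                sum += b;
            }
        }
        return sum;
    }

    // Test 3 style: sequential read through a memory mapping.
    static long sumMapped(String path) throws IOException {
        long sum = 0;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf =
                ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            while (buf.hasRemaining()) {
                sum += buf.get() & 0xFF;
            }
        }
        return sum;
    }
}
```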

+2

Wow. You are basically implementing a database from scratch. Is there any way to import the data into a real DBMS and just use SQL?

If you do it yourself, you will eventually want to implement some kind of caching mechanism, so that data you need is served from RAM whenever it is already there, and you drop down to reading and writing the files only when it is not.
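
A minimal version of such a cache comes almost for free with LinkedHashMap in access order (a sketch; the key scheme and capacity are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an LRU cache for fixed-size file blocks, keyed by
// "fileId:blockIndex". LinkedHashMap in access order evicts the
// least recently used block once capacity is exceeded.
class BlockCache extends LinkedHashMap<String, byte[]> {
    private final int maxBlocks;

    BlockCache(int maxBlocks) {
        super(16, 0.75f, true);   // true = access order, not insertion order
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;
    }
}
```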

Of course, this also entails a lot of complex transactional logic to make sure your data remains consistent.

+1

I was about to suggest that you follow Eric's idea of a database and learn how databases manage their buffers, effectively implementing their own virtual memory management.

But as I thought about it more, I concluded that most operating systems already do file-system caching better than you are likely to manage without low-level access from Java.

There is one lesson from database buffer management that you might consider, though. Databases use knowledge of the query plan to optimize their buffer-management strategy.

In a relational database, it is often best to evict the most recently used block from the cache. For example, a "young" block holding a child record in a join will not be needed again, while the block holding its parent record is still in use, even though it is "older".

Operating system file caches, on the other hand, are optimized for re-use of recently read data (and for read-ahead of sequentially read data). If your application does not match that pattern, it may be worth managing the cache yourself.
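
If you do end up managing it yourself, note that the LinkedHashMap trick sketched above gives you LRU eviction, while the join example here calls for something closer to MRU eviction. A crude, hypothetical sketch of the difference:

```java
import java.util.HashMap;
import java.util.Map;

// Crude MRU-eviction cache: when full, evict the block that was
// touched most recently (a policy sketch, not production code).
class MruBlockCache {
    private final Map<String, byte[]> blocks = new HashMap<>();
    private final int maxBlocks;
    private String mostRecentKey;

    MruBlockCache(int maxBlocks) { this.maxBlocks = maxBlocks; }

    byte[] get(String key) {
        byte[] block = blocks.get(key);
        if (block != null) mostRecentKey = key;
        return block;
    }

    void put(String key, byte[] block) {
        if (blocks.size() >= maxBlocks && mostRecentKey != null) {
            blocks.remove(mostRecentKey);  // evict the MRU block, not the LRU one
        }
        blocks.put(key, block);
        mostRecentKey = key;
    }
}
```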

+1

You might want to take a look at an open-source database project called jdbm - a lot of this kind of machinery has gone into it, including ACID capabilities.

I have made a number of contributions to the project, and it would be worth reviewing the source code, if for nothing else than to see how we solved many of the same problems you may be working on.

Now, if your data files are not under your control (i.e., you are parsing files created by someone else, etc.), then the paged storage that jdbm uses may not suit you. But if all of these files are files that you create and work with yourself, it might be worth a look.

+1

@Eric

But my queries will be much simpler than anything I would need SQL for. And wouldn't a database access be much more expensive than reading the binary data directly?

0

This one comes down to minimizing I/O traffic. On the Java side, all you can really do is wrap your readers in BufferedReaders. Beyond that, the operating system will handle other optimizations, such as keeping recently read data in the page cache and performing read-ahead on open files to speed up sequential reads. There is no point in doing additional buffering in Java (although you will still need a byte buffer to return the data to the client).
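
For binary data, the analogous wrapper is BufferedInputStream; the entire optimization is one constructor call (the 64 KB buffer size here is an arbitrary choice):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class Buffered {
    static InputStream open(String path) throws IOException {
        // Reads now hit the OS in 64 KB chunks instead of
        // one system call per read().
        return new BufferedInputStream(new FileInputStream(path), 64 * 1024);
    }
}
```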

0

Someone recommended Hadoop (http://hadoop.apache.org) to me just the other day. It looks like it could be a pretty good fit, and it seems to be getting some traction.

0

I would take a step back and ask why you are using files as your system of record, and what benefits a database would give you. A database certainly gives you the ability to structure your data, and given the SQL standard, it may be more maintainable in the long run.

On the other hand, your file data may not be easily structured within the confines of a database. The largest search company in the world :) does not use a database for its business processing. See here and here.

0

Source: https://habr.com/ru/post/1277223/

