Clojure Time Series Analysis

I have a large stock data set (200 GB uncompressed, 9 GB compressed with bzip2 -9).

I want to start doing some basic time series analysis on it.

My machine has 16 GB of RAM.

I would prefer to:

  • keep all the data compressed in memory

  • decompress it on the fly and stream it through [so that nothing ever hits the disk]

  • do all the analysis in memory

I think there are nice interactions here with Clojure's laziness and with future objects (i.e. I can define objects such that, when I try to access them, they are decompressed on the fly).

Question: What should I keep in mind when doing high-performance time series analysis in Clojure?

I am particularly interested in tricks involving:

  • efficient storage of tick data in memory

  • efficient computation

  • unusual convolutions that reduce the number of passes over the data

Suggestions of books, articles, and research papers are welcome (I am a PhD student).

Thanks.

+4
2 answers

Some ideas:

  • As for storing the compressed data, I don't think you can do much better than your OS's own file system cache. Just make sure it is configured to use 11 GB+ of RAM for file system caching, and it should pull the entire compressed data set into memory the first time the data is read.
  • You can then write your Clojure code to pull in the data lazily through a ZipInputStream, which performs the decompression for you. (ZipInputStream reads only zip archives; for bz2 files you would need a bzip2-capable stream instead, for example BZip2CompressorInputStream from Apache Commons Compress.)
  • If you need a second pass over the data, just create a new ZipInputStream on the same file. OS-level caching should ensure that you do not hit the disk again.
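A minimal sketch of the lazy-read idea from the list above, assuming a zip archive with a single CSV entry of `time,price` lines. The archive is built in memory only to keep the example self-contained; the names `sample-zip-bytes`, `parse-tick`, and `reduce-ticks` and the record layout are hypothetical and not part of the original answer. For a real file you would wrap a `FileInputStream` rather than a `ByteArrayInputStream`. Note that `java.util.zip` does not read bzip2, so this sketch sticks with zip.

```clojure
(ns example.lazy-zip
  (:require [clojure.java.io :as io]
            [clojure.string :as str])
  (:import (java.io ByteArrayInputStream ByteArrayOutputStream)
           (java.util.zip ZipEntry ZipInputStream ZipOutputStream)))

;; Build a tiny zip archive in memory (stand-in for an OS-cached file).
(defn sample-zip-bytes ^bytes []
  (let [baos (ByteArrayOutputStream.)]
    (with-open [zos (ZipOutputStream. baos)]
      (.putNextEntry zos (ZipEntry. "ticks.csv"))
      (.write zos (.getBytes "1,100.5\n2,101.0\n3,99.75\n" "UTF-8"))
      (.closeEntry zos))
    (.toByteArray baos)))

;; Parse one "time,price" line into a map (assumed record layout).
(defn parse-tick [^String line]
  (let [[t p] (str/split line #",")]
    {:time (Long/parseLong t) :price (Double/parseDouble p)}))

;; Open a fresh ZipInputStream, move to the first entry, and reduce over
;; lazily parsed ticks. The reduction runs inside with-open, so the lazy
;; sequence is fully consumed before the stream is closed.
(defn reduce-ticks [f init ^bytes zip-bytes]
  (with-open [zis (ZipInputStream. (ByteArrayInputStream. zip-bytes))]
    (.getNextEntry zis)
    (reduce f init (map parse-tick (line-seq (io/reader zis))))))

(def zip-bytes (sample-zip-bytes))

;; First pass: maximum price.
(def max-price
  (reduce-ticks (fn [acc tick] (max acc (:price tick)))
                Double/NEGATIVE_INFINITY
                zip-bytes))            ; 101.0

;; Second pass: a new stream over the same bytes, as the answer suggests.
(def tick-count
  (reduce-ticks (fn [n _] (inc n)) 0 zip-bytes)) ; 3
```

Only one decompressed line needs to be live at a time, as long as the reducing function does not hold on to the head of the sequence.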
+3

I have heard of systems like this being implemented in Java, so it is possible. You will certainly want to understand how to create your own lazy sequences to achieve it. I also would not hesitate to drop down into Java if you need to be sure you are working with the primitive types you want. For example, Clojure will not generate code for 32-bit int arithmetic; it only generates code that works with longs, and if that is not what you want, it can be a pain.
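Before dropping into Java, primitive arrays with type hints often go a long way in Clojure itself. The following is a rough sketch, assuming ticks are stored column-wise; `mean-price` and `sum-ints` are hypothetical helper names. It also illustrates the point about ints: values can be stored in a 32-bit `int-array` to save memory, but the arithmetic on them is widened to `long`.

```clojure
;; Mean of a primitive double array in a single pass.
;; The ^doubles hint lets aget compile to a direct array read, and the
;; loop locals stay unboxed (i is a long, sum is a double).
(defn mean-price
  ^double [^doubles prices]
  (let [n (alength prices)]
    (loop [i 0 sum 0.0]
      (if (< i n)
        (recur (inc i) (+ sum (aget prices i)))
        (/ sum n)))))

;; Values stored as 32-bit ints (for example, prices in cents, an assumed
;; encoding). Each element is widened to long when added to the accumulator.
(defn sum-ints
  ^long [^ints xs]
  (areduce xs i acc 0 (+ acc (aget xs i))))

(def prices (double-array [100.0 101.0 102.0]))
(def cents  (int-array [10050 10100 9975]))

(mean-price prices) ; 101.0
(sum-ints cents)    ; 30125
```

A `double-array` of n elements takes roughly 8n bytes and an `int-array` roughly 4n, whereas a vector of boxed numbers carries per-object overhead on top of that.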

It would also be worth the effort to make your in-memory format compatible with an on-disk format. This gives you the option of memory-mapping files, or at the very least makes startup easy if your program crashes: it can simply read the files on disk to restore its previous state.
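As an illustration of a disk-compatible layout, here is a small sketch that writes fixed-width binary records and reads them back through a memory-mapped buffer. The 16-byte record (8-byte long timestamp followed by 8-byte double price) and the names `write-ticks!`, `map-ticks`, and `tick-at` are assumptions made for this example. A single mapping is limited to 2 GB, so a real data set would need several mappings.

```clojure
(ns example.mmap
  (:import (java.io File RandomAccessFile)
           (java.nio ByteBuffer)
           (java.nio.channels FileChannel$MapMode)))

;; Assumed layout: 8-byte long timestamp + 8-byte double price.
(def ^:const record-size 16)

;; Write [time price] pairs as fixed-width big-endian records.
(defn write-ticks! [^File file ticks]
  (with-open [raf (RandomAccessFile. file "rw")]
    (doseq [[t p] ticks]
      (.writeLong raf (long t))
      (.writeDouble raf (double p)))))

;; Map the whole file read-only. The mapping stays valid after the
;; channel is closed; the OS pages data in on demand.
(defn map-ticks ^ByteBuffer [^File file]
  (with-open [raf (RandomAccessFile. file "r")]
    (let [ch (.getChannel raf)]
      (.map ch FileChannel$MapMode/READ_ONLY 0 (.size ch)))))

;; Random access to record i using absolute reads (ByteBuffer defaults
;; to big-endian, matching RandomAccessFile's output).
(defn tick-at [^ByteBuffer buf ^long i]
  (let [off (* i record-size)]
    {:time  (.getLong buf (int off))
     :price (.getDouble buf (int (+ off 8)))}))

(def demo-file
  (doto (File/createTempFile "ticks" ".bin")
    (.deleteOnExit)))

(write-ticks! demo-file [[1 100.5] [2 101.0] [3 99.75]])

(def buf (map-ticks demo-file))

(tick-at buf 1) ; {:time 2, :price 101.0}
```

Because the records on disk are exactly what the analysis code reads, restarting after a crash only requires mapping the files again.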

+1

Source: https://habr.com/ru/post/1433478/

