The fastest way to read files in a multiprocessing environment? C#

I have the following task:

I have an Azure cloud worker role with many instances. Every minute, each instance spins up about 20-30 threads. Each thread needs to read some metadata describing how to handle a stream, from 3 objects. The objects/data live in a remote RavenDB, and although RavenDB retrieves objects over HTTP very quickly, it is still under significant load from 30+ workers hitting it 3 times per stream per minute (about 45 requests/s). In the vast majority of cases (99.999%, say), the data in RavenDB does not change.

I decided to implement local file caching. First I read a tiny record that indicates whether the metadata has changed (it rarely does), and then, if the local storage has a cached copy of the object, I read it from the local file system instead of RavenDB, using File.ReadAllText().
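To make that concrete, here is a minimal sketch of the check-then-read pattern I described (the LocalFileCache class, the ".json" suffix, and the fetchRemote callback are illustrative assumptions, not my actual code):

    using System;
    using System.IO;

    public class LocalFileCache
    {
        private readonly string _cacheDir;

        public LocalFileCache(string cacheDir)
        {
            _cacheDir = cacheDir;
            Directory.CreateDirectory(_cacheDir); // no-op if it already exists
        }

        // hasChanged: result of the tiny change-marker read from RavenDB.
        // fetchRemote: fallback that pulls the document over HTTP.
        public string Read(string id, bool hasChanged, Func<string, string> fetchRemote)
        {
            string path = Path.Combine(_cacheDir, id + ".json");
            if (!hasChanged && File.Exists(path))
            {
                return File.ReadAllText(path); // cache hit: read from local disk
            }

            string json = fetchRemote(id);     // cache miss: go to RavenDB
            File.WriteAllText(path, json);
            return json;
        }
    }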

This approach seems to bog the machine down, and processing gets slower and slower. I assume the disks on "small" worker roles are simply not fast enough.

Is there any way I can get the OS to help me out and cache these files? Or perhaps there is an alternative way to cache this data?

In total, I am looking at ~1000 files of various sizes, ranging from 100 KB to 10 MB, stored on each role instance in the cloud.

1 answer

Not a direct answer, but three possible options:

Use the built-in caching mechanism of RavenDB

My initial guess is that your custom caching mechanism is actually hurting performance. The RavenDB client already has caching built in (see here how to configure it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching).
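As a rough illustration of what that looks like with the 3.5 client (the StreamMetadata type, server URL, database name, and the 5-minute duration below are placeholders, not details from your setup):

    using System;
    using Raven.Client.Document;

    // Placeholder for whatever your metadata documents actually hold.
    public class StreamMetadata
    {
        public string Id { get; set; }
        public string HandlingRules { get; set; }
    }

    public static class MetadataReader
    {
        private static readonly DocumentStore Store = new DocumentStore
        {
            Url = "http://your-raven-server:8080", // placeholder
            DefaultDatabase = "Metadata"           // placeholder
        };

        static MetadataReader()
        {
            Store.Initialize();
        }

        public static StreamMetadata Load(string id)
        {
            // While this scope is open, matching requests may be answered
            // from the client's local cache for up to 5 minutes without
            // going back to the server.
            using (Store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
            using (var session = Store.OpenSession())
            {
                return session.Load<StreamMetadata>(id);
            }
        }
    }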

The catch is that this cache is local to each server. If server A fetched a file earlier, server B will still need to fetch it itself the next time it happens to process that file.

One option that you could implement is to split the workload. For instance:

  • Server A => fetches files starting with A-D
  • Server B => fetches files starting with E-H
  • Server C => ...

This keeps each server's cache warm for its own slice of the files; a sketch of one way to partition the work follows.
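Alphabetical ranges can skew the load, so a stable hash of the file id is another way to assign owners. A minimal sketch, assuming each instance knows its own index and the total instance count (WorkPartitioner and IsMine are made-up names):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    public static class WorkPartitioner
    {
        // string.GetHashCode() is not guaranteed to be stable across
        // processes, so hash with MD5 so that every instance agrees.
        private static int StableHash(string key)
        {
            using (var md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
                return BitConverter.ToInt32(hash, 0) & int.MaxValue;
            }
        }

        // True if the instance with index myIndex (out of instanceCount)
        // is responsible for the given file.
        public static bool IsMine(string fileId, int myIndex, int instanceCount)
        {
            return StableHash(fileId) % instanceCount == myIndex;
        }
    }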

Get a bigger machine

If you still want to use your own caching mechanism, there are two things that I think might be the bottleneck:

  • Disk access
  • JSON deserialization

For these problems, the only remedy I can think of is more resources:

  • If it's the disk, use Premium Storage with SSDs.
  • If it's deserialization, get a VM with more CPU.

Cache the files in RAM

Alternatively, instead of writing the files to disk, keep them in memory and get a VM with plenty of RAM. You don't need that much RAM: even at the upper bound, 1000 files * 10 MB is still only about 10 GB. This eliminates both the disk access and the repeated deserialization.
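A minimal sketch of such an in-memory cache (the generic type and the loader delegate are assumptions; caching the already-deserialized objects is what removes the JSON parsing cost):

    using System;
    using System.Collections.Concurrent;

    // Caches already-deserialized objects, so a hit costs neither disk
    // I/O nor JSON parsing. T would be your metadata type.
    public class InMemoryCache<T>
    {
        private readonly ConcurrentDictionary<string, Lazy<T>> _cache =
            new ConcurrentDictionary<string, Lazy<T>>();
        private readonly Func<string, T> _loader;

        public InMemoryCache(Func<string, T> loader)
        {
            _loader = loader; // goes to RavenDB on a cache miss
        }

        public T GetOrLoad(string id)
        {
            // Lazy<T> is thread-safe by default, so the loader runs only
            // once per key even when 20-30 threads ask at the same moment.
            return _cache.GetOrAdd(id, key => new Lazy<T>(() => _loader(key))).Value;
        }

        // Call this when the tiny change-marker record says the data changed.
        public void Invalidate(string id)
        {
            Lazy<T> removed;
            _cache.TryRemove(id, out removed);
        }
    }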

But in the end, it’s best to first determine where the bottleneck is and see if it can be mitigated using the built-in RavenDB caching mechanism.


Source: https://habr.com/ru/post/1013406/

