The fastest way to read files in a multiprocessing environment? C#

I have the following task:

I have an Azure cloud worker role with many instances. Every minute, each instance spins up about 20-30 threads. Each thread needs to read some metadata describing how to handle a stream, from 3 objects. The objects/data live in a remote RavenDB, and although RavenDB retrieves objects over HTTP very quickly, it is still under significant load from 30+ workers hitting it 3 times per stream per minute (about 45 requests/s). In the vast majority of cases (99.999%, say), the data in RavenDB does not change.

I decided to implement local file caching. First I read a tiny record that indicates whether the metadata has changed (it rarely does), and then, if the local storage has a cached copy of the object, I read it from the local file system instead of RavenDB, using File.ReadAllText().
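To make that concrete, here is a minimal sketch of the check-then-read pattern I described (the LocalFileCache class, the ".json" suffix, and the fetchRemote callback are illustrative assumptions, not my actual code):

    using System;
    using System.IO;

    public class LocalFileCache
    {
        private readonly string _cacheDir;

        public LocalFileCache(string cacheDir)
        {
            _cacheDir = cacheDir;
            Directory.CreateDirectory(_cacheDir); // no-op if it already exists
        }

        // hasChanged: result of the tiny change-marker read from RavenDB.
        // fetchRemote: fallback that pulls the document over HTTP.
        public string Read(string id, bool hasChanged, Func<string, string> fetchRemote)
        {
            string path = Path.Combine(_cacheDir, id + ".json");
            if (!hasChanged && File.Exists(path))
            {
                return File.ReadAllText(path); // cache hit: read from local disk
            }

            string json = fetchRemote(id);     // cache miss: go to RavenDB
            File.WriteAllText(path, json);
            return json;
        }
    }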

This approach seems to bog the machine down, and processing gets slower and slower. I assume the disks on "small" worker roles are simply not fast enough.

Is there any way I can get the OS to help me out and cache these files? Or perhaps there is an alternative way to cache this data?

In total, I am looking at ~1000 files of various sizes, ranging from 100 KB to 10 MB, stored on each role instance in the cloud.

1 answer

Not a direct answer, but three possible options:

Use the built-in caching mechanism of RavenDB

My initial guess is that your custom caching mechanism is actually hurting performance. The RavenDB client already has caching built in (see here how to configure it: https://ravendb.net/docs/article-page/3.5/csharp/client-api/how-to/setup-aggressive-caching).
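As a rough illustration of what that looks like with the 3.5 client (the StreamMetadata type, server URL, database name, and the 5-minute duration below are placeholders, not details from your setup):

    using System;
    using Raven.Client.Document;

    // Placeholder for whatever your metadata documents actually hold.
    public class StreamMetadata
    {
        public string Id { get; set; }
        public string HandlingRules { get; set; }
    }

    public static class MetadataReader
    {
        private static readonly DocumentStore Store = new DocumentStore
        {
            Url = "http://your-raven-server:8080", // placeholder
            DefaultDatabase = "Metadata"           // placeholder
        };

        static MetadataReader()
        {
            Store.Initialize();
        }

        public static StreamMetadata Load(string id)
        {
            // While this scope is open, matching requests may be answered
            // from the client's local cache for up to 5 minutes without
            // going back to the server.
            using (Store.AggressivelyCacheFor(TimeSpan.FromMinutes(5)))
            using (var session = Store.OpenSession())
            {
                return session.Load<StreamMetadata>(id);
            }
        }
    }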

The catch is that this cache is local to each server. If server A fetched a file earlier, server B will still need to fetch it itself the next time it happens to process that file.

One option that you could implement is to split the workload. For instance:

  • Server A => fetches files starting with A-D
  • Server B => fetches files starting with E-H
  • Server C => ...

This keeps each server's cache warm for its own slice of the files; a sketch of one way to partition the work follows.
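Alphabetical ranges can skew the load, so a stable hash of the file id is another way to assign owners. A minimal sketch, assuming each instance knows its own index and the total instance count (WorkPartitioner and IsMine are made-up names):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    public static class WorkPartitioner
    {
        // string.GetHashCode() is not guaranteed to be stable across
        // processes, so hash with MD5 so that every instance agrees.
        private static int StableHash(string key)
        {
            using (var md5 = MD5.Create())
            {
                byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key));
                return BitConverter.ToInt32(hash, 0) & int.MaxValue;
            }
        }

        // True if the instance with index myIndex (out of instanceCount)
        // is responsible for the given file.
        public static bool IsMine(string fileId, int myIndex, int instanceCount)
        {
            return StableHash(fileId) % instanceCount == myIndex;
        }
    }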

Get a bigger machine

If you still want to use your own caching mechanism, there are two things that I think might be the bottleneck:

  • Disk access
  • JSON deserialization

For these problems, the only remedy I can think of is more resources:

  • If it's the disk, use Premium Storage with SSDs.
  • If it's deserialization, get a VM with more CPU.

Cache the files in RAM

Alternatively, instead of writing the files to disk, keep them in memory and get a VM with plenty of RAM. You don't need that much RAM: even at the upper bound, 1000 files * 10 MB is still only about 10 GB. This eliminates both the disk access and the repeated deserialization.
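A minimal sketch of such an in-memory cache (the generic type and the loader delegate are assumptions; caching the already-deserialized objects is what removes the JSON parsing cost):

    using System;
    using System.Collections.Concurrent;

    // Caches already-deserialized objects, so a hit costs neither disk
    // I/O nor JSON parsing. T would be your metadata type.
    public class InMemoryCache<T>
    {
        private readonly ConcurrentDictionary<string, Lazy<T>> _cache =
            new ConcurrentDictionary<string, Lazy<T>>();
        private readonly Func<string, T> _loader;

        public InMemoryCache(Func<string, T> loader)
        {
            _loader = loader; // goes to RavenDB on a cache miss
        }

        public T GetOrLoad(string id)
        {
            // Lazy<T> is thread-safe by default, so the loader runs only
            // once per key even when 20-30 threads ask at the same moment.
            return _cache.GetOrAdd(id, key => new Lazy<T>(() => _loader(key))).Value;
        }

        // Call this when the tiny change-marker record says the data changed.
        public void Invalidate(string id)
        {
            Lazy<T> removed;
            _cache.TryRemove(id, out removed);
        }
    }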

But in the end, it’s best to first determine where the bottleneck is and see if it can be mitigated using the built-in RavenDB caching mechanism.


Source: https://habr.com/ru/post/1013406/

