Implement a high-performance distributed file system / database

I need to implement the fastest possible way to store key/value pairs in a distributed system on Linux. Database entries average 256 bytes.

I am going to use the open(), write(), and read() system calls and write key-value pairs directly at some offset in the file. I can omit the fdatasync() system call since I will be using a battery-backed SSD, so I don't have to worry about durability (the D in ACID) if the system shuts down unexpectedly. Linux already provides a disk cache, so reads and writes will not touch the disk for sectors that have already been loaded into memory. This (I think) would be the fastest way to store data, much faster than any database engine that maintains its own cache, such as GT.M or InterSystems Globals.

However, the data is not replicated. To achieve replication, I can mount the file system of another Linux server over NFS and copy the data there. For example, with 2 data servers (1 local and 1 remote) I would issue 2 open() calls, 2 write() calls, and 2 close() calls. If the transaction failed on the remote server, I would mark it as "not synchronized" and simply copy the whole good file over again once the remote server comes back.

What do you think of this approach? Will it be fast? I can use NFS over UDP, so I will avoid the overhead of the TCP stack.

The list of benefits so far is as follows:

  • Reusing the Linux disk cache
  • Few lines of code
  • High performance

I will write this in C. To find an entry in the file, I will keep a B-tree in memory with a pointer to its physical location.
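A full B-tree is beyond a sketch, so the in-memory index idea can be illustrated with a sorted array and binary search standing in for the tree; it maps a key hash to the physical file offset of the record. All names here are illustrative, and a real implementation would also need to handle hash collisions:

```c
/* Sketch of the in-memory index: key -> physical offset of the record.
 * A sorted array + binary search stands in for the proposed B-tree. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

#define MAX_ENTRIES 1024

struct index_entry {
    uint64_t key_hash;  /* hash of the key */
    off_t    offset;    /* physical offset of the 256-byte record */
};

static struct index_entry idx[MAX_ENTRIES];
static size_t idx_count = 0;

/* FNV-1a, a common simple hash for short string keys. */
static uint64_t hash_key(const char *key)
{
    uint64_t h = 1469598103934665603ULL;
    while (*key) {
        h ^= (unsigned char)*key++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Binary search by hash; returns the offset, or -1 if absent. */
off_t index_lookup(const char *key)
{
    uint64_t h = hash_key(key);
    size_t lo = 0, hi = idx_count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (idx[mid].key_hash == h)
            return idx[mid].offset;
        if (idx[mid].key_hash < h)
            lo = mid + 1;
        else
            hi = mid;
    }
    return (off_t)-1;
}

/* Insert keeping the array sorted (O(n) shift; a B-tree avoids this). */
int index_insert(const char *key, off_t offset)
{
    if (idx_count == MAX_ENTRIES)
        return -1;
    uint64_t h = hash_key(key);
    size_t pos = 0;
    while (pos < idx_count && idx[pos].key_hash < h)
        pos++;
    memmove(&idx[pos + 1], &idx[pos], (idx_count - pos) * sizeof(idx[0]));
    idx[pos].key_hash = h;
    idx[pos].offset = offset;
    idx_count++;
    return 0;
}
```

Lookup is then: hash the key, find the offset in memory, and pread() the 256-byte record at that offset.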

+4
2 answers

Several considerations come to mind.

  • Do you really need an open()/write()/close() per transaction? The overhead of the open() system call in particular is probably nontrivial.

  • Can you use mmap () instead of explicit write () s?

  • If you make 2 write() calls (1 local, 1 NFS) for each transaction, any network problem (latency, dropped packets, etc.) can stall the application whenever you wait for the NFS write() to succeed. And if you don't wait, e.g. by issuing the NFS writes from a separate thread, your complexity will grow rapidly (I don't think "few lines of code" will remain true).

In general, I would suggest that you actually prove that the available tools do not meet your performance requirements before choosing to reinvent this particular wheel.

+3

You could look into a real distributed file system rather than using NFS, which, as you point out, still leaves you with a single point of failure and no real replication.

The Andrew File System (AFS), originally developed at CMU, may be the solution for you. It is a commercial product, but you can check out OpenAFS, which runs on Linux (and other systems).

Warning: AFS has a learning curve.

+1

Source: https://habr.com/ru/post/1390288/
