Hierarchy of files for storing images on a social networking website?

What type of file system is useful for storing images on a social network website for about 50 thousand users?

Specifically, how should I lay out the directories? What should the folder hierarchy for storing images look like (for example, per album or per user)?

I know that Facebook uses Haystack now, but before that it used plain NFS. What was the NFS hierarchy?

1 answer

There is no "best" way to do this from a file-system point of view. NFS, for example, has no established "hierarchy" beyond the directories you create inside the NFS share where you write the photos.

Each type of underlying file system (not NFS; I mean the server-side file system that NFS serves the files from) has its own performance characteristics, but they will probably all be relatively fast (O(1), or at least O(log(n))) at looking up a file in a directory. For that reason, you can use basically any directory structure you want and get "not terrible" performance. So you should decide based on what makes your application easiest to write and maintain, especially since you have a relatively small number of users right now.

That said, if I were trying to solve this problem and wanted a relatively simple solution, I would probably give each photo a long random number in hexadecimal (for example, b16eabce1f694f9bb754f3d84ba4b73e) or use a checksum of the photo (for example, the output of running md5 / md5sum on the photo file, such as 5983392e6eaaf5fb7d7ec95357cf0480), and then split it into a directory prefix and a "filename" suffix, for example 5983392e6/eaaf5fb7d7ec95357cf0480.jpg. How far into the number you place the split determines how many files end up in each directory. I would then store the number / checksum as a column in the database table you use to keep track of uploaded photos.
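The scheme above can be sketched in a few lines of Python. This is only an illustration, not a definitive implementation: the function names, the default prefix_len of 9 (taken from the example path above), and the fixed .jpg extension are all assumptions made for the sketch.

```python
import hashlib
import secrets


def random_photo_id() -> str:
    """Return a random 128-bit identifier as 32 hex characters."""
    return secrets.token_hex(16)


def checksum_photo_id(data: bytes) -> str:
    """Return the MD5 checksum of the photo bytes as 32 hex characters."""
    return hashlib.md5(data).hexdigest()


def relative_path(photo_id: str, prefix_len: int = 9, ext: str = ".jpg") -> str:
    """Split the identifier into '<directory prefix>/<filename suffix><ext>'.

    prefix_len is a hypothetical tuning knob: it decides where the split
    happens and therefore how many files share one directory.
    """
    if not 0 < prefix_len < len(photo_id):
        raise ValueError("prefix_len must fall strictly inside the identifier")
    return f"{photo_id[:prefix_len]}/{photo_id[prefix_len:]}{ext}"
```

With hexadecimal identifiers, a prefix of n characters allows up to 16^n directories, so a short prefix such as 2 or 3 characters (256 or 4096 directories) may be enough for a site of this size; the 9-character split simply mirrors the example path.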

The trade-offs between the two approaches are mostly about performance: generating random numbers is much faster than computing checksums, but checksums let you notice when several identical photos have been uploaded and save storage space (if that is likely to be common on your site, which I have no idea about :-)). Cryptographically secure checksums also produce very evenly distributed values, so you can be confident that you will not end up with an artificially large number of photos in one particular directory (even if an attacker knows which checksum algorithm you use).
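To make the deduplication point concrete, here is a minimal sketch that records uploads in a database table keyed by checksum and reports whether the file still needs to be written to storage. The table layout, the column names, and the in-memory SQLite database are assumptions chosen only to keep the example self-contained.

```python
import hashlib
import sqlite3
from typing import Tuple


def open_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with a hypothetical photos table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE photos ("
        " id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL,"
        " checksum TEXT NOT NULL)"
    )
    # Index the checksum column so the duplicate lookup stays cheap.
    conn.execute("CREATE INDEX photos_checksum ON photos (checksum)")
    return conn


def record_upload(conn: sqlite3.Connection, user_id: int, data: bytes) -> Tuple[str, bool]:
    """Record one upload; return (checksum, needs_write).

    needs_write is False when an identical photo is already stored,
    so the caller can skip writing the file a second time.
    """
    checksum = hashlib.md5(data).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM photos WHERE checksum = ? LIMIT 1", (checksum,)
    ).fetchone()
    conn.execute(
        "INSERT INTO photos (user_id, checksum) VALUES (?, ?)", (user_id, checksum)
    )
    return checksum, row is None
```

Every upload still gets its own row, so each user keeps a reference to the photo, while the file itself only has to exist once on disk.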

If you ever find that your chosen split point no longer scales because it puts too many files in each directory, you can simply add another level of directory nesting, for example by switching from 5983392e6/eaaf5fb7d7ec95357cf0480.jpg to 5983392e6/eaaf5fb7/d7ec95357cf0480.jpg. In addition, if your single NFS server can no longer handle the load, you can use the prefix to distribute photos across multiple NFS servers rather than just across multiple directories.
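Both ideas, deeper nesting and spreading photos over several servers, can be sketched as follows. The split points, the server names, and the choice of the first four hex digits as the sharding key are hypothetical and only illustrate one way to do it.

```python
from typing import Sequence


def nested_path(photo_id: str, split_points: Sequence[int] = (9,), ext: str = ".jpg") -> str:
    """Cut the identifier at each split point to build nested directories.

    split_points must be increasing offsets into photo_id; (9,) gives one
    directory level and (9, 17) gives two.
    """
    parts = []
    start = 0
    for point in split_points:
        parts.append(photo_id[start:point])
        start = point
    parts.append(photo_id[start:] + ext)
    return "/".join(parts)


def server_for(photo_id: str, servers: Sequence[str]) -> str:
    """Pick a server from the leading hex digits of the identifier.

    Assumption: plain modulo over the first four hex digits. Changing the
    number of servers remaps most photos under this simple rule.
    """
    return servers[int(photo_id[:4], 16) % len(servers)]
```

For example, nested_path("5983392e6eaaf5fb7d7ec95357cf0480", (9, 17)) produces the two-level path from the paragraph above. Because the identifier is stored in the database, files can be moved to a new layout by recomputing the path for each row.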


Source: https://habr.com/ru/post/1384872/

