An efficient way to store a list of URLs

I need to save a trillion lists of URLs, where each list contains ~50 URLs. What would be the most economical way to compress this data for storage on disk?

My first thought was to strip redundant information such as "http://", then build a minimal state machine over the remaining strings and save that.
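Roughly what I have in mind, as a sketch: a plain character trie standing in for the minimal state machine (a real minimized automaton, a DAWG, would also merge shared suffixes; the example URLs are made up):

    # Insert URLs into a character trie so common prefixes are stored once.
    def insert(trie, url):
        node = trie
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = True  # sentinel marking the end of a URL

    def count_nodes(trie):
        # Count trie nodes, skipping the end-of-URL sentinel.
        return 1 + sum(count_nodes(child)
                       for key, child in trie.items() if key != "$")

    trie = {}
    for url in ["www.example.com/a", "www.example.com/b"]:  # "http://" stripped
        insert(trie, url)

    print(count_nodes(trie))  # 19 nodes for 34 input characters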

Another option is to join the URLs into one comma-separated string and compress that string with a general-purpose compressor such as GZIP or BZ2.

If speed is not a concern, which approach will give better compression?

+4
2 answers

Tokenize the URLs and build a dictionary that maps each distinct token to a small integer. Then store each URL as its sequence of token IDs rather than as raw text. For example:

http://www.google.com/search?q=hello+world (42 characters) becomes:

http:// => 1
www. => 2
google.com => 3
search => 4
hello => 5
world => 6

The URL can then be stored as the sequence: 1, 2, 3, '/', 4, '?', 'q', '=', 5, '+', 6.

Since URLs share most of their parts (protocol, domains, common path and query words), the dictionary stays small relative to the data (on the order of 50,000 to 70,000 entries).

You store the dictionary once, and every URL becomes a short sequence of small integers.

Building the dictionary and encoding all the URLs costs between O(n) and O(n log n) in the number of URLs.
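A minimal sketch of this scheme in Python, assuming a crude tokenizer that splits on common URL delimiters (the hand-worked example above additionally treats "http://" and "www." as single tokens, which a real tokenizer would special-case):

    import re

    DELIMS = re.compile(r"([/?=+&])")

    def encode(urls):
        ids = {}        # token -> small integer ID, assigned in first-seen order
        streams = []
        for url in urls:
            stream = []
            for part in DELIMS.split(url):
                if not part:
                    continue          # re.split emits empty strings between delimiters
                if DELIMS.fullmatch(part):
                    stream.append(part)   # keep delimiters literally
                else:
                    stream.append(ids.setdefault(part, len(ids) + 1))
            streams.append(stream)
        return ids, streams

    ids, streams = encode(["http://www.google.com/search?q=hello+world"])
    print(streams[0])  # [1, '/', '/', 2, '/', 3, '?', 4, '=', 5, '+', 6]

The dictionary is written out once; the integer streams are what you then serialize compactly or feed to a general-purpose compressor.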

+1

Believe it or not, plain GZIP compression will work just fine here, and it's simple!
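If in doubt, measure it. A quick sketch comparing the question's two off-the-shelf options (the URL list below is a placeholder, substitute a real sample):

    import bz2
    import gzip

    urls = ["http://www.google.com/search?q=hello+world"] * 50  # placeholder sample
    blob = ",".join(urls).encode("utf-8")

    print("raw :", len(blob))
    print("gzip:", len(gzip.compress(blob, 9)))
    print("bz2 :", len(bz2.compress(blob, 9)))

On highly repetitive input like URL lists both shrink the data dramatically; run this on a real sample to see which wins for your data.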

0

Source: https://habr.com/ru/post/1530666/

