An efficient way to store a list of URLs

I need to save a trillion lists of URLs, where each list contains ~50 URLs. What would be the most economical way to compress this data for storage on disk?

My first thought was to strip redundant information such as "http://", then build a minimal state machine over the remaining strings and save that.
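Roughly what I have in mind, as a sketch: a plain character trie standing in for the minimal state machine (a real minimized automaton, a DAWG, would also merge shared suffixes; the example URLs are made up):

    # Insert URLs into a character trie so common prefixes are stored once.
    def insert(trie, url):
        node = trie
        for ch in url:
            node = node.setdefault(ch, {})
        node["$"] = True  # sentinel marking the end of a URL

    def count_nodes(trie):
        # Count trie nodes, skipping the end-of-URL sentinel.
        return 1 + sum(count_nodes(child)
                       for key, child in trie.items() if key != "$")

    trie = {}
    for url in ["www.example.com/a", "www.example.com/b"]:  # "http://" stripped
        insert(trie, url)

    print(count_nodes(trie))  # 19 nodes for 34 input characters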

Another option is to join the URLs into one comma-separated string and compress that string with a general-purpose compressor such as GZIP or BZ2.

If speed is not a concern, which approach will give better compression?

+4
2 answers

Tokenize the URLs and build a dictionary that maps each distinct token to a small integer. Then store each URL as its sequence of token IDs rather than as raw text. For example:

http://www.google.com/search?q=hello+world (42 characters) becomes:

http:// => 1
www. => 2
google.com => 3
search => 4
hello => 5
world => 6

The URL can then be stored as the sequence: 1, 2, 3, '/', 4, '?', 'q', '=', 5, '+', 6.

Since URLs share most of their parts (protocol, domains, common path and query words), the dictionary stays small relative to the data (on the order of 50,000 to 70,000 entries).

You store the dictionary once, and every URL becomes a short sequence of small integers.

Building the dictionary and encoding all the URLs costs between O(n) and O(n log n) in the number of URLs.
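A minimal sketch of this scheme in Python, assuming a crude tokenizer that splits on common URL delimiters (the hand-worked example above additionally treats "http://" and "www." as single tokens, which a real tokenizer would special-case):

    import re

    DELIMS = re.compile(r"([/?=+&])")

    def encode(urls):
        ids = {}        # token -> small integer ID, assigned in first-seen order
        streams = []
        for url in urls:
            stream = []
            for part in DELIMS.split(url):
                if not part:
                    continue          # re.split emits empty strings between delimiters
                if DELIMS.fullmatch(part):
                    stream.append(part)   # keep delimiters literally
                else:
                    stream.append(ids.setdefault(part, len(ids) + 1))
            streams.append(stream)
        return ids, streams

    ids, streams = encode(["http://www.google.com/search?q=hello+world"])
    print(streams[0])  # [1, '/', '/', 2, '/', 3, '?', 4, '=', 5, '+', 6]

The dictionary is written out once; the integer streams are what you then serialize compactly or feed to a general-purpose compressor.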

+1

Believe it or not, plain GZIP compression will work just fine here, and it's simple!
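If in doubt, measure it. A quick sketch comparing the question's two off-the-shelf options (the URL list below is a placeholder, substitute a real sample):

    import bz2
    import gzip

    urls = ["http://www.google.com/search?q=hello+world"] * 50  # placeholder sample
    blob = ",".join(urls).encode("utf-8")

    print("raw :", len(blob))
    print("gzip:", len(gzip.compress(blob, 9)))
    print("bz2 :", len(bz2.compress(blob, 9)))

On highly repetitive input like URL lists both shrink the data dramatically; run this on a real sample to see which wins for your data.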

0

Source: https://habr.com/ru/post/1530666/

