I need to store a trillion lists of URLs, where each list contains ~50 URLs. What would be the most economical way to compress this data for storage on disk?
I thought about first stripping redundant information such as "http://" and then building a minimal state machine and saving that.
Another option is to build a comma-separated string of the URLs and compress that string with a general-purpose compressor such as GZIP or BZ2.
If I don't need speed, which solution will give better compression?
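For the second option, this is roughly what I have in mind, as a minimal sketch using Python's built-in gzip and bz2 modules (the comma separator and the sample URLs are just placeholders):

import gzip
import bz2

def compress_url_list(urls):
    # Join one list of URLs with commas and compress the result
    # with both codecs so their output sizes can be compared.
    blob = ",".join(urls).encode("utf-8")
    return gzip.compress(blob), bz2.compress(blob)

# Toy example; a real list would hold ~50 URLs.
urls = [
    "http://www.google.com/search?q=hello+world",
    "http://www.google.com/search?q=hello",
]
gz, bz = compress_url_list(urls)
print(len(gz), len(bz))  # compressed sizes in bytes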
URLs have a lot of repeated structure, so split each URL into tokens and replace every token with an index into a shared dictionary. For example:
http://www.google.com/search?q=hello+world (42 characters) becomes
http:// => 1, www. => 2, google.com => 3, search => 4, hello => 5, world => 6
and the URL itself is stored as: 1, 2, 3, '/', 4, '?', 'q', '=', 5, '+', 6
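Something along these lines; a minimal sketch where the splitting regex and the dictionary layout are just one possible choice, not a fixed part of the scheme:

import re

def tokenize(url):
    # Split on the scheme, "www.", and single-character separators,
    # keeping the separators themselves as tokens.
    return [t for t in re.split(r"(https?://|www\.|[/?&=+])", url) if t]

def encode(urls, dictionary):
    # Replace multi-character tokens by dictionary indices;
    # single-character separators are kept literally.
    encoded = []
    for url in urls:
        out = []
        for tok in tokenize(url):
            if len(tok) == 1:
                out.append(tok)
            else:
                out.append(dictionary.setdefault(tok, len(dictionary) + 1))
        encoded.append(out)
    return encoded

dictionary = {}
print(encode(["http://www.google.com/search?q=hello+world"], dictionary))
# [[1, 2, 3, '/', 4, '?', 'q', '=', 5, '+', 6]]
print(dictionary)
# {'http://': 1, 'www.': 2, 'google.com': 3, 'search': 4, 'hello': 5, 'world': 6}

With a trillion lists you would stream this instead of holding everything in memory, but the encoding idea stays the same.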
Since tokens repeat heavily across URLs (domain names, common path segments, query keys), the dictionary stays small relative to the data, on the order of 50,000-70,000 entries.
You store the dictionary once and then only the index sequences for each list.
Building the dictionary and re-encoding all the URLs can be done in O(n) or O(n log n) over the URL list.
And after that, you can still run GZIP over the encoded output for additional compression!
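A rough sketch of that final step, assuming a simple text serialization of the index sequences (one list per line); the exact on-disk format is an implementation detail:

import gzip

def serialize(encoded_lists):
    # One encoded URL list per line; indices become decimal numbers,
    # single-character separators are written as-is.
    lines = (" ".join(str(x) for x in seq) for seq in encoded_lists)
    return "\n".join(lines).encode("utf-8")

encoded = [[1, 2, 3, '/', 4, '?', 'q', '=', 5, '+', 6]]
blob = serialize(encoded)
print(len(blob), len(gzip.compress(blob)))  # raw size vs. gzipped size

On a toy input the gzip header outweighs any savings; the gain only shows up once many lists share the same dictionary indices.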
Source: https://habr.com/ru/post/1530666/