Can I read a giant text file using Parallel Computing?

I have several text files of about 2 GB each (roughly 70 million lines). I also have a quad-core machine and access to the Parallel Computing Toolbox.

Typically, you would open the file and read it line by line like this:

    f = fopen('file.txt');
    l = fgets(f);
    while ischar(l)   % fgets returns -1 (not a char) at end of file
        % do something with l
        l = fgets(f);
    end
    fclose(f);

I would like to distribute "do something with l" across my 4 cores, but that of course requires a parfor loop. That in turn would require me to "slurp" the 2 GB file (to borrow a Perl term) into MATLAB beforehand instead of processing it on the fly. I don't actually need l, just the result of the processing.

Is there a way to read lines from a text file with parallel computing?

EDIT: It's worth noting that I can find the exact number of lines ahead of time ( !wc -l mygiantfile.txt ).

EDIT2: The file structure is as follows:

 15 1180 62444 e0e0 049c f3ec 104 

So, 3 decimal numbers, 3 hexadecimal numbers and 1 decimal number. Repeat this for 70 million lines.

2 answers

As requested, here is an example of memory-mapped files using memmapfile.

Since you did not specify the exact data file format, I will create my own. The data I create is a table of N rows, each of which consists of 4 columns:

  • the first is a double scalar value
  • the second is a single value
  • the third is a fixed-length string representing a uint32 in hex notation (for example: D091BB44)
  • the fourth is a uint8 value

Code to generate random data and write it to a binary file, structured as described above:

    % random data
    N = 10;
    data = [...
        num2cell(rand(N,1)), ...
        num2cell(rand(N,1,'single')), ...
        cellstr(dec2hex(randi(intmax('uint32'), [N,1]),8)), ...
        num2cell(randi([0 255], [N,1], 'uint8')) ...
    ];

    % write to binary file
    fid = fopen('file.bin', 'wb');
    for i=1:N
        fwrite(fid, data{i,1}, 'double');
        fwrite(fid, data{i,2}, 'single');
        fwrite(fid, data{i,3}, 'char');
        fwrite(fid, data{i,4}, 'uint8');
    end
    fclose(fid);

Here is the resulting file viewed in a hex editor:

[Image: binary file viewed in a hex editor]

We can confirm the first record (note that my system uses little-endian byte order):

    >> num2hex(data{1,1})
    ans =
    3fd4d780d56f2ca6

    >> num2hex(data{1,2})
    ans =
    3ddd473e

    >> arrayfun(@dec2hex, double(data{1,3}), 'UniformOutput',false)
    ans =
        '46'    '35'    '36'    '32'    '37'    '35'    '32'    '46'

    >> dec2hex(data{1,4})
    ans =
    C0

Next, we open the file using memory mapping:

    m = memmapfile('file.bin', 'Offset',0, 'Repeat',Inf, 'Writable',false, ...
        'Format',{
            'double', [1 1], 'd';
            'single', [1 1], 's';
            'uint8' , [1 8], 'h';    % since it doesn't directly support char
            'uint8' , [1 1], 'i'});

Now we can access the records as an ordinary structure array:

    >> rec = m.Data;      % 10x1 struct array

    >> rec(1)             % same as: data(1,:)
    ans =
        d: 0.3257
        s: 0.1080
        h: [70 53 54 50 55 53 50 70]
        i: 192

    >> rec(4).d           % same as: data{4,1}
    ans =
        0.5799

    >> char(rec(10).h)    % same as: data{10,3}
    ans =
    2B2F493F

The advantage is that for large data files you can restrict the mapped "viewing window" to a small subset of the records and move this window along the file:

    % read the records two at-a-time
    numRec = 10;                       % total number of records
    lenRec = 8*1 + 4*1 + 1*8 + 1*1;    % length of each record in bytes
    numRecPerView = 2;                 % how many records in a viewing window

    m.Repeat = numRecPerView;
    for i=1:(numRec/numRecPerView)
        % move the window along the file
        m.Offset = (i-1) * numRecPerView*lenRec;

        % read the two records in this window:
        %for j=1:numRecPerView, m.Data(j), end
        m.Data(1)
        m.Data(2)
    end

[Image: accessing a portion of a file using memory mapping]
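To tie the moving window back to the parallel part of the question, here is a minimal sketch. It assumes each parfor iteration can create its own read-only memmapfile over a disjoint range of records. The file name demo_records.bin, the chunk count, and the per-chunk work (summing the d field) are illustrative assumptions and not part of the original answer.

```matlab
% Write a small demo file with the same 21-byte record layout as above.
fname = 'demo_records.bin';          % hypothetical file name
N = 10;
d = rand(N,1);
fid = fopen(fname, 'wb');
for i = 1:N
    fwrite(fid, d(i), 'double');          % 8 bytes
    fwrite(fid, single(i), 'single');     % 4 bytes
    fwrite(fid, dec2hex(i,8), 'uint8');   % 8 bytes (ASCII hex digits)
    fwrite(fid, uint8(i), 'uint8');       % 1 byte
end
fclose(fid);

fmt = {'double',[1 1],'d'; 'single',[1 1],'s'; ...
       'uint8',[1 8],'h'; 'uint8',[1 1],'i'};
lenRec = 8 + 4 + 8 + 1;              % bytes per record
info = dir(fname);
numRec = info.bytes / lenRec;        % total number of records
numChunks = 5;                       % assumed to divide numRec evenly
recPerChunk = numRec / numChunks;

total = zeros(numChunks,1);
parfor k = 1:numChunks
    % each iteration maps only its own window of records
    mk = memmapfile(fname, 'Format',fmt, 'Writable',false, ...
        'Offset',(k-1)*recPerChunk*lenRec, 'Repeat',recPerChunk);
    recs = mk.Data;
    total(k) = sum([recs.d]);        % placeholder for the real processing
end
```

If no parallel pool is available, parfor runs the iterations serially, so the sketch can be tried without the toolbox.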


Some MATLAB built-in functions support multithreading; the list is here. Those do not need the Parallel Computing Toolbox.

If "do something with l" can benefit from the toolbox, just run that function on each line before reading the next one.

You can also read the entire file using

    fid = fopen('textfile.txt');
    C = textscan(fid,'%s','delimiter','\n');
    fclose(fid);

and then process the cells of C in parallel.
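As a rough sketch of that second step, assuming the lines are already held in a cell array and follow the format from the question (3 decimal, 3 hexadecimal, 1 decimal). The first sample line is the one from the question; the second is made up for illustration, and the parsing itself stands in for whatever "do something with l" really is.

```matlab
% A tiny stand-in for C{1} as returned by textscan
lines = { ...
    '15 1180 62444 e0e0 049c f3ec 104'; ...
    '16 1181 62445 e0e1 049d f3ed 105'};   % invented sample line

vals = zeros(numel(lines), 7);
parfor k = 1:numel(lines)
    % %d reads a decimal integer, %x a hexadecimal one
    vals(k,:) = sscanf(lines{k}, '%d %d %d %x %x %x %d')';
end
```

Note that this still holds every line in memory at once, which is exactly what the question hoped to avoid for a 2 GB file.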


If read time is the key issue, you can also access different parts of the data file inside a parfor loop. Here is an example from Edric M. Ellis.

    % Some data
    x = rand(1000, 10);
    fh = fopen( 'tmp.bin', 'wb' );
    fwrite( fh, x, 'double' );
    fclose( fh );

    % Read the data
    y = zeros(1000, 10);
    parfor ii = 1:10
        fh = fopen( 'tmp.bin', 'rb' );
        % Get to the correct spot in the file:
        offset_bytes = (ii-1) * 1000 * 8; % 8 bytes/double
        fseek( fh, offset_bytes, 'bof' );
        % read a column
        y(:,ii) = fread( fh, 1000, 'double' );
        fclose( fh );
    end

    % Check
    assert( isequal( x, y ) );
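The same seek-per-worker idea might be adapted to a text file, where lines have variable length. The following is only a sketch under assumptions: the file name demo_lines.txt and the per-line work are invented for illustration, and a line is assigned to the chunk in which its first byte falls, so each worker first skips ahead to the next line boundary.

```matlab
% Create a small demo text file in the question's line format
fname = 'demo_lines.txt';            % hypothetical file name
fid = fopen(fname, 'w');
for k = 1:1000
    fprintf(fid, '%d %d %d %x %x %x %d\n', k, 2*k, 3*k, k, 2*k, 3*k, k);
end
fclose(fid);

info = dir(fname);
numChunks = 4;
edges = round(linspace(0, info.bytes, numChunks+1));  % byte boundaries

counts = zeros(numChunks,1);
sums   = zeros(numChunks,1);
parfor c = 1:numChunks
    fh = fopen(fname, 'r');
    startPos = edges(c);
    if startPos > 0
        % back up one byte, then discard up to the next newline so that
        % reading starts at the first line beginning at or after startPos
        fseek(fh, startPos-1, 'bof');
        fgetl(fh);
    end
    n = 0; s = 0;
    while ftell(fh) < edges(c+1)     % line starts inside this chunk
        l = fgetl(fh);
        if ~ischar(l), break; end    % end of file
        v = sscanf(l, '%d %d %d %x %x %x %d');
        s = s + v(1);                % placeholder for the real processing
        n = n + 1;
    end
    fclose(fh);
    counts(c) = n;
    sums(c) = s;
end
```

Each worker only ever holds one line in memory, so nothing has to be slurped up front; whether the disk can actually serve four concurrent readers faster than one is a separate question worth measuring.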

Source: https://habr.com/ru/post/1500577/