Matlab: How to read in numbers with a comma as the decimal separator?

I have many (hundreds of thousands) fairly large (> 0.5 MB) files containing numerical data, but with a comma as the decimal separator. Using an external tool, for example sed "s/,/./g", is not practical for me. When the separator is a dot, I simply use textscan(fid, '%f%f%f'), but I don’t see any option for changing the decimal separator. How can I read such files efficiently?

Example line from file:

 5,040000 18,040000 -0,030000 

Note: there is a similar question for R, but I am using Matlab.

4 answers

With a test script, I found a slowdown factor of less than 1.5 for reading comma decimals with txt2mat, compared to reading dot decimals with textscan. My code would look like this:

    tmco = {'NumHeaderLines', 1      , ...
            'NumColumns'    , 5      , ...
            'ConvString'    , '%f'   , ...
            'InfoLevel'     , 0      , ...
            'ReadMode'      , 'block', ...
            'ReplaceChar'   , {',.'} } ;

    A = txt2mat(filename, tmco{:});

Note the different 'ReplaceChar' value (a cell array, {',.'}) and the 'ReadMode' value 'block'.

I get the following results for a 5 MB file on my (not too new) machine (average time per read in seconds, over 20 runs):

  • txt2mat test comma avg. time: 0.63231
  • txt2mat test dot avg. time: 0.45715
  • textscan test dot avg. time: 0.4787
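To make the "less than 1.5" figure concrete, here is a small sketch that computes the ratios from the three averages listed above. The variable names are only illustrative.

```matlab
% Average read times in seconds, copied from the results listed above.
tTxt2matComma = 0.63231;   % txt2mat, comma as decimal separator
tTxt2matDot   = 0.45715;   % txt2mat, dot as decimal separator
tTextscanDot  = 0.4787;    % textscan, dot as decimal separator

% Slowdown of the comma import relative to the two dot imports.
factorVsTextscan = tTxt2matComma / tTextscanDot;   % about 1.32
factorVsTxt2mat  = tTxt2matComma / tTxt2matDot;    % about 1.38

fprintf('comma (txt2mat) vs. dot (textscan): %.2f\n', factorVsTextscan);
fprintf('comma (txt2mat) vs. dot (txt2mat):  %.2f\n', factorVsTxt2mat);
```

Both ratios stay below 1.5 for these particular measurements; other machines and file sizes may behave differently.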

Full code of my test script:

    %% generate sample files

    fdot = 'C:\temp\cDot.txt';
    fcom = 'C:\temp\cCom.txt';

    c = 5;       % # columns
    r = 100000;  % # rows
    test = round(1e8*rand(r,c))/1e6;
    tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.');
    tdot = ['a header line', char([13,10]), tdot];

    tcom = strrep(tdot,'.',',');

    % write dot file
    fid = fopen(fdot,'w');
    fprintf(fid, '%s', tdot);
    fclose(fid);
    % write comma file
    fid = fopen(fcom,'w');
    fprintf(fid, '%s', tcom);
    fclose(fid);

    disp('-----')

    %% read back sample files with txt2mat and textscan

    % txt2mat-options with comma decimal sep.
    tmco = {'NumHeaderLines', 1      , ...
            'NumColumns'    , 5      , ...
            'ConvString'    , '%f'   , ...
            'InfoLevel'     , 0      , ...
            'ReadMode'      , 'block', ...
            'ReplaceChar'   , {',.'} } ;

    % txt2mat-options with dot decimal sep.
    tmdo = {'NumHeaderLines', 1      , ...
            'NumColumns'    , 5      , ...
            'ConvString'    , '%f'   , ...
            'InfoLevel'     , 0      , ...
            'ReadMode'      , 'block'} ;

    % textscan-options
    tsco = {'HeaderLines'   , 1      , ...
            'CollectOutput' , true   } ;

    A = txt2mat(fcom, tmco{:});
    B = txt2mat(fdot, tmdo{:});

    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};

    disp(['txt2mat  test comma (1=Ok): ' num2str(isequal(A,test)) ])
    disp(['txt2mat  test dot   (1=Ok): ' num2str(isequal(B,test)) ])
    disp(['textscan test dot   (1=Ok): ' num2str(isequal(C,test)) ])
    disp('-----')

    %% speed test

    numTest = 20;

    % A) txt2mat with comma
    tic
    for k = 1:numTest
        A = txt2mat(fcom, tmco{:});
        clear A
    end
    ttmc = toc;
    disp(['txt2mat  test comma avg. time: ' num2str(ttmc/numTest) ])

    % B) txt2mat with dot
    tic
    for k = 1:numTest
        B = txt2mat(fdot, tmdo{:});
        clear B
    end
    ttmd = toc;
    disp(['txt2mat  test dot   avg. time: ' num2str(ttmd/numTest) ])

    % C) textscan with dot
    tic
    for k = 1:numTest
        fid = fopen(fdot);
        C = textscan(fid, repmat('%f',1,c) , tsco{:} );
        fclose(fid);
        C = C{1};
        clear C
    end
    ttsc = toc;
    disp(['textscan test dot   avg. time: ' num2str(ttsc/numTest) ])
    disp('-----')

You can use txt2mat (available on the MATLAB File Exchange).

 A = txt2mat('data.txt'); 

It handles the data automatically, but you can also specify the replacement explicitly:

 A = txt2mat('data.txt','ReplaceChar',',.'); 

P.S. It may not be efficient, but if you only need it for your specific data format, you can copy the relevant part from the txt2mat source file.


You can try to speed up txt2mat by also passing the number of header lines and, if possible, the number of columns as inputs, so that it can bypass its file analysis. Then there should not be a factor of 25 compared to a textscan import with dot-separated decimals. (You can also contact me via the author’s page on mathworks.) Please let us know if you find a more efficient way to get comma-separated decimals into Matlab.
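A sketch of such a call, assuming a file with no header line and three columns, as in the example line from the question. The option names are the ones used in the test script in the other answer. The temporary file and the values 0 and 3 are assumptions for illustration. Since txt2mat is a separate download, the call is guarded.

```matlab
% Write a one-line sample file (the example line from the question).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '5,040000 18,040000 -0,030000\r\n');
fclose(fid);

% Assumed layout: no header line, three columns. Stating both up front
% is meant to let txt2mat skip its own file analysis.
opts = {'NumHeaderLines', 0      , ...
        'NumColumns'    , 3      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} };

% txt2mat is not part of MATLAB itself, so only call it if it is on the path.
if exist('txt2mat', 'file')
    A = txt2mat(fname, opts{:});
else
    disp('txt2mat not found on the MATLAB path');
end
delete(fname);
```

For files with a header, set 'NumHeaderLines' to the actual number of header lines.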


My solution (it assumes that commas are used only as decimal separators and that columns are delimited by whitespace):

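    fid = fopen('FILENAME');
    indat = fread(fid, '*char').';    % transpose: fread returns a column, the string functions expect a row
    fclose(fid);
    indat = strrep(indat, ',', '.');
    % strread is not recommended in newer releases; textscan(indat, '%f %f') is the replacement
    [colA, colB] = strread(indat, '%f %f');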

If you need to skip a single header line, as I did, this should work:

    fid = fopen('FILENAME');                  % Open file
    indat = fread(fid, '*char').';            % Read the entire file as a row vector of characters
    fclose(fid);                              % Close file
    indat = strrep(indat, ',', '.');          % Replace commas with periods
    endheader = find(indat == char(10), 1);   % Find the first line feed (end of the header line)
    indat = indat(endheader+1:end);           % Keep all characters after the header line
    [colA, colB] = strread(indat, '%f %f');   % Convert text to numerical data
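
The same read, replace, parse idea can also be sketched with fileread and textscan instead of fread and strread. This is a minimal sketch, not a benchmarked solution. It assumes three whitespace-separated columns as in the question's example line and no header line. The temporary file and the second data row are made up for illustration.

```matlab
% Create a small sample file; the first row is the example line from the
% question, the second row is invented for illustration.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '5,040000 18,040000 -0,030000\n');
fprintf(fid, '1,500000 2,250000 -3,750000\n');
fclose(fid);

% Read the whole file as one character row vector.
txt = fileread(fname);

% Replace decimal commas with dots. This is only safe if commas never
% occur as column delimiters or thousands separators.
txt = strrep(txt, ',', '.');

% Parse the text directly; CollectOutput merges the three numeric
% columns into a single matrix.
C = textscan(txt, '%f%f%f', 'CollectOutput', true);
data = C{1};

delete(fname);
disp(data)
```

For a file with header lines, textscan's 'HeaderLines' option (used in the test script in the other answer) could take the place of the manual header removal.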

Source: https://habr.com/ru/post/1382168/

