Converting a list of strings to a numpy array in a faster way

br is the name of a string list that looks like this:

 ['14 0.000000 -- (long term 0.000000)\n', '19 0.000000 -- (long term 0.000000)\n', '22 0.000000 -- (long term 0.000000)\n', ... 

I am interested in the first two columns, which I would like to convert to a numpy array. So far I have come up with the following solution:

 x = N.array([0., 0.])
 for i in br:
     x = N.vstack((x, N.array(map(float, i.split()[:2]))))

The result is a two-dimensional array:

 array([[  0.,   0.],
        [ 14.,   0.],
        [ 19.,   0.],
        [ 22.,   0.],
        ...

However, since br is quite large (~10^5 entries), this procedure takes some time. Is there a way to achieve the same result in less time?

3 answers

This is significantly faster for me:

 import numpy as N

 br = ['14 0.000000 -- (long term 0.000000)\n']*50000
 aa = N.zeros((len(br), 2))
 for i, line in enumerate(br):
     al, strs = aa[i], line.split(None, 2)[:2]
     al[0], al[1] = float(strs[0]), float(strs[1])

Changes:

  • Preallocate the numpy array (this is the big one). You already know that you need a 2-dimensional array of a known size.
  • Only split() off the first two columns, since you don't need the rest.
  • Do not use map(): it is slower than a list comprehension. I didn't even use a list comprehension, because you know that you have only 2 columns.
  • Assign directly into the preallocated array instead of generating new temporary arrays on every iteration.
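
To see the effect of the changes listed above on your own machine, a small comparison along these lines may help. It is only a sketch written for Python 3: the sample data, the function names with_vstack and with_prealloc, and the input size of 2000 lines are all made up for illustration, and no particular timing is claimed. Note that the question's version starts with an extra [0., 0.] row, which the preallocated version does not have.

```python
import timeit
import numpy as np

# Hypothetical sample data in the same shape as the question's `br`.
br = ['%d 0.000000 -- (long term 0.000000)\n' % i for i in range(2000)]

def with_vstack(lines):
    # Question's approach: grow the array row by row (vstack copies each time).
    x = np.array([0., 0.])
    for line in lines:
        x = np.vstack((x, np.array([float(s) for s in line.split()[:2]])))
    return x

def with_prealloc(lines):
    # Answer's approach: allocate once, then fill each row in place.
    out = np.zeros((len(lines), 2))
    for i, line in enumerate(lines):
        first, second = line.split(None, 2)[:2]
        out[i, 0], out[i, 1] = float(first), float(second)
    return out

# Same values, apart from the leading [0., 0.] row of the vstack version.
assert np.array_equal(with_vstack(br)[1:], with_prealloc(br))

for fn in (with_vstack, with_prealloc):
    print(fn.__name__, timeit.timeit(lambda: fn(br), number=3))
```

The gap should widen as the list grows, since the vstack version copies the whole array on every iteration.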

You could try to preprocess the list of strings (with awk, for example) if they come from a file, and then use numpy.loadtxt or numpy.genfromtxt. If you can't do anything about how you get this list, you have several options:

  • Give up. You will run this function once a day, you don't care about speed, and your current solution is good enough.
  • Write an IO plugin using Cython. The potential gain is big, because you can do all the loops in C and write the values directly into a big (10^5, 2) numpy ndarray.
  • Try a different language to solve your problem. If you use a language like C or Haskell, you can use ctypes to call functions compiled into a DLL from Python.
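
For the numpy text-loading route mentioned at the top of this answer, here is a minimal sketch. It assumes numpy.loadtxt is the kind of loader meant there, and it relies on loadtxt accepting a list of strings in place of a file name. The three sample lines are copied from the question. Whether this beats a hand-written loop is not measured here.

```python
import numpy as np

# Sample lines copied from the question.
br = [
    '14 0.000000 -- (long term 0.000000)\n',
    '19 0.000000 -- (long term 0.000000)\n',
    '22 0.000000 -- (long term 0.000000)\n',
]

# usecols keeps only the first two whitespace-separated fields of each line,
# so the non-numeric fields that follow are never converted to float.
x = np.loadtxt(br, usecols=(0, 1))
print(x)
```

With this approach no awk preprocessing is needed when only the leading columns are numeric.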

Edit:

Perhaps this approach is a bit faster:

 def conv(mystr):
     return map(float, mystr.split()[:2])

 br_float = map(conv, br)
 x = N.array(br_float)
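
The snippet above was written for Python 2, where map() returns a list. On Python 3, map() returns a lazy iterator, so the results need to be turned into lists before numpy can build a 2-D array from them. A sketch of the same idea for Python 3 (the sample data is copied from the question, and the list comprehension is my substitution for the outer map()):

```python
import numpy as np

# Sample lines copied from the question.
br = [
    '14 0.000000 -- (long term 0.000000)\n',
    '19 0.000000 -- (long term 0.000000)\n',
    '22 0.000000 -- (long term 0.000000)\n',
]

def conv(line):
    # list(...) forces the lazy Python 3 map object into a real list.
    return list(map(float, line.split()[:2]))

# Build the whole 2-D array in one call instead of stacking row by row.
x = np.array([conv(line) for line in br])
print(x)
```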

Changing

 map(float, i.split()[:2])

to

 map(float, i.split(' ',2)[:2])

may give a slight speed-up. Since you only care about the first two space-separated items in each line, there is no need to split the entire line. The 2 in i.split(' ',2) tells split to make at most 2 splits. For instance,

 In [11]: x='14 0.000000 -- (long term 0.000000)\n'

 In [12]: x.split()
 Out[12]: ['14', '0.000000', '--', '(long', 'term', '0.000000)']

 In [13]: x.split(' ',2)
 Out[13]: ['14', '0.000000', '-- (long term 0.000000)\n']
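
One caveat with split(' ', 2): an explicit ' ' separator treats every single space as a delimiter, so a line with runs of spaces produces empty strings. Passing None as the separator splits on any run of whitespace instead. A short sketch, using a made-up line with three spaces after the first column:

```python
# Hypothetical line with a run of three spaces after the first column.
line = '14   0.000000 -- (long term 0.000000)\n'

by_space = line.split(' ', 2)        # each single space counts as a delimiter
by_whitespace = line.split(None, 2)  # runs of whitespace count as one delimiter

print(by_space)       # ['14', '', ' 0.000000 -- (long term 0.000000)\n']
print(by_whitespace)  # ['14', '0.000000', '-- (long term 0.000000)\n']
```

If the columns in br may be padded with extra spaces, split(None, 2) is the safer choice, since float('') raises a ValueError.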

Source: https://habr.com/ru/post/896335/

