NumPy: convert a string representation of a boolean array to a boolean array

Is there a NumPy-native way to convert an array of string representations of booleans, for example:

['True','False','True','False'] 

into a real boolean array that I can use for masking/indexing? I could loop over the elements and rebuild the array, but for large arrays that is slow.

3 answers

If I understand correctly, you should be able to do an elementwise comparison, whether the dtype is string or object:

    >>> a = np.array(['True', 'False', 'True', 'False'])
    >>> a
    array(['True', 'False', 'True', 'False'], dtype='|S5')
    >>> a == "True"
    array([ True, False, True, False], dtype=bool)

or

    >>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
    >>> a
    array(['True', 'False', 'True', 'False'], dtype=object)
    >>> a == "True"
    array([ True, False, True, False], dtype=bool)
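
To tie this back to the masking/indexing use case in the question, here is a minimal sketch. The values array is made-up data for illustration and is not part of the original question.

```python
import numpy as np

# String representations of booleans, as in the question.
labels = np.array(['True', 'False', 'True', 'False'])

# Hypothetical data to be filtered (illustrative only).
values = np.array([10, 20, 30, 40])

# Elementwise comparison yields a real boolean array.
mask = labels == 'True'

# The boolean array can be used directly for indexing.
selected = values[mask]
print(mask.dtype, selected)
```

Note that any string other than exactly 'True' (for example 'true' or ' True') compares unequal and therefore ends up as False in the mask.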

I found a method that is even faster than DSM's, inspired by Eric's answer, although the improvement is most visible with shorter lists. With very long lists, the cost of the Python-level iteration itself begins to outweigh the advantage of doing the truth check while the NumPy array is being built rather than afterwards. I tested with both the is operator and the == operator, because is only works when the strings are interned and gives wrong results for non-interned strings. (Since 'True' is likely to be a literal in the script, it should be interned, though that is not guaranteed.) My version with == was slower than the one with is, but still much faster than DSM's version.

Test setup:

    import timeit

    def timer(statement, count):
        return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)

    >>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
    >>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
    >>> stateDSM = "y = np.array(x) == 'True'"

With 1000 items, the two faster statements take roughly 65% to 72% of the time of DSM's version:

    >>> timer(stateIs, 1000)
    [101.77722641656146, 100.74985342340369, 101.47228618107965]
    >>> timer(stateEq, 1000)
    [112.26464996250706, 112.50754567379681, 112.76057346127709]
    >>> timer(stateDSM, 1000)
    [155.67689949529995, 155.96820504501557, 158.32394669279802]

For shorter lists (hundreds of strings rather than thousands), the elapsed time is less than 50% of DSM's:

    >>> timer(stateIs, 100)
    [11.947757485669172, 11.927990253608186, 12.057855628259858]
    >>> timer(stateEq, 100)
    [13.064947253943501, 13.161545451986967, 13.30599035623618]
    >>> timer(stateDSM, 100)
    [31.270060799078237, 30.941749748808434, 31.253922641324607]

With 50 items in the list, it is a little over 25% of DSM's time:

    >>> timer(stateIs, 50)
    [6.856538342483873, 6.741083326021908, 6.708402786859551]
    >>> timer(stateEq, 50)
    [7.346079345032194, 7.312723444475523, 7.309259899921017]
    >>> timer(stateDSM, 50)
    [24.154247576229864, 24.173593700599667, 23.946403452288905]

With 5 items, it is about 11% of DSM's time:

    >>> timer(stateIs, 5)
    [1.8826215278058953, 1.850232652068371, 1.8559381315990322]
    >>> timer(stateEq, 5)
    [1.9252821868467436, 1.894011299061276, 1.894306935199893]
    >>> timer(stateDSM, 5)
    [18.060974208809057, 17.916322392367874, 17.8379771602049]
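
The interning caveat mentioned at the top of this answer can be sketched as follows. This is an illustration under stated assumptions, not part of the original benchmark: the names literal, built and x_interned are made up, and sys.intern from the standard library is used to make the identity check well defined.

```python
import sys
import numpy as np

# Explicitly interned reference string.
literal = sys.intern('True')

# Same text, but assembled at runtime, so it is not
# guaranteed to be the same object as the literal.
built = ''.join(['Tr', 'ue'])

x = [built, 'False', literal]

# Equality compares contents, so it does not depend on interning.
y_eq = np.fromiter((e == literal for e in x), dtype=bool, count=len(x))

# Identity is only dependable once every string has been interned.
x_interned = [sys.intern(e) for e in x]
y_is = np.fromiter((e is literal for e in x_interned), dtype=bool, count=len(x))

print(y_eq, y_is)
```

Unless the input strings are known to be interned, the == variant is the safer choice; the timings above suggest it costs only slightly more than the identity check.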

Is this enough?

    import numpy as np
    my_list = ['True', 'False', 'True', 'False']
    np.array([x == 'True' for x in my_list])

This is not NumPy-native, but if you start from a plain Python list anyway, it doesn't really matter.
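
One detail worth checking with this approach: np.array expects a sequence, so a generator expression passed to it is not iterated. The sketch below assumes the same my_list and compares a bare generator, a list comprehension, and np.fromiter; the variable names are illustrative.

```python
import numpy as np

my_list = ['True', 'False', 'True', 'False']

# A bare generator is stored as one Python object: the result is a
# 0-dimensional object array, not a boolean array.
wrapped = np.array(x == 'True' for x in my_list)

# A list comprehension gives np.array a real sequence to convert.
from_list = np.array([x == 'True' for x in my_list])

# np.fromiter is the call that consumes a generator directly.
from_gen = np.fromiter((x == 'True' for x in my_list), dtype=bool)

print(wrapped.shape, wrapped.dtype)
print(from_list, from_gen)
```

Only the last two results are usable as masks.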


Source: https://habr.com/ru/post/946605/

