Is this the correct behavior for read_csv when a data value is "NA"?

(I have also opened an issue on GitHub.)

The following behavior does not seem right to me. If the default for read_csv is na_values=False, then I would expect that no values, including "NA", are interpreted as NaN, but that is not the case.

This came up in this post (see the comments on @JianxunLi's answer), where "NA" actually means "North America". I cannot find a way to read the file without that value being converted to NaN, and surely there must be some way to do this.

Here is an example CSV file.

%more foo.txt
x,y
"NA",NA
"foo",foo

I include "NA" both quoted and unquoted to see whether that matters, but as you can see below, it does not appear to.

pd.read_csv('foo.txt')
Out[56]: 
     x    y
0  NaN  NaN
1  foo  foo

pd.read_csv('foo.txt',na_values=False)
Out[57]: 
     x    y
0  NaN  NaN
1  foo  foo

pd.read_csv('foo.txt',na_values='foo')
Out[58]: 
    x   y
0 NaN NaN
1 NaN NaN

It appears that data values of "NaN" are treated the same way as "NA".

Edit to add: I think I understand this better based on @Marius's answer, although it still does not seem right to me (the default behavior, that is, not Marius's answer, which seems to be the correct explanation of what is happening).

na_values=False    =>   NA and NaN are treated as NaN
na_values='foo'    =>   NA, NaN, and foo are treated as NaN
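
The summary above can be reproduced without a file on disk. This is only a minimal sketch: it assumes the same contents as foo.txt, held in an in-memory string (the names csv_text, default_df and extended_df are made up for illustration), and it suggests that na_values adds to the built-in list of NA strings rather than replacing it.

```python
import io
import pandas as pd

# Same contents as foo.txt in the question, kept in memory for convenience.
csv_text = 'x,y\n"NA",NA\n"foo",foo\n'

# Default call: the built-in NA strings (which include "NA") are parsed as NaN,
# whether or not the value was quoted in the file.
default_df = pd.read_csv(io.StringIO(csv_text))

# na_values="foo": "foo" is treated as NaN in addition to the built-in strings.
extended_df = pd.read_csv(io.StringIO(csv_text), na_values="foo")

print(default_df.isna())
print(extended_df.isna())
```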

I can understand this being the default behavior for a column of numbers, but it seems like it should not be the default for a column of strings. I would also really like to be able to understand this from the documentation, without needing Marius's answer.

Edit to add (2):

By contrast, if I had a Stata or Excel file with a value of "NA", I would not expect it to be treated as NaN/missing. So why does pandas do this?


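If you pass keep_default_na=False, only the values you specify in na_values are treated as NaN, so NA is no longer converted. Here NA is kept as a string instead of becoming NaN: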

pd.read_csv('foo.txt', keep_default_na=False)
Out[5]: 
     x    y
0   NA   NA
1  foo  foo
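
If "NA" should stay a string in one column but still mark a missing number in another, keep_default_na=False can be combined with a per-column na_values dict. The following is a sketch under assumptions, not something taken from the post: the region and score columns and their data are hypothetical.

```python
import io
import pandas as pd

# Hypothetical data: "NA" means North America in `region`,
# but marks a missing number in `score`.
csv_text = "region,score\nNA,1\nEU,NA\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,        # drop the built-in NA strings for every column
    na_values={"score": ["NA"]},  # treat "NA" as missing in `score` only
)

print(df)
print(df.dtypes)
```

With this call, region should keep "NA" as an ordinary string, while the "NA" in score becomes NaN and the column is read as floating point.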

Source: https://habr.com/ru/post/1598786/