Skip rows with missing values in read_csv

I have a very big CSV file that I need to read. To read it quickly and save RAM, I use read_csv and set the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses float to represent them; a minimal repro of the resulting error is sketched after the list below.

  • Is it possible to simply skip the rows with missing values? I know that I can do this after reading the entire file, but that means I could not set the dtype until then, and would therefore use too much RAM.
  • Is it possible to convert the missing values to something of my choosing while reading the data?
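To illustrate the problem, here is a minimal sketch (the inline data and column names are made up for the example): forcing an unsigned integer dtype on a column that contains an empty field typically fails with a ValueError, because NaN cannot be stored in a uint32 column.

import io
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing value in column "b".
data = io.StringIO("a,b\n1,2\n3,\n5,6\n")

try:
    df = pd.read_csv(data, dtype={"a": np.uint32, "b": np.uint32})
except ValueError as exc:
    # pandas cannot store NaN in an unsigned integer column,
    # so forcing the dtype at read time fails.
    print(exc)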
3 answers

It would be nice if you could fill the NaN with, say, 0 while reading. Perhaps a feature request on the pandas GitHub is in order...

Using the converter function

However, for now, you can define your own function for this and pass it to the converters argument of read_csv:

import pandas as pd

def conv(val):
    # The converter receives the raw string from the file, so a missing
    # field arrives as an empty string, not as np.nan.
    if val == '':
        return 0 # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN : conv}, dtype=...)

Note that converters accepts a dict, so you need to specify it for each column that has NaN. This can become a little tedious if many columns are affected. You can specify column names or column numbers as keys.
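If many columns are affected, one way to avoid typing the dict out by hand is to build it programmatically (a sketch; the file name and column names are hypothetical):

import pandas as pd

def conv(val):
    # Missing fields reach the converter as empty strings.
    return 0 if val == '' else int(val)

# Hypothetical list of columns that may contain missing values.
cols_with_nan = ['col_a', 'col_b', 'col_c']

df = pd.read_csv('file.csv', converters={c: conv for c in cols_with_nan})

Keep in mind that if both a converter and a dtype are given for the same column, pandas prefers the converter, so you may still need an astype afterwards to end up with np.uint32.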

If you don't want to write out a whole function, you can pass a lambda to converters in read_csv instead. Again, test against the empty string, because that is how a missing field reaches the converter:

df = pd.read_csv(file, converters={colWithNaN : lambda x: 0 if x == '' else x}, dtype=...)

Using chunksize

Alternatively, you can read the file in small chunks, drop the rows with missing values in each chunk, convert the dtype, and stitch the result together:

pieces = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk = chunk.dropna(axis=0) # Dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    pieces.append(chunk)
result = pd.concat(pieces, ignore_index=True)
del pieces, chunk

Note that only one chunk of the file is parsed at a time, and each chunk is filtered and downcast before being collected, so the pieces stay small. The chunksize lets you trade speed against peak memory, but the final concatenated result still has to fit in RAM.
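If you only need some of the columns, you can shrink memory further by combining the chunked read with usecols; the file name and column names below are made up for illustration:

import numpy as np
import pandas as pd

wanted = ['id', 'count']  # hypothetical columns of interest

pieces = []
for chunk in pd.read_csv('file.csv', usecols=wanted, chunksize=100_000):
    chunk = chunk.dropna()                   # drop rows with any missing value
    pieces.append(chunk.astype(np.uint32))   # safe once the NaNs are gone
result = pd.concat(pieces, ignore_index=True)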


There is no built-in way in pandas to do this during the read itself. You can do it in plain Python, for example:

import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""

    for record in records:
        for k, v in record.items():
            if v == '':
                break
            record[k] = int(v)
        else: # this executes whenever break did not
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))

Keep in mind that this goes through Python's csv module rather than pandas' own parser (which is written in Cython), so it will be noticeably slower on a large file.
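If you take this route and still want the np.uint32 dtype asked about in the question, you can downcast the resulting frame afterwards; this sketch continues from the df built above:

import numpy as np

# filter_records() already turned every surviving field into an int,
# so the columns are plain int64 and can be downcast without data loss.
df = df.astype(np.uint32)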


For anyone else who lands here from search: you can stop pandas from turning empty fields into NaN in the first place with

pd.read_csv('FILE', keep_default_na=False)

From the documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.

keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
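To tie this back to the question (a sketch; the file name and the 'count' column are hypothetical): with keep_default_na=False the empty fields come through as empty strings rather than NaN, so you can drop those rows yourself and only then cast to np.uint32. Note that the column is read as strings first, so the RAM saving only kicks in after the cast.

import numpy as np
import pandas as pd

# Read everything as strings and keep empty fields as '' instead of NaN.
df = pd.read_csv('FILE', dtype=str, keep_default_na=False)

# Drop rows where the column of interest is empty, then downcast.
df = df[df['count'] != '']
df['count'] = pd.to_numeric(df['count']).astype(np.uint32)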

Source: https://habr.com/ru/post/1650483/

