Skip rows with missing values in read_csv

I have a very big CSV file that I need to read. To read it quickly and save RAM, I use read_csv and set the dtype of some columns to np.uint32. The problem is that some rows have missing values, and pandas uses float to represent them; a minimal repro of the resulting error is sketched after the list below.

  • Is it possible to simply skip the rows with missing values? I know that I can do this after reading the entire file, but that means I could not set the dtype until then, and would therefore use too much RAM.
  • Is it possible to convert the missing values to something of my choosing while reading the data?
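To illustrate the problem, here is a minimal sketch (the inline data and column names are made up for the example): forcing an unsigned integer dtype on a column that contains an empty field typically fails with a ValueError, because NaN cannot be stored in a uint32 column.

import io
import numpy as np
import pandas as pd

# Hypothetical data: the second row has a missing value in column "b".
data = io.StringIO("a,b\n1,2\n3,\n5,6\n")

try:
    df = pd.read_csv(data, dtype={"a": np.uint32, "b": np.uint32})
except ValueError as exc:
    # pandas cannot store NaN in an unsigned integer column,
    # so forcing the dtype at read time fails.
    print(exc)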
3 answers

It would be nice if you could fill the NaN with, say, 0 while reading. Perhaps a feature request on the pandas GitHub is in order...

Using the converter function

However, for now, you can define your own function for this and pass it to the converters argument of read_csv:

import pandas as pd

def conv(val):
    # The converter receives the raw string from the file, so a missing
    # field arrives as an empty string, not as np.nan.
    if val == '':
        return 0 # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN : conv}, dtype=...)

Note that converters accepts a dict, so you need to specify it for each column that has NaN. This can become a little tedious if many columns are affected. You can specify column names or column numbers as keys.
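If many columns are affected, one way to avoid typing the dict out by hand is to build it programmatically (a sketch; the file name and column names are hypothetical):

import pandas as pd

def conv(val):
    # Missing fields reach the converter as empty strings.
    return 0 if val == '' else int(val)

# Hypothetical list of columns that may contain missing values.
cols_with_nan = ['col_a', 'col_b', 'col_c']

df = pd.read_csv('file.csv', converters={c: conv for c in cols_with_nan})

Keep in mind that if both a converter and a dtype are given for the same column, pandas prefers the converter, so you may still need an astype afterwards to end up with np.uint32.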

If you don't want to write out a whole function, you can pass a lambda to converters in read_csv instead. Again, test against the empty string, because that is how a missing field reaches the converter:

df = pd.read_csv(file, converters={colWithNaN : lambda x: 0 if x == '' else x}, dtype=...)

Using chunksize

Alternatively, you can read the file in small chunks, drop the rows with missing values in each chunk, convert the dtype, and stitch the result together:

pieces = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk = chunk.dropna(axis=0) # Dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    pieces.append(chunk)
result = pd.concat(pieces, ignore_index=True)
del pieces, chunk

Note that only one chunk of the file is parsed at a time, and each chunk is filtered and downcast before being collected, so the pieces stay small. The chunksize lets you trade speed against peak memory, but the final concatenated result still has to fit in RAM.
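If you only need some of the columns, you can shrink memory further by combining the chunked read with usecols; the file name and column names below are made up for illustration:

import numpy as np
import pandas as pd

wanted = ['id', 'count']  # hypothetical columns of interest

pieces = []
for chunk in pd.read_csv('file.csv', usecols=wanted, chunksize=100_000):
    chunk = chunk.dropna()                   # drop rows with any missing value
    pieces.append(chunk.astype(np.uint32))   # safe once the NaNs are gone
result = pd.concat(pieces, ignore_index=True)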


There is no built-in way in pandas to do this during the read itself. You can do it in plain Python, for example:

import csv
import pandas as pd

def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""

    for record in records:
        for k, v in record.items():
            if v == '':
                break
            record[k] = int(v)
        else: # this executes whenever break did not
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))

Keep in mind that this goes through Python's csv module rather than pandas' own parser (which is written in Cython), so it will be noticeably slower on a large file.
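If you take this route and still want the np.uint32 dtype asked about in the question, you can downcast the resulting frame afterwards; this sketch continues from the df built above:

import numpy as np

# filter_records() already turned every surviving field into an int,
# so the columns are plain int64 and can be downcast without data loss.
df = df.astype(np.uint32)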


For anyone else who lands here from search: you can stop pandas from turning empty fields into NaN in the first place with

pd.read_csv('FILE', keep_default_na=False)

From the documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.

keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
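To tie this back to the question (a sketch; the file name and the 'count' column are hypothetical): with keep_default_na=False the empty fields come through as empty strings rather than NaN, so you can drop those rows yourself and only then cast to np.uint32. Note that the column is read as strings first, so the RAM saving only kicks in after the cast.

import numpy as np
import pandas as pd

# Read everything as strings and keep empty fields as '' instead of NaN.
df = pd.read_csv('FILE', dtype=str, keep_default_na=False)

# Drop rows where the column of interest is empty, then downcast.
df = df[df['count'] != '']
df['count'] = pd.to_numeric(df['count']).astype(np.uint32)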

Source: https://habr.com/ru/post/1650483/

