Fread error "Undefined character termination field"

Could you help me?

I am trying to read a large TSV file (4 million lines) using "fread" (great speed :)

The problem is that when a certain line is reached, the whole program crashes. The last message from verbose is "Bumping column 12 from INT64 to REAL in data line 2220004, field contains "0.54""

I tried reading only up to that line (with "nrows"), which worked fine, but when I then tried to read the remaining lines with the "skip" option, it immediately raised another error: Unexpected character ("Am"), field 5 line 2220005

I also tried disabling the header, dropping the 12th column, and specifying column classes; none of it worked.

Any ideas how to overcome this problem?

My code is:

library(data.table)
movies <- fread('avito_train.tsv', verbose=TRUE, nrows=2220002)
# Column 12 is given as "numeric" because "real" is not an R class name
movies2 <- fread('avito_train.tsv', verbose=TRUE, sep="\t", skip=2220004, colClasses=c("integer", "character", "character","character","character", "character","integer","integer","integer","integer","integer","numeric", "numeric"))

Oh, in case it matters, the text in the TSV file is in a Slavic language.

1 answer

It works fine for me using the latest version of data.table from GitHub. Perhaps two recent changes listed in the README fixed it:

fread():
* A change to fread, with Clayton Stanley named in the README entry.
* A change to fread, with 2970844 named in the README entry.
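
For anyone who wants to retry with the development version, here is a minimal sketch. It assumes the `remotes` package is installed and that the repository is `Rdatatable/data.table` on GitHub; neither detail is stated in this answer, and `install_dev_data_table` is a hypothetical helper name, not part of data.table.

```r
# Report the installed data.table version, if any.
installed <- if (requireNamespace("data.table", quietly = TRUE)) {
  as.character(utils::packageVersion("data.table"))
} else {
  NA_character_
}
cat("Installed data.table version:", installed, "\n")

# Hypothetical helper (not part of data.table): installs the development
# version from GitHub. Assumes the 'remotes' package is installed.
install_dev_data_table <- function() {
  remotes::install_github("Rdatatable/data.table")
}
# Call install_dev_data_table() manually, then restart R before retrying fread().
```

The install is wrapped in a function so that nothing is downloaded until it is called explicitly.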

Output (on a slow netbook with 4 GB RAM):

$ file avito_train.tsv 
avito_train.tsv: UTF-8 Unicode text, with very long lines

> DT = fread("Downloads/avito_train.tsv",verbose=TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 2.915 GB
File is opened and mapped ok
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep='\t'
Found 13 columns
First row with 13 fields occurs on line 1 (either column names or first row of data)
All the fields on line 1 are character fields. Treating as the column names.
Count of eol after first data row: 3995804
Subtracted 1 for last eol and any trailing empty lines, leaving 3995803 data rows
Type codes (   first 5 rows): 1444441111113
Type codes (+ middle 5 rows): 1444441111113
Type codes (+   last 5 rows): 1444441111113
Type codes: 1444441111113 (after applying colClasses and integer64)
Type codes: 1444441111113 (after applying drop or select (if supplied)
Allocating 13 column slots (13 - 0 dropped)
Read 3995803 rows and 13 (of 13) columns from 2.915 GB file in 00:10:49
  82.590s ( 13%) Memory map (rerun may be quicker)
   2.930s (  0%) sep and header detection
  68.290s ( 11%) Count rows (wc -l)
   0.000s (  0%) Column type detection (first, middle and last 5 rows)
   3.550s (  1%) Allocation of 3995803x13 result (xMB) in RAM
 491.590s ( 76%) Reading data
   0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
   0.000s (  0%) Coercing data already read in type bumps (if any)
   0.080s (  0%) Changing na.strings to NA
 649.030s        Total
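
The "Type codes" string in the verbose output can be read one digit per column. The sketch below decodes it under an assumed mapping (0 = logical, 1 = integer, 2 = integer64, 3 = numeric, 4 = character); that mapping is an assumption about this version of fread, though it is consistent with the columns printed by head(DT). The column names are copied from that output.

```r
# Assumed mapping of fread's internal type codes (not confirmed by the answer).
type_names <- c("0" = "logical", "1" = "integer", "2" = "integer64",
                "3" = "numeric", "4" = "character")

# One digit per column, taken from the verbose "Type codes" lines.
codes <- strsplit("1444441111113", "")[[1]]

# Column names as printed by head(DT).
col_names <- c("itemid", "category", "subcategory", "title", "description",
               "attrs", "price", "is_proved", "is_blocked", "phones_cnt",
               "emails_cnt", "urls_cnt", "close_hours")

# Look up each code by name and label the result with the column names.
detected <- setNames(unname(type_names[codes]), col_names)
print(detected)
```

Under this reading, column 12 (urls_cnt) is detected as integer and only close_hours as numeric, so no type bump is needed during the read.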


> head(DT)
     itemid    category               subcategory                 title
1: 10000010               Toyota Sera, 1991
2: 10000025                          
3: 10000094   , ,        Steilmann
4: 10000101                Ford Focus, 2011
5: 10000132                  3.0 Bar
6: 10000152            2115 Samara, 2005
                                                                                                                                                                                                                                                                                                description
1:                                                                                          (, ),   ,  16- ,   ,    . ^p ! ! ^p ,   !!!
2:                                                                                                                                                                                                                                                          ^p :8@@PHONE@@
3:                                                                                                                                   .     . V    .      .     (+3-4 ).  40
4:     ,  , ,   ..   ,    .    /   .       !!!       .
5:                                                                                                                                                                                                                                       V-6 . V-8   16   .....
6:                                                                                                                                                                                                         8 @@PHONE@@
                                                                                                                                                                                                                                                                                                                                            attrs
1:        {"" "":""1991"", "" "":"""", """":""10 000 - 14 999"", "" "":"""", "" "":""1.5"", "" "":"""", """":""Toyota"", """":""Sera"", """":"""", """":"""", """":"""", """":"" ""}
2:                                                                                                                                                                                                                                                                                                     {"" "":"", ""}
3:                                                                                                                                                                                                                                            {"" "":"" "", "" "":""  "", """":""4648 (L)""}
4:              {"""":""Ford"", """":""Focus"", "" "":""2011"", """":""80 000 - 84 999"", "" "":"""", """":"""", "" "":""1.6"", "" "":"""", "" "":"""", """":"""", """":"""", """":"" ""}
5:                                                                                                                                                                                                                                                                              {"" "":"""", "" "":"" ""}
6: {"""":"" (LADA)"", """":""2115 Samara"", "" "":""2005"", """":""140 000 - 149 999"", "" "":"""", """":"""", "" "":""1.5"", "" "":"""", "" "":"""", """":"""", """":"""", """":"" ""}
    price is_proved is_blocked phones_cnt emails_cnt urls_cnt close_hours
1: 150000        NA          0          0          0        0        0.03
2:      0        NA          0          1          0        0       22.38
3:   1500        NA          0          0          0        0        0.41
4: 365000        NA          0          0          0        0        8.87
5:   5000        NA          0          0          0        0       11.82
6:      0        NA          0          1          0        0       22.55


> tail(DT)
     itemid            category                subcategory                                              title
1: 99999929                               
2: 99999962                    Bridgestone-Blizzak WS-60-225/50 R17--
3: 99999973                                                           1- , 39 ²
4: 99999974                                          ,    
5: 99999977                                                       Nokia 
6: 99999982                                            
                                                                                                                                                                                                                                                    description
1: 2    1560()*1050()    ,  2  ,,.       .  (  , ) 4000  ,   7000.
2:                                                                                                                   4 .  5-6 , . ^p   16 000  ^p    ^p 8-@@PHONE@@
3:                                                                                                                                                                                                                                 . .
4:                                                             .   , . ^p -  ,  ^p -  ^p -   ^p -  ^p -    ^p - 
5:                                                                                                                                                                                                                                           
6:                                                                                                                 .  .   ,   ,        ().
                                                                                                                          attrs price is_proved is_blocked phones_cnt emails_cnt urls_cnt close_hours
1:                                                                                          {"" "":""  ""}  4000        NA          0          0          0        0        0.69
2:                                                           {"" "":"",   "", "" "":""""} 16000        NA          0          1          0        0        0.04
3: {"" "":"""", "" "":""1"", "" "":""  "", """":""""} 11000        NA          0          0          0        0        0.20
4:                                                                                   {"" "":"", ""}     0        NA          0          0          0        0       23.50
5:                                                                                                {"" "":""""}   300        NA          0          0          0        0        5.72
6:                                                                                                 {"" "":""""}   300        NA          0          0          0        0       19.08


> dim(DT)
[1] 3995803      13


$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             AuthenticAMD
CPU family:            20
Model:                 2
Stepping:              0
CPU MHz:               800.000      # i.e. my slow netbook (4GB RAM)
BogoMIPS:              1995.01
Virtualisation:        AMD-V
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
NUMA node0 CPU(s):     0,1

Source: https://habr.com/ru/post/1546102/