I am new to pandas and R, and I'm not sure how to read this data into either one for analysis.
At first I thought I could use chunked reading (read_chunkwise in R, or chunksize in pandas), but that might not be what I need. Is this something that is easy to solve with a for loop, or by using purrr to iterate over all the elements?
Data:
wine/name: 1981 Ch&
wine/wineId: 18856
wine/variant: Red Rhone Blend
wine/year: 1981
review/points: 96
review/time: 1160179200
review/userId: 1
review/userName: Eric
review/text: Olive, horse sweat, dirty saddle, and smoke. This actually got quite a bit more spicy and expressive with significant aeration. This was a little dry on the palate first but filled out considerably in time, lovely, loaded with tapenade, leather, dry and powerful, very black olive, meaty. This improved considerably the longer it was open. A terrific bottle of 1981, 96+ and improving. This may well be my favorite vintage of Beau except for perhaps the 1990.
wine/name: 1995 Ch&
wine/wineId: 3495
wine/variant: Red Bordeaux Blend
wine/year: 1995
review/points: 93
review/time: 1063929600
review/userId: 1
review/userName: Eric
review/text: A remarkably floral nose with violet and chambord. On the palate this is super sweet and pure with a long, somewhat searing finish. My notes are very terse, but this was a lovely wine.
This is currently what I have as a function, but I ran into an error:
convertchunkfile <- function(df){
  for(i in 1:length(df)){
    while(nchar(df[[i]]) != 0){
      case_when(
        cleandf$WineName[[i]] <- df[i] == str_sub(df[1],0, 10) ~ str_trim(substr(df[1], 11, nchar(df[1]))),
        cleandf$WineID[[i]] <- df[i] == str_sub(df[2],0,11) ~ str_trim(substr(df[2], 13, nchar(df[1])))
      )
    }
  }
}
Error in cleandf$BeerName[[i]] <- df[i] == str_sub(df[1], 0, 10) ~ str_trim(substr(df[1], :
more elements supplied than there are to replace
EDIT:
After some trouble, I think this is perhaps the best approach, adapted from @hereismyname's answer:
iconv -c -t UTF-8 cellartracker-clean.txt > cellartracker-iconv.txt
wc -l cellartracker-iconv.txt
20259950 cellartracker-iconv.txt
file -I cellartracker-clean.txt
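library(readr)
library(dplyr)
library(tidyr)
library(tibble)

ReadEmAndWeep <- function(file, chunk_size) {
  # Callback applied to each chunk of lines; assumes every record is exactly
  # 9 non-empty "key: value" lines and that no record straddles two chunks.
  f <- function(chunk, pos) {
    tibble(text = chunk) %>%
      filter(text != "") %>%
      # Split on the first ":" only, so colons inside the review text survive.
      separate(text, c("var", "value"), ":", extra = "merge") %>%
      mutate(
        chunk_id = rep(1:(nrow(.) / 9), each = 9),
        value = trimws(value)
      ) %>%
      spread(var, value)
  }
  read_lines_chunked(file, DataFrameCallback$new(f), chunk_size = chunk_size)
}

# Read the iconv-cleaned file produced by the shell step above.
dataframe <- ReadEmAndWeep("cellartracker-iconv.txt", chunk_size = 100000)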
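Since I also asked about pandas, here is a rough Python sketch of the same idea, in case it helps someone. It is not taken from @hereismyname's answer. The names iter_records and read_reviews are my own, and it assumes every field sits on a single "key: value" line and that records are separated by a blank line or by a repeated key. Opening the file with errors="ignore" drops undecodable bytes, which plays roughly the same role as the iconv -c step.

```python
from typing import Dict, Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group 'key: value' lines into one dict per review (hypothetical helper)."""
    record: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        # Split on the first ":" only, so colons inside the review text survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # assumption: lines without a ":" carry no field and are skipped
        key = key.strip()
        if key in record:
            # A repeated key means a new record began without a blank line.
            yield record
            record = {}
        record[key] = value.strip()
    if record:
        yield record


def read_reviews(path: str) -> Iterator[Dict[str, str]]:
    """Stream records from a file, silently dropping undecodable bytes."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        yield from iter_records(handle)


# Small in-memory check using a few fields from the sample data above.
sample = """\
wine/name: 1981 Ch&
wine/wineId: 18856
review/points: 96

wine/name: 1995 Ch&
wine/wineId: 3495
review/points: 93
"""
records = list(iter_records(sample.splitlines()))
```

The resulting list of dicts can be handed to pandas.DataFrame(records) to get one row per review. For the full 20-million-line file it may be safer to build the frame in batches rather than all at once, but I have not measured that.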