R: How to read selected columns from RDS files?

How can I read only part of the data from very large files?

Sample data is generated as:

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))
head(df)
#     X1   X2   X3   X4   X5   X6   X7   X8   X9  X10  X1.1  X2.1  X3.1  X4.1  X5.1  X6.1  X7.1  X8.1  X9.1 X10.1
# 1  575 1843 1854  883  592 1362 1075  210 1526 1365 Qk8NP Xvw9z OYRa1 8BGIV bejiv CCoIE XDKJN HR7zc 2kKNY 1I5h8
# 2 1577  390 1861  912  277  636  758 1461 1978 1865 ZaHFl QLsli E7lbs YGq8u DgUAW c6JQ0 RAZFn Sc0Zt mif8I 3Ys6U
# 3  818 1076  147 1221  257 1115  759 1959 1088 1292 jM5Uw ctM3y 0HiXR hjOHK BZDOP ULQWm Ei8qS BVneZ rkKNL 728gf
# 4 1766  884 1331 1144 1260  768 1620 1231 1428 1193 r4ZCI eCymC 19SwO Ht1O0 repPw YdlSW NRgfL RX4ta iAtVn Hzm0q
# 5 1881 1851 1324 1930 1584 1318  940 1796  830   15 w8d1B qK1b0 CeB8u SlNll DxndB vaufY ZtlEM tDa0o SEMUX V7tLQ
# 6   91  264 1563  414  914 1507 1935 1970  287  409 gsY1u FxIgu 2XqS4 8kreA ymngX h0hkK reIsn tKgQY ssR7g W3v6c

The data frame is saved with saveRDS:

saveRDS(df, 'df.rds')

The file size can be checked with the following commands:

file.info('df.rds')$size
# [1] 29935125
utils:::format.object_size(29935125, "auto")
# [1] "28.5 Mb"

The saved file is read back with readRDS:

readRDS('df.rds')

However, some of my files are several GB in size, and some processing steps need only a few columns. Can I read selected columns from RDS files?

Note: I already have RDS files that were created after a significant amount of processing. I now want to find the best way to read selected columns from these existing RDS files.

2 answers

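One option is to store the data in an SQLite database using the RSQLite package. You can then query it with SQL or with dplyr, so that only the columns you ask for are pulled from the database file.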

set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))

library(DBI)      # generic database interface: dbConnect(), dbWriteTable(), dbGetQuery()
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), dbname = "myDB")
dbWriteTable(conn,"mytable",df)
alltables <- dbListTables(conn)
# Use sql queries to query data...
oneColumn <- dbGetQuery(conn,"SELECT X1 FROM mytable")

library(dplyr)
library(dbplyr)
my_db <- tbl(conn, "mytable")
my_db
# Use dplyr functions to query data...
# The query is lazy; append %>% collect() to pull the result into R
my_db %>% select(X1)
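
For RDS files that already exist, the same idea could be applied as a one-time conversion. The following is only a minimal sketch under stated assumptions: the temporary paths, the table name "mytable", and the tiny stand-in data frame are all hypothetical, and the RDS file still has to be read in full once during the conversion.

```r
library(DBI)

# Hypothetical small stand-in for a large data frame saved earlier with saveRDS()
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(X1 = 1:5,
                   X2 = letters[1:5],
                   X3 = c(0.5, 1.5, 2.5, 3.5, 4.5),
                   stringsAsFactors = FALSE),
        rds_path)

# One-time conversion: read the whole RDS file and copy it into an SQLite table
db_path <- tempfile(fileext = ".sqlite")
conn <- dbConnect(RSQLite::SQLite(), dbname = db_path)
dbWriteTable(conn, "mytable", readRDS(rds_path))

# Afterwards: fetch only the columns that are needed
# (column names are assumed to be plain identifiers that need no quoting)
cols <- c("X1", "X3")
sql <- paste("SELECT", paste(cols, collapse = ", "), "FROM mytable")
selected <- dbGetQuery(conn, sql)

dbDisconnect(conn)
```

After the conversion, later sessions only need the dbConnect() and dbGetQuery() steps, so the full data frame does not have to be loaded again.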

As far as I know, you cannot read only part of an rds or rda file.

An alternative is to use feather. For example, with one of my own files:

library(feather)
file.info("../feathers/C1.feather")["size"]
#                              size
#  ../feathers/C1.feather 498782328

system.time( c1whole <- read_feather("../feathers/C1.feather") )
#     user  system elapsed
#    0.860   0.856   5.540
system.time( c1dyn <- feather("../feathers/C1.feather") )
#     user  system elapsed
#        0       0       0

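# ls.objects() is not part of base R; it appears to be a user-defined helper that lists objects and their sizes
ls.objects()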
#             Type      Size PrettySize          Dim
#  c1dyn   feather      3232     3.2 Kb 2886147 x 36
#  c1whole  tbl_df 554158688   528.5 Mb 2886147 x 36

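Both objects can be used like data.frames: c1whole is fully loaded into memory, while c1dyn only references the file on disk and reads data on demand.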

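NB: some functions (for example, those from dplyr) may not work directly on a feather object; in that case, convert it to a data.frame or tbl_df first.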
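
To connect this back to the question, here is a minimal sketch of converting an existing RDS file to feather once and then reading only selected columns. It assumes a version of the feather package whose read_feather() accepts a columns argument; the temporary paths and the small stand-in data frame are hypothetical.

```r
library(feather)

# Hypothetical small stand-in for a large data frame saved earlier with saveRDS()
rds_path <- tempfile(fileext = ".rds")
saveRDS(data.frame(X1 = 1:5,
                   X2 = letters[1:5],
                   X3 = c(0.5, 1.5, 2.5, 3.5, 4.5),
                   stringsAsFactors = FALSE),
        rds_path)

# One-time conversion: the RDS file must still be read in full here
feather_path <- tempfile(fileext = ".feather")
write_feather(readRDS(rds_path), feather_path)

# Afterwards: read only the columns that are needed
subset_df <- read_feather(feather_path, columns = c("X1", "X3"))
```

The result is a tbl_df that contains only the requested columns, so the remaining columns are never loaded into memory.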


Source: https://habr.com/ru/post/1689273/

