How to get over 100,000 rows from Redshift using R and dplyr

I am analyzing data from a Redshift database in R, using a dplyr connection, which works:

my_db <- src_postgres(host = 'my-cluster-blahblah.redshift.amazonaws.com',
                      port = '5439', dbname = 'dev',
                      user = 'me', password = 'mypw')
mytable <- tbl(my_db, "mytable")

viewstation <- mytable %>%
    filter(stationname == "something")

When I try to turn this output into a data frame, like this:

thisdata <- data.frame(viewstation)

I get a warning:

Only first 100,000 results retrieved. Use n = -1 to retrieve all. 

Where am I supposed to set n?

2 answers

Instead of

thisdata <- data.frame(viewstation)

use

thisdata <- collect(viewstation)

collect() pulls all of the data from the database back into R. As the dplyr databases vignette explains:

When working with databases, dplyr tries to be as lazy as possible. It's lazy in two ways:

It never pulls data back to R unless you explicitly ask for it.

It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
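
To see this laziness in action, a minimal sketch, reusing the my_db connection and mytable object from the question:

library(dplyr)

# Building the query does not touch the database;
# dplyr only records the operations to perform.
viewstation <- mytable %>%
    filter(stationname == "something")

# show_query() prints the SQL dplyr has assembled,
# still without fetching any rows.
show_query(viewstation)

# Only collect() actually executes the query and brings
# the result back into R as a local data frame.
thisdata <- collect(viewstation)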


This changed in dplyr 0.5 (if I remember correctly).

You now pass n as an argument to collect():

my_db <- src_postgres(host = 'my-cluster-blahblah.redshift.amazonaws.com',
                      port = '5439', dbname = 'dev',
                      user = 'me', password = 'mypw')
mytable <- tbl(my_db, "mytable") %>% collect(n = Inf)

This retrieves all rows, not just the first 100,000.
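
Combining the two answers, a sketch that assumes the same cluster, table, and stationname filter as in the question: the filter is translated to SQL and runs in Redshift, and collect(n = Inf) then fetches every matching row instead of stopping at the 100,000-row default.

library(dplyr)

# Connection details copied from the question.
my_db <- src_postgres(host = 'my-cluster-blahblah.redshift.amazonaws.com',
                      port = '5439', dbname = 'dev',
                      user = 'me', password = 'mypw')

# Filter in the database, then pull back all matching rows.
thisdata <- tbl(my_db, "mytable") %>%
    filter(stationname == "something") %>%
    collect(n = Inf)

Filtering before collecting is usually preferable with Redshift, since only the rows you need travel over the network.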

