I have a dataset that contains 57 million rows and 23 columns. One column holds the species names of different birds (about 2000 unique names). For each unique species name, I would like to select two data columns (latitude, longitude) and write the lat/long data for that species to a file, using the species name as the file name. Doing this in R, the only language I know, takes too much time. What would be the appropriate code for this task?
Here is some pseudocode showing what I assume the code might look like:
    FOR EACH name IN unique(species_name)
        SELECT latitude, longitude WHERE species_name = name
        WRITE [some code that writes a text file with the species name as the file name]
    END LOOP;
Can I do this sort of thing in an OS X terminal?
EDIT 20111211: Here is my workflow from R:
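    require(RMySQL)
    require(plyr)

    # Connect to the local MySQL database
    drv <- dbDriver("MySQL")
    con <- dbConnect(drv, user = "asdfaf", dbname = "test", host = "localhost")

    # Species names to process
    splist <- read.csv("splist_use.csv")

    # Query the two data columns for one species and write them to <species>.csv
    sqlwrite <- function(spname) {
      cat(spname)
      g1 <- dbGetQuery(con, paste("SELECT col_16,col_18 FROM dat WHERE col_11='", spname, "'", sep = ""))
      write.csv(g1, paste(spname, ".csv", sep = ""))
      rm("g1")
    }

    # Iterate over the names themselves, not over the data frame's columns
    # (this assumes the names are in the first column of splist)
    l_ply(as.character(splist[[1]]), sqlwrite, .progress = "text")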
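For comparison, the R workflow above runs one SELECT per species, which means roughly 2000 separate queries against the 57-million-row table. One possible alternative is to read the data once and route each row to its species file. The following is only a minimal sketch in Python, under several assumptions: the table has been exported to a CSV file with a header row, the column names `col_11`, `col_16`, and `col_18` match the query above, and species names contain no characters that are invalid in file names. The function name `split_by_species` and its parameters are made up for illustration.

```python
import csv
import os
from collections import defaultdict


def split_by_species(in_path, out_dir, species_col="col_11",
                     keep_cols=("col_16", "col_18"), flush_every=100_000):
    """Read the CSV once and write the kept columns to <out_dir>/<species>.csv.

    Assumes in_path has a header row containing species_col and keep_cols.
    Returns the number of distinct species written.
    """
    os.makedirs(out_dir, exist_ok=True)
    buffers = defaultdict(list)  # species name -> rows not yet written
    seen = set()                 # species whose file already has a header

    def flush():
        # Write the buffered rows, opening one file at a time.
        for name, rows in buffers.items():
            path = os.path.join(out_dir, name + ".csv")
            is_new = name not in seen
            # First write in this run truncates; later writes append.
            with open(path, "w" if is_new else "a", newline="") as out:
                writer = csv.writer(out)
                if is_new:
                    writer.writerow(keep_cols)
                    seen.add(name)
                writer.writerows(rows)
        buffers.clear()

    pending = 0
    with open(in_path, newline="") as src:
        for row in csv.DictReader(src):
            buffers[row[species_col]].append([row[c] for c in keep_cols])
            pending += 1
            if pending >= flush_every:
                flush()
                pending = 0
    flush()  # write whatever is left after the last full batch
    return len(seen)
```

Calling `split_by_species("dat.csv", "by_species")` (both names are hypothetical) would create one file per species in a `by_species` directory. Rows are buffered and written in batches rather than keeping about 2000 files open at once, since the default per-process limit on open files can be lower than that.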