Using sqldf() to select rows matching millions of items

This is a continuation of the answer using sqldf() given in an earlier Stack Overflow question.

In my particular case, I have a tab-delimited file with over 110 million lines. I would like to select the rows corresponding to 4.6 million tag identifiers.

In the following code, the tag identifiers are in tag.query.

However, while this works with a smaller query, it cannot handle a query of that size:

    sql.query <- paste('select * from f where v2 in (', tag.query, ')', sep = '')

    selected.df <- sqldf(sql.query, dbname = tempfile(),
                         file.format = list(header = F, row.names = F,
                                            sep = "\t", skip = line.where.header.is))
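For reference, tag.query is just a comma-separated list of quoted identifiers, built roughly along these lines (tag.ids is a placeholder name standing in for my actual vector of identifiers):

    # Sketch: build the IN (...) list from a character vector of ids
    # (tag.ids is a placeholder, not a variable from the code above)
    tag.ids <- c("tag000001", "tag000002", "tag000003")
    tag.query <- paste(sprintf("'%s'", tag.ids), collapse = ", ")
    # tag.query is now "'tag000001', 'tag000002', 'tag000003'"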

Any suggestions for alternative approaches?

+4
2 answers

If the problem is speed, try creating an index on v2. See Example 4i on the sqldf home page. If that is still not fast enough, you can also try a different database backend: besides the default SQLite, sqldf supports H2, MySQL, and PostgreSQL.
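A rough sketch of what that can look like, reusing f, tag.query, and line.where.header.is from the question. This assumes sqldf's documented idiom of passing several SQL statements in one call, with the already-loaded table referred to as main.f in the second statement so it is not reloaded; I have not tested it at 110 million rows:

    library(sqldf)

    # Run the CREATE INDEX and the SELECT in a single sqldf() call so both
    # hit the same temporary on-disk database built from the file.
    sql <- c("create index idx_v2 on f(v2)",
             paste("select * from main.f where v2 in (", tag.query, ")", sep = ""))

    selected.df <- sqldf(sql, dbname = tempfile(),
                         file.format = list(header = F, row.names = F,
                                            sep = "\t", skip = line.where.header.is))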

+3

You will need to index your table, as mentioned above. But in my experience SQLite struggles to create an index on more than about 10 million records: it becomes extremely slow. I tried with 40 million records and it froze. I do not know how other databases perform CREATE INDEX on a table that large.

I had the same problem, and I ended up sorting the table by tag id and writing it to a text file. I then wrote a binary search in C++ that looks up tag identifiers directly in that text file. This sped things up considerably, since binary search is O(log N) versus O(N) for a grep-style scan, with N in the tens of millions. I can share it if you need it.
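If you would rather stay in R and the table fits in memory, data.table's keyed join uses the same sort-then-binary-search idea. This is a different tool than my C++ program, just the same principle; "big.tsv" and tag.ids below are placeholder names for your file and your vector of identifiers:

    library(data.table)

    # Sort once on the tag column, then each lookup is a binary search on the
    # key instead of a full scan of all ~110M rows.
    dt <- fread("big.tsv", sep = "\t", header = FALSE)   # needs enough RAM
    setkey(dt, V2)                                       # sorts the table by V2
    selected <- dt[J(unique(tag.ids)), nomatch = 0L]     # keyed (binary-search) join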

+1

Source: https://habr.com/ru/post/1402297/

