Python/Hive interface slow with fetchone(), hangs with fetchall()

Question

I have a Python script that queries HiveServer2 using pyhs2, for example:

    import pyhs2

    conn = pyhs2.connect(host='localhost', port=10000, user='user',
                         password='password', database='default')
    cur = conn.cursor()
    cur.execute("SELECT name,data,number,time FROM table WHERE date = '2014-01-01' AND number in (1,5,6,22) ORDER BY name,time ASC")
    line = cur.fetchone()
    while line is not None:
        # <do some processing, including writing to stdout>
        # . . .
        line = cur.fetchone()

I also tried using fetchall() instead of fetchone(), but that seems to run forever.

My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat, tab-delimited file and wrote the guts of my Python script against that, so I didn't have to wait for the query to finish each time I ran it. My script that reads the flat file finishes in ~20 minutes. What confuses me is that I don't see the same performance when I query Hive directly. In fact, it takes about 5 times longer to finish processing. I am new to Hive and Python, so maybe I am making some bone-headed errors, but the examples I see online show a setup like this one. I just want to iterate over my Hive result, getting one row at a time as quickly as possible, just as I did with my flat file. Any suggestions?

P.S. I found this similar question:

Python slow on fetchone, hangs on fetchall

but that turned out to be a SQLite issue, and I have no control over my Hive setup.

+6

python hive

Jeff hall Aug 11 '14 at 16:55

1 answer

yacc143 · Answer 1 · 2014-09-16T17:05:54+0000

Have you considered using fetchmany()?

That is the DB-API answer for fetching data in chunks: larger than one row at a time, where the per-call overhead is the problem, and smaller than all rows at once, where memory is the problem.
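As a rough illustration of the chunked approach, here is a minimal sketch of a generator that wraps fetchmany() so the processing loop can still consume one row at a time. The helper name iter_rows and the chunk size of 10000 are made up for this example, and an in-memory sqlite3 database stands in for Hive so the snippet runs on its own. It assumes the cursor follows the DB-API fetchmany() contract (an empty sequence once the rows are exhausted); whether the pyhs2 cursor behaves that way is worth checking before relying on it.

```python
import sqlite3


def iter_rows(cursor, chunk_size=10000):
    """Yield rows one at a time while fetching them from the cursor in chunks.

    Hypothetical helper: assumes cursor.fetchmany(n) returns an empty
    sequence when no rows are left, as the DB-API describes.
    """
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in for the Hive connection: an in-memory SQLite table with 25 rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (name TEXT, number INTEGER)")
cur.executemany("INSERT INTO t VALUES (?, ?)",
                [("row%d" % i, i) for i in range(25)])
cur.execute("SELECT name, number FROM t ORDER BY number")

# The processing loop still sees one row per iteration,
# but the cursor is only called once per chunk of 10 rows.
count = 0
for name, number in iter_rows(cur, chunk_size=10):
    count += 1
print(count)
```

With a Hive cursor, the loop body from the question would go inside the for loop in place of the counter. The chunk size is a tuning knob: larger values mean fewer round trips but more rows held in memory at once.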
