I have a Python script that queries HiveServer2 using pyhs2, like so:
    import pyhs2

    conn = pyhs2.connect(host='localhost', port=10000, user='user',
                         password='password', database='default')
    cur = conn.cursor()
    cur.execute("SELECT name,data,number,time FROM table "
                "WHERE date = '2014-01-01' AND number in (1,5,6,22) "
                "ORDER BY name,time ASC")

    line = cur.fetchone()
    while line is not None:
        # <do some processing, including writing to stdout>
        # . . .
        line = cur.fetchone()
I have also tried using fetchall() instead of fetchone(), but that just seems to hang forever.
My query runs just fine and returns ~270 million rows. For testing, I dumped the output from Hive into a flat tab-delimited file and wrote the guts of my Python script against that, so I didn't have to wait for the query to finish every time I ran it. The script that reads the flat file finishes in ~20 minutes. What confuses me is that I don't see the same performance when I query Hive directly. In fact, processing takes about 5 times longer. I am fairly new to Hive and Python, so maybe I am making some bonehead mistake, but the examples I see online show a setup like this one. I just want to iterate over my Hive result set, getting one row at a time as quickly as possible, much like I did with my flat file. Any suggestions?
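For comparison, the flat-file version of the loop looks roughly like the sketch below. This is only an illustration: `process` is a placeholder for the real per-row work, `run` is a helper name made up for this example, and the four-column layout is assumed from the SELECT list above.

```python
import csv
import sys


def iter_rows(handle):
    """Yield one row (a list of column strings) per line of a tab-delimited stream."""
    return csv.reader(handle, delimiter="\t")


def process(row):
    # Placeholder for the real per-row processing (hypothetical).
    # Assumes the four columns from the query: name, data, number, time.
    name, data, number, time = row
    return "%s\t%s" % (name, number)


def run(handle, out=sys.stdout):
    """Stream rows from handle, write one processed line per row, return the row count."""
    count = 0
    for row in iter_rows(handle):
        out.write(process(row) + "\n")
        count += 1
    return count
```

Calling `run(open(path))` on the exported file (where `path` is wherever the Hive dump was saved) streams the rows without loading the whole file into memory, which is the behaviour I would like to match when reading from the cursor.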
P.S. I found this similar question:
Python slow on fetchone, hangs on fetchall
but that turned out to be an SQLite issue, and I have no control over my Hive setup.