Read multiple files into a Hive table by date

Suppose I store one file per day in the format:

/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv

How can I read these files into a single Hive table for a given date range (e.g. from 2016-06-04 to 2016-08-03)?

3 answers

Assuming all files follow the same pattern, I would suggest that you store files with the following naming convention:

/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv

Then you can create an external table, partitioned by dt and pointing to the location /path/to/files/:

CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'

If you have many partitions and do not want to write an alter table yourtable add partition ... statement for each of them, you can simply use the repair command, which will discover and add the partitions automatically:

msck repair table yourtable
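If you do want to add the partitions by hand instead, the per-day form that msck repair saves you from writing looks like this (dates taken from the example layout above):

ALTER TABLE yourtable ADD PARTITION (dt='2016-07-31');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-01');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-02');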

Then you can query the table for the desired date range:
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
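Since dt is a partition column, Hive prunes the scan to the directories whose dt falls inside the BETWEEN range, so files outside it are never read. To check which days were registered, you can list the partitions:

SHOW PARTITIONS yourtable;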

The idea:

  • Let Hive manage the table as a partitioned table.
  • Query the range directly in HiveQL (select * ... where dt between '2016-06-04' and '2016-08-03').

The steps:

  • Create a partitioned Hive table in the warehouse.
  • Move each daily file into its own partition directory: /path/to/files/2016/07/31.csv goes under /dbname.db/tableName/dt=2016-07-31, so the data ends up as /dbname.db/tableName/dt=2016-07-31/file1.csv, /dbname.db/tableName/dt=2016-08-01/file1.csv, /dbname.db/tableName/dt=2016-08-02/file1.csv, and so on.

  • Register each directory: alter table tableName add partition (dt='2016-07-31');
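If moving the files is not an option, alter table ... add partition also accepts a location clause, so a partition can point at any HDFS directory; a minimal sketch, assuming each day already sits in its own directory (the path here is illustrative):

alter table tableName add partition (dt='2016-07-31')
location '/path/to/files/dt=2016-07-31/';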


Reading the Hive table from spark-shell. Suppose the partitioned data lives at:

/path/to/data/user_info/dt=2016-07-31/0000-0

1. Create the table DDL:

val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"

2. Run it:

spark.sql(sql)

3. Register the partition (the date is quoted because dt is a string):

val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")

Now you can select data from the table

val df = spark.sql("select * from user_info")

Source: https://habr.com/ru/post/1650140/

