Read multiple files into a Hive table by date

Suppose I store one file per day in the format:

/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv

How can I read these files into a single Hive table for a given date range (e.g. from 2016-06-04 to 2016-08-03)?

3 answers

Assuming all files follow the same pattern, I would suggest that you store files with the following naming convention:

/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv

Then you can create an external table, partitioned by dt and pointing to the location /path/to/files/:

CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'

If you have many partitions and do not want to write an alter table yourtable add partition ... statement for each of them, you can simply use the repair command, which will discover and add the partitions automatically:

msck repair table yourtable
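If you do want to add the partitions by hand instead, the per-day form that msck repair saves you from writing looks like this (dates taken from the example layout above):

ALTER TABLE yourtable ADD PARTITION (dt='2016-07-31');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-01');
ALTER TABLE yourtable ADD PARTITION (dt='2016-08-02');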

Then you can query the table for the desired date range:
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
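Since dt is a partition column, Hive prunes the scan to the directories whose dt falls inside the BETWEEN range, so files outside it are never read. To check which days were registered, you can list the partitions:

SHOW PARTITIONS yourtable;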

The idea:

  • Let Hive manage the table as a partitioned table.
  • Query the range directly in HiveQL (select * ... where dt between '2016-06-04' and '2016-08-03').

The steps:

  • Create a partitioned Hive table in the warehouse.
  • Move each daily file into its own partition directory: /path/to/files/2016/07/31.csv goes under /dbname.db/tableName/dt=2016-07-31, so the data ends up as /dbname.db/tableName/dt=2016-07-31/file1.csv, /dbname.db/tableName/dt=2016-08-01/file1.csv, /dbname.db/tableName/dt=2016-08-02/file1.csv, and so on.

  • Register each directory: alter table tableName add partition (dt='2016-07-31');
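If moving the files is not an option, alter table ... add partition also accepts a location clause, so a partition can point at any HDFS directory; a minimal sketch, assuming each day already sits in its own directory (the path here is illustrative):

alter table tableName add partition (dt='2016-07-31')
location '/path/to/files/dt=2016-07-31/';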


Reading the Hive table from spark-shell. Suppose the partitioned data lives at:

/path/to/data/user_info/dt=2016-07-31/0000-0

1. Create the table DDL:

val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"

2. Run it:

spark.sql(sql)

3. Register the partition (the date is quoted because dt is a string):

val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")

Now you can select data from the table

val df = spark.sql("select * from user_info")

Source: https://habr.com/ru/post/1650140/

