I have some web server logs that I would like to query using Hive. The directory structure in HDFS is as follows:
/data/access/web1/2014/09
/data/access/web1/2014/09/access-20140901.log
[... etc ...]
/data/access/web1/2014/10
/data/access/web1/2014/10/access-20141001.log
[... etc ...]
/data/access/web2/2014/09
/data/access/web2/2014/09/access-20140901.log
[... etc ...]
/data/access/web2/2014/10
/data/access/web2/2014/10/access-20141001.log
[... etc ...]
I can create an external table:
CREATE EXTERNAL TABLE access(
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s")
LOCATION '/data/access/'
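As a sanity check on the regex itself, independent of Hive, here is a minimal Python sketch that applies the same pattern to one line. The sample line is made up (Apache combined log format, not taken from my real logs), and I am assuming that Python's `re.fullmatch` is a fair stand-in for the whole-line match the SerDe performs. The column names come from the table definition above; `parse_line` is just an illustrative helper.

```python
import re

# Same pattern as "input.regex" in the table definition, with the
# HiveQL string escaping (\\ and \") removed.
ACCESS_LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Column names in the same order as the table definition.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]


def parse_line(line):
    """Return a dict of column -> value, or None if the line does not match."""
    match = ACCESS_LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical sample line, not taken from the real logs.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/4.08"')

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Note that the captured values keep their surrounding brackets and quotes (for example, the `time` value still includes `[` and `]`), since the groups in the pattern include them.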
However, Hive does not descend into the subdirectories unless I run the following commands before running the query:
set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;
I have seen other posts that set these properties at the table level (for example, "Problem creating an external Hive table using tblproperties"):
TBLPROPERTIES ("hive.input.dir.recursive" = "TRUE", "hive.mapred.supports.subdirectories" = "TRUE", "hive.supports.subdirectories" = "TRUE", "mapred.input.dir.recursive" = "TRUE");
Unfortunately, this did not work for me: the table returns no records when queried. I understand that these properties can be set in hive-site.xml, but I would prefer not to make changes that could affect other users unless I have to.
Q) Is there a way to create a table that descends into subdirectories without using partitions, making site-wide changes, or running these four commands each time?