I want to extract (ip, requestUrl, timeStamp) from access logs and load them into a Hive table. A sample line from the access log looks like this:
66.249.68.6 - - [14/Jan/2012:06:25:03 -0800] "GET /example.com HTTP/1.1" 200 708 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
I have tried the following, along with several other regex variations, without success. The table loads, but every column is NULL, which indicates that the regular expression does not match the input.
CREATE TABLE access_log (
    remote_ip STRING,
    request_date STRING,
    method STRING,
    request STRING,
    protocol STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    "input.regex" = "([^ ]) . . [([^]]+)] \"([^ ]) ([^ ]) ([^ \"])\" *",
    "output.format.string" = "%1$s %2$s %3$s %4$s %5$s"
)
STORED AS TEXTFILE;
I am not very experienced with regex. Can someone help me with this?
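For reference, a candidate pattern can be checked against the sample line outside Hive before loading any data. The following is a minimal Python sketch, not a verified Hive configuration: `PATTERN` and `parse` are hypothetical names, and the pattern itself is an assumption about what the five columns should capture. It uses only constructs that Python's `re` and Java's regex engine share. As posted, `([^ ])` matches exactly one non-space character, and the unescaped `[` ... `]` around the timestamp reads as a character class rather than literal brackets, so the sketch adds `*` quantifiers and escapes the brackets.

```python
import re

# Sample line copied from the question.
LINE = (
    '66.249.68.6 - - [14/Jan/2012:06:25:03 -0800] '
    '"GET /example.com HTTP/1.1" 200 708 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# Candidate pattern (an assumption for illustration):
# ip, two ignored fields, [timestamp], "method request protocol", then the rest.
PATTERN = r'([^ ]*) [^ ]* [^ ]* \[([^\]]*)\] "([^ ]*) ([^ ]*) ([^ "]*)".*'


def parse(line):
    """Return the five captured groups, or None if the whole line does not match."""
    # fullmatch is used on the assumption that the SerDe needs the pattern
    # to cover the entire line, not just a prefix of it.
    m = re.fullmatch(PATTERN, line)
    return m.groups() if m else None


print(parse(LINE))
```

If the pattern is carried over into the `input.regex` property, each backslash would probably need to be doubled inside the HiveQL string literal (for example `\\[`), and the inner double quotes escaped as in the original statement.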