Amazon Athena does not parse CloudFront logs

I am following the Athena Getting Started guide and trying to parse my own CloudFront logs, but the fields are not being parsed.

I used a small test file as shown below:

 #Version: 1.0
 #Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
 2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https 421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
 2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss

And created a table with this SQL:

 CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
   `Date` DATE,
   Time STRING,
   Location STRING,
   Bytes INT,
   RequestIP STRING,
   Method STRING,
   Host STRING,
   Uri STRING,
   Status INT,
   Referrer STRING,
   os STRING,
   Browser STRING,
   BrowserVersion STRING
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
   "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
 )
 LOCATION 's3://test/athena-csv/'

But the data is not returned:

[Screenshot: Athena query results with empty column values]

I can see it returns 4 rows, but the first 2 should have been excluded because they start with the # character, so it seems the regular expression is not being applied correctly.

Am I doing something wrong? Or is the regex wrong? (That seems unlikely, since it comes from the docs and looks fine to me.)

+6
6 answers

Here is what I ended up with:

 CREATE EXTERNAL TABLE logs (
   `date` date,
   `time` string,
   `location` string,
   `bytes` int,
   `request_ip` string,
   `method` string,
   `host` string,
   `uri` string,
   `status` int,
   `referer` string,
   `useragent` string,
   `uri_query` string,
   `cookie` string,
   `edge_type` string,
   `edge_request_id` string,
   `host_header` string,
   `cs_protocol` string,
   `cs_bytes` int,
   `time_taken` string,
   `x_forwarded_for` string,
   `ssl_protocol` string,
   `ssl_cipher` string,
   `result_type` string,
   `protocol` string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
   'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
 )
 LOCATION 's3://logs'

Note that double backslashes are intentional.
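Before creating the table, the pattern can be sanity-checked locally. Here is a small Python sketch (Python, and the dummy field values, are purely for illustration and not part of the Athena setup) that rebuilds the same 24-group expression and confirms both behaviors the answer relies on: a 23-field (older-format) record still matches with an empty last group, and # comment lines are rejected:

```python
import re

# Rebuild the 24-group pattern from the answer: 23 mandatory
# whitespace-separated fields plus an optional trailing "protocol" field.
pattern = r'^(?!#.*)' + r'\s+'.join([r'(\S+)'] * 23) + r'\s*(\S*)'

# A 23-field (older-format) record: the optional 24th group matches empty.
old_line = '\t'.join('field%02d' % i for i in range(23))
m = re.match(pattern, old_line)
print(m.group(1), m.group(23), repr(m.group(24)))  # field00 field22 ''

# Comment lines are rejected by the (?!#.*) negative lookahead.
print(re.match(pattern, '#Version: 1.0'))  # None
```

The single backslashes here are the Python form; in the Hive DDL string they must be doubled, which is why the answer stresses that the double backslashes are intentional.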

The CloudFront log format changed at some point to add the protocol field. This table handles both older and newer files.

+8

In fact, all the answers here have a slight error: the 4th field should be BIGINT, not INT. Otherwise, requests involving files larger than 2 GB will not be parsed correctly. After a lengthy discussion with AWS Business Support, it turned out that the correct format is:

 CREATE EXTERNAL TABLE your_table_name (
   `Date` DATE,
   Time STRING,
   Location STRING,
   SCBytes BIGINT,
   RequestIP STRING,
   Method STRING,
   Host STRING,
   Uri STRING,
   Status INT,
   Referrer STRING,
   UserAgent STRING,
   UriQS STRING,
   Cookie STRING,
   ResultType STRING,
   RequestId STRING,
   HostHeader STRING,
   Protocol STRING,
   CSBytes BIGINT,
   TimeTaken FLOAT,
   XForwardFor STRING,
   SSLProtocol STRING,
   SSLCipher STRING,
   ResponseResultType STRING,
   CSProtocolVersion STRING
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 LOCATION 's3://path_to_your_data_directory'
 TBLPROPERTIES ('skip.header.line.count' = '2')
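The 2 GB cutoff comes from Hive's INT being a signed 32-bit integer, so a byte count just above 2 GiB no longer fits. A quick arithmetic check (Python used purely for illustration):

```python
# Hive's INT is a signed 32-bit integer, so sc-bytes overflows
# just above 2 GiB; BIGINT (signed 64-bit) is safe for any file size.
INT_MAX = 2**31 - 1
print(INT_MAX)                 # 2147483647, i.e. just under 2 GiB

big_file = 3 * 1024**3         # a hypothetical 3 GiB response body
print(big_file > INT_MAX)      # True: does not fit in INT
print(big_file < 2**63 - 1)    # True: fits comfortably in BIGINT
```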
+4

After pulling my hair out over this, and improving on @CoderDan's answer:

The secret is to use \t to match the separators in the regular expression instead of \s.

 CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
   `date` date,
   `time` string,
   `location` string,
   `bytes` int,
   `request_ip` string,
   `method` string,
   `host` string,
   `uri` string,
   `status` int,
   `referer` string,
   `useragent` string,
   `uri_query` string,
   `cookie` string,
   `edge_type` string,
   `edge_request_id` string,
   `host_header` string,
   `cs_protocol` string,
   `cs_bytes` int,
   `time_taken` int,
   `x_forwarded_for` string,
   `ssl_protocol` string,
   `ssl_cipher` string,
   `result_type` string,
   `protocol_version` string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
   'serialization.format' = '1',
   'input.regex' = '^(?!#.*)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$'
 )
 LOCATION 's3://mybucket/myprefix/';
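The tab-based separators matter because \s matches spaces as well as tabs, so any field containing a literal space would throw off every later column. A small Python illustration of the two styles (the field values here are made up just to show the mechanics; in real CloudFront logs spaces inside fields are percent-encoded):

```python
import re

# A tab-delimited record whose first field contains a literal space.
line = 'Mozilla/5.0 (Macintosh)\t/index.html\t200'

# Tab-based groups keep the embedded space inside the first field...
tabs = re.match(r'([^\t]+)\t+([^\t]+)\t+([^\t]+)', line)
print(tabs.groups())   # ('Mozilla/5.0 (Macintosh)', '/index.html', '200')

# ...while \s-based groups split on that space and misalign the columns.
ws = re.match(r'(\S+)\s+(\S+)\s+(\S+)', line)
print(ws.groups())     # ('Mozilla/5.0', '(Macintosh)', '/index.html')
```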
+1

Athena is case insensitive and treats every column name as lowercase. Try defining the Athena table, and writing your queries, with lowercase names.

0

The demo didn't work for me either. After playing with it a bit, I came up with the following:

 CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
   `date` date,
   `time` string,
   `location` string,
   `bytes` int,
   `request_ip` string,
   `method` string,
   `host` string,
   `uri` string,
   `status` int,
   `referer` string,
   `useragent` string,
   `uri_query` string,
   `cookie` string,
   `edge_type` string,
   `edge_request_id` string,
   `host_header` string,
   `cs_protocol` string,
   `cs_bytes` int,
   `time_taken` string,
   `x_forwarded_for` string,
   `ssl_protocol` string,
   `ssl_cipher` string,
   `result_type` string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
   'serialization.format' = '1',
   'input.regex' = '^(?!#.*)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$'
 )
 LOCATION 's3://bucket/logs/';

Replace the bucket/logs location and DBNAME.TABLENAME with your own. For some reason it still produces blank rows for the # comment lines, but I get the rest of the data.

I think the next step is to try parsing out the user agent or the cookies.

0

This one worked for me. I started here, but I had to add the "protocol" column.

 CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
   `date` DATE,
   time STRING,
   location STRING,
   bytes BIGINT,
   request_ip STRING,
   method STRING,
   host STRING,
   uri STRING,
   status INT,
   referrer STRING,
   user_agent STRING,
   query_string STRING,
   cookie STRING,
   result_type STRING,
   request_id STRING,
   host_header STRING,
   request_protocol STRING,
   request_bytes BIGINT,
   time_taken FLOAT,
   xforwarded_for STRING,
   ssl_protocol STRING,
   ssl_cipher STRING,
   response_result_type STRING,
   http_version STRING,
   fle_status STRING,
   fle_encrypted_fields INT,
   protocol string
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 LOCATION 's3://bucketname/prefix/'
 TBLPROPERTIES ( 'skip.header.line.count'='2' )
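The DELIMITED plus skip-header approach amounts to dropping the two # lines and splitting each remaining record on tabs. That behavior can be mimicked locally to check a log file before pointing Athena at it; a Python sketch with made-up sample records:

```python
import csv
import io

# A miniature log in the CloudFront layout: two "#" header lines,
# then tab-separated records (sample values are made up).
log = (
    '#Version: 1.0\n'
    '#Fields: date time location\n'
    '2016-02-02\t07:57:45\tLHR5\n'
    '2016-02-02\t07:57:46\tLHR5\n'
)

reader = csv.reader(io.StringIO(log), delimiter='\t')
rows = list(reader)[2:]   # mirrors skip.header.line.count = '2'
print(rows)               # [['2016-02-02', '07:57:45', 'LHR5'], ...]
```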
0

Source: https://habr.com/ru/post/1015483/

