How do I skip the header row when reading CSV data from S3 and creating a table in AWS Athena?

I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table I created does not skip the header row of my CSV file.

Example query:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  `event_type_id` string,
  `customer_id` string,
  `date` string,
  `email` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",
  "quoteChar" = "\""
)
LOCATION 's3://location/'
TBLPROPERTIES ("skip.header.line.count"="1");

skip.header.line.count does not seem to have any effect: the header row still comes back in query results. It looks like AWS has a problem honoring this property. Is there another way to work around this?

2 answers

This is what works in Redshift:

You want to use table properties ('skip.header.line.count'='1'), along with any other properties you need, for example 'numRows'='100'. Here's a sample:

create external table exreddb1.test_table (
  ID BIGINT,
  NAME VARCHAR
)
row format delimited fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
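For context, exreddb1 in the sample above is a Redshift Spectrum external schema, which has to exist before the CREATE EXTERNAL TABLE statement will run. A minimal sketch of creating one, assuming a Glue Data Catalog database and an IAM role that are placeholders, not part of the answer:

 -- hypothetical external schema backing the example above;
 -- 'myspectrum_db' and the role ARN are placeholders
 create external schema exreddb1
 from data catalog
 database 'myspectrum_db'
 iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
 create external database if not exists;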

This is a known flaw.

The best workaround I've seen came from Eric Hammond on Twitter:

 ...WHERE date NOT LIKE '#%' 

This seems to skip the header lines at query time rather than at table creation time. I'm not sure exactly how it works, but it appears to simply filter out the rows that are header noise rather than real data.
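A minimal sketch of how that filter could look against the table from the question (the column names come from the question's DDL; the '#%' pattern assumes the header lines start with '#', so adjust it to match whatever your header row actually contains):

 -- query-time workaround: filter out header lines instead of relying on
 -- skip.header.line.count; "date" is double-quoted to avoid the DATE keyword
 SELECT event_type_id, customer_id, "date", email
 FROM table_name
 WHERE "date" NOT LIKE '#%';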

