How do I skip the header row when reading CSV data from S3 and creating a table in AWS Athena?

I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table I created does not skip the header row of my CSV file.

Example query:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  `event_type_id` string,
  `customer_id` string,
  `date` string,
  `email` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",
  "quoteChar" = "\""
)
LOCATION 's3://location/'
TBLPROPERTIES ("skip.header.line.count"="1");

skip.header.line.count does not seem to have any effect: the header row still comes back in query results. It looks like AWS has a problem honoring this property. Is there another way to work around this?

2 answers

This is what works in Redshift:

You want to use table properties ('skip.header.line.count'='1'), along with any other properties you need, for example 'numRows'='100'. Here's a sample:

create external table exreddb1.test_table (
  ID BIGINT,
  NAME VARCHAR
)
row format delimited fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');
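For context, exreddb1 in the sample above is a Redshift Spectrum external schema, which has to exist before the CREATE EXTERNAL TABLE statement will run. A minimal sketch of creating one, assuming a Glue Data Catalog database and an IAM role that are placeholders, not part of the answer:

 -- hypothetical external schema backing the example above;
 -- 'myspectrum_db' and the role ARN are placeholders
 create external schema exreddb1
 from data catalog
 database 'myspectrum_db'
 iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
 create external database if not exists;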

This is a known flaw.

The best workaround I've seen came from Eric Hammond on Twitter:

 ...WHERE date NOT LIKE '#%' 

This seems to skip the header lines at query time rather than at table creation time. I'm not sure exactly how it works, but it appears to simply filter out the rows that are header noise rather than real data.
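A minimal sketch of how that filter could look against the table from the question (the column names come from the question's DDL; the '#%' pattern assumes the header lines start with '#', so adjust it to match whatever your header row actually contains):

 -- query-time workaround: filter out header lines instead of relying on
 -- skip.header.line.count; "date" is double-quoted to avoid the DATE keyword
 SELECT event_type_id, customer_id, "date", email
 FROM table_name
 WHERE "date" NOT LIKE '#%';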

