CSV analysis using aws athena

I am parsing a csv file using AWS athena from java code. Some columns in csv have a date type, and one column has a value in the value.

If the athena table is created using

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

then it cannot parse the semicolon column correctly

However, it parses correctly if I use

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

But the problem with OpenCSVSerde is that all columns require a string data type, and I need to perform date operations on the query, so we cannot use OpenCSVSerde .

Any other solution? Please, help!

+5
source share
1 answer

How these two SerDes are created, you should use only LazySimpleSerDe in cases where your data is relatively clean, for example, it does not have values ​​enclosed in quotation marks or without separators in the value. And OpenCSVSerde works great for deserializing CSV files that have quoted values; however, all columns in the table are of the STRING data type. More here

So, in your case, since your data is not clean, the only way to parse it and load it into Athena is to use OpenCSVSerde . And if you need to use date operations, you need to manually convert / parse the date strings into a date object, which is pretty easy to do with the date_parse function.

Tell me if you have the following string data in a date type column:

 11/13/2017 11/14/2017 11/15/2017 11/16/2017 

You can use the following query to select a date in a range

 select * from somedb.sometable where date_parse(createdate, '%m/%d/%Y') between DATE'2017-11-14' and DATE'2017-11-16'; 
+2
source

Source: https://habr.com/ru/post/1268964/


All Articles