I am creating CSV files containing log data for an application. The columns in the CSV file are timestamp, source_address, destination_url, and request_type. When I upload a CSV file to BigQuery, it simply appends the data to an existing table. I would like to avoid duplicates of the set (source_address, destination_url, request_type) and only keep the latest timestamp for each such set.
One way I thought of doing this is to GROUP BY source_address, destination_url, request_type and take MAX(timestamp), but that means I would have to save the result of that query in a new table and then copy it back over the original table, into which I periodically load the CSV file(s).
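To make this concrete, the sketch below is roughly what I have in mind (the dataset and table names are just placeholders):

```sh
# Sketch of the de-duplication query, writing the result to a separate table.
# mydataset.request_log and mydataset.request_log_dedup are placeholder names.
bq query \
  --use_legacy_sql=false \
  --destination_table=mydataset.request_log_dedup \
  --replace \
  'SELECT
     source_address,
     destination_url,
     request_type,
     MAX(timestamp) AS timestamp
   FROM mydataset.request_log
   GROUP BY source_address, destination_url, request_type'
```

(As I understand it, --replace overwrites the destination table, so the same command could be re-run after every load.)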
Is there a better way to do this? The duplicates would be fine, except for the fact that Google charges based on how much data a query processes.
---- EDIT # 1 ----
I am also completely open to ways of removing the duplicates from the CSV data before loading it into BigQuery, so if anyone has interesting ideas on how to use command-line tools to deduplicate or merge CSV files based on specific column indices, I would like to hear about them.
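For example, something along the lines of the sketch below is what I am imagining; it assumes the column order above (timestamp first), no header row, no embedded commas in fields, and timestamps that compare correctly as plain strings (e.g., ISO 8601):

```sh
# Keep only the row with the latest timestamp for each
# (source_address, destination_url, request_type) combination.
# Assumes columns: timestamp,source_address,destination_url,request_type
awk -F, '
{
    key = $2 FS $3 FS $4
    if (!(key in latest) || $1 > latest[key]) {
        latest[key] = $1
        row[key] = $0
    }
}
END { for (k in row) print row[k] }
' logfile.csv > deduped.csv
```

The output order is arbitrary, but I assume that does not matter for loading into BigQuery.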
---- EDIT # 2 ----
Looking at sort, it seems I could sort the file by the timestamp column in reverse order and then do a unique sort on columns 2-4, something like `sort -t, -k1,1 -r logfile.csv | sort -u -t, -k2,4`. Would that work, i.e., would it keep the row with the latest timestamp for each (source_address, destination_url, request_type) set, or is the row that the unique pass keeps not guaranteed?