There are several ways to schedule this task. How do you orchestrate your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you can run the following CLI command:
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
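For example, if Airflow is your scheduler, a small DAG can shell out to that same command on a schedule. This is only a minimal sketch assuming Airflow 2.x; the DAG id, schedule, and output location are placeholders you would replace with your own.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="athena_msck_repair",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",        # run once a day; adjust to your load pattern
    catchup=False,
) as dag:
    repair_partitions = BashOperator(
        task_id="msck_repair_table",
        bash_command=(
            'aws athena start-query-execution '
            '--query-string "MSCK REPAIR TABLE some_database.some_table" '
            '--result-configuration "OutputLocation=s3://SOMEPLACE"'
        ),
    )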
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda function could look like this:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')

    # Where Athena writes the query results, and how they are encrypted
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    sql = 'MSCK REPAIR TABLE some_database.some_table'

    # Named query_context so it does not shadow the handler's context argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
Then you would configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket, as in the sketch below.
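One way to wire up that trigger is with boto3's put_bucket_notification_configuration. The function ARN and account id below are hypothetical, and the Lambda must already allow s3.amazonaws.com to invoke it (for example via the Lambda add_permission call).

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda for every object created under the DATA/ prefix.
# The LambdaFunctionArn is a placeholder -- substitute your own function's ARN.
s3.put_bucket_notification_configuration(
    Bucket='some_bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:msck-repair',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'DATA/'}
            ]}}
        }]
    }
)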