How to make MSCK REPAIR TABLE automatically run in AWS Athena

I have a Spark batch job that runs hourly. Each run generates and stores new data in S3 under the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.

After uploading the data to S3, I want to query it using Athena. I would also like to visualize it in QuickSight by connecting to Athena as a data source.

The problem is that after each run of my Spark batch job, the newly generated data stored in S3 is not discovered by Athena unless I manually run the query MSCK REPAIR TABLE.

Is there a way to make Athena update the data automatically, so that I can build a fully automated data visualization pipeline?

+19
2 answers

There are several ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?

From any of these, you can run the following CLI command.

$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
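Keep in mind that start-query-execution only submits the query and returns immediately. If the scheduler needs to know whether the repair actually succeeded, something like the following Python sketch could be used. The helper names repair_table and wait_for_query, the polling interval, and the poll limit are my own assumptions; the client is expected to be a boto3 Athena client.

```python
import time


def wait_for_query(client, query_execution_id, poll_seconds=2.0, max_polls=30):
    """Poll Athena until the query reaches a terminal state and return that state."""
    for _ in range(max_polls):
        response = client.get_query_execution(QueryExecutionId=query_execution_id)
        state = response["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Query {query_execution_id} did not finish in time")


def repair_table(client, database, table, output_location, poll_seconds=2.0):
    """Submit MSCK REPAIR TABLE and wait for it to finish (hypothetical helper)."""
    response = client.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {database}.{table}",
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return wait_for_query(client, response["QueryExecutionId"], poll_seconds)
```

It could be called as repair_table(boto3.client('athena'), 'some_database', 'some_table', 's3://SOMEPLACE') from whichever scheduler you use.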

Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.

An example Lambda function could be written like this:

import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')

    # Where Athena writes the query results
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query execution parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    # Named query_context so it does not shadow the handler's context argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)

Then you would configure a trigger to execute your Lambda function when new data is added under the DATA/ prefix in your bucket.
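If the trigger is broader than the DATA/ prefix, the handler can also check the incoming event itself. Below is a minimal sketch, assuming the standard S3 event notification layout (Records, then s3, then object, then key). Object keys arrive URL-encoded in these events, so the = in YEAR=... shows up as %3D and has to be decoded first. The function name new_data_keys and the DATA_PREFIX constant are hypothetical.

```python
from urllib.parse import unquote_plus

# Assumed prefix, taken from the directory layout in the question
DATA_PREFIX = "DATA/"


def new_data_keys(event, prefix=DATA_PREFIX):
    """Return the decoded object keys in an S3 event that sit under the prefix."""
    keys = []
    for record in event.get("Records", []):
        # S3 event notifications URL-encode the object key
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix):
            keys.append(key)
    return keys
```

The handler could then call start_query_execution only when new_data_keys(event) is non-empty.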

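If you already use a scheduler to manage your Spark job, I would run the command from there. Otherwise, AWS Lambda is a convenient option, as described above.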

+13

You should run ADD PARTITION instead:

aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."

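This adds the newly created partition from your S3 location. Athena uses Hive for partitioning data. To create a table with partitions, you must define them in the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition the data.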
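As a rough sketch of the ADD PARTITION approach, the helper below builds the statement from the S3 key of a newly written file, following the DATA/YEAR=?/MONTH=?/DATE=?/datafile layout from the question. The function name add_partition_sql, the default table name, and the assumption that the partition columns are string-typed and named year, month and date are all mine; adjust them to match your CREATE TABLE statement. The column names are wrapped in backticks in case any of them collides with a reserved word.

```python
import re

# Matches the partition part of a key such as DATA/YEAR=2024/MONTH=01/DATE=15/datafile
PARTITION_PATTERN = re.compile(
    r"YEAR=(?P<year>[\w-]+)/MONTH=(?P<month>[\w-]+)/DATE=(?P<date>[\w-]+)/"
)


def add_partition_sql(bucket, key, table="some_database.some_table"):
    """Build an ALTER TABLE ... ADD PARTITION statement for one object key."""
    match = PARTITION_PATTERN.search(key)
    if match is None:
        raise ValueError(f"Key does not follow the partition layout: {key}")
    # Everything up to and including the DATE=.../ directory is the partition location
    location = f"s3://{bucket}/{key[:match.end()]}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (`year`='{match['year']}', `month`='{match['month']}', "
        f"`date`='{match['date']}') "
        f"LOCATION '{location}'"
    )
```

The resulting string can be passed as the query string to start-query-execution. IF NOT EXISTS keeps the statement safe to repeat when several files land in the same partition.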

+2

Source: https://habr.com/ru/post/1690000/

