There are several ways to schedule this task. How do you orchestrate your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you can run the following CLI command:
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
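For example, if Airflow is your scheduler, a small DAG can shell out to that same command on a schedule. This is only a minimal sketch assuming Airflow 2.x; the DAG id, schedule, and output location are placeholders you would replace with your own.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="athena_msck_repair",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",        # run once a day; adjust to your load pattern
    catchup=False,
) as dag:
    repair_partitions = BashOperator(
        task_id="msck_repair_table",
        bash_command=(
            'aws athena start-query-execution '
            '--query-string "MSCK REPAIR TABLE some_database.some_table" '
            '--result-configuration "OutputLocation=s3://SOMEPLACE"'
        ),
    )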
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda function could look like this:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'
    client = boto3.client('athena')

    # Where Athena writes the query results, and how they are encrypted
    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    sql = 'MSCK REPAIR TABLE some_database.some_table'

    # Named query_context so it does not shadow the handler's context argument
    query_context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=query_context,
                                 ResultConfiguration=config)
Then you would configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket, as in the sketch below.
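One way to wire up that trigger is with boto3's put_bucket_notification_configuration. The function ARN and account id below are hypothetical, and the Lambda must already allow s3.amazonaws.com to invoke it (for example via the Lambda add_permission call).

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda for every object created under the DATA/ prefix.
# The LambdaFunctionArn is a placeholder -- substitute your own function's ARN.
s3.put_bucket_notification_configuration(
    Bucket='some_bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:msck-repair',
            'Events': ['s3:ObjectCreated:*'],
            'Filter': {'Key': {'FilterRules': [
                {'Name': 'prefix', 'Value': 'DATA/'}
            ]}}
        }]
    }
)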