Reading gzip file contents with AWS S3 in Python

I am trying to read some of the logs from a Hadoop process that I am running in AWS. The logs are stored in an S3 folder and have the following path:

    bucketname = name
    key = y/z/stderr.gz

Here y is the cluster identifier and z is the folder name. Both of them act as folders (objects) in AWS, so with a bucket named x the full path looks like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I do not want to download the file to my system; I want to keep the contents in a Python variable.

This is what I have tried so far.

    bucket_name = "name"
    key = "y/z/stderr.gz"
    obj = s3.Object(bucket_name, key)
    n = obj.get()['Body'].read()

This gives me the raw bytes, which are not readable. I also tried

 n = obj.get()['Body'].read().decode('utf-8') 

which raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte. (That 0x8b is the second byte of the gzip magic number \x1f\x8b, a sign the data is still compressed.)

I also tried

    gzip = StringIO(obj)
    gzipfile = gzip.GzipFile(fileobj=gzip)
    content = gzipfile.read()

This raises IOError: Not a gzipped file.

Not sure how to decode this .gz file.

Edit: found a solution. I needed to pass n into BytesIO:

 gzip = BytesIO(n) 
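As a minimal sketch (using the bucket and key names from the question as placeholders), the standard library's gzip.decompress does the same thing without the file-object wrapper:

    import boto3
    import gzip

    s3 = boto3.resource('s3')
    obj = s3.Object('name', 'y/z/stderr.gz')   # placeholder bucket/key from the question
    n = obj.get()['Body'].read()               # raw gzip-compressed bytes
    text = gzip.decompress(n).decode('utf-8')  # decompress first, then decode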
6 answers

@Amit, I tried to do the same thing to test the decoding, and got your code to run with a few changes. I just had to remove the def and the return statement, and rename the gzip variable, since that name shadows the gzip module.

    import json
    import boto3
    from io import BytesIO
    import gzip

    try:
        s3 = boto3.resource('s3')
        key = 'YOUR_FILE_NAME.gz'
        obj = s3.Object('YOUR_BUCKET_NAME', key)
        n = obj.get()['Body'].read()
        gzipfile = BytesIO(n)
        gzipfile = gzip.GzipFile(fileobj=gzipfile)
        content = gzipfile.read()
        print(content)
    except Exception as e:
        print(e)
        raise e
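Note that content is still bytes here. If the log is text, a short follow-up (assuming UTF-8 encoding) would be:

    # assumes the log file is UTF-8 text
    lines = content.decode('utf-8').splitlines()
    print(len(lines))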

You can use AWS S3 Select (SelectObjectContent) to read gzip content.

S3 Select is an Amazon S3 feature designed to retrieve only the data you need from an object, which can significantly improve performance and reduce the cost of applications that need access to data in S3.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format, and supports GZIP or BZIP2 compression for CSV and JSON objects.

Link: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html

    from io import StringIO
    import boto3
    import pandas as pd

    bucket = 'my-bucket'
    prefix = 'my-prefix'
    client = boto3.client('s3')

    for object in client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
        if object['Size'] <= 0:
            continue
        print(object['Key'])
        r = client.select_object_content(
            Bucket=bucket,
            Key=object['Key'],
            ExpressionType='SQL',
            Expression="select * from s3object",
            InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
            OutputSerialization={'CSV': {'QuoteFields': 'ASNEEDED',
                                         'RecordDelimiter': '\n',
                                         'FieldDelimiter': ',',
                                         'QuoteCharacter': '"',
                                         'QuoteEscapeCharacter': '"'}},
        )
        for event in r['Payload']:
            if 'Records' in event:
                records = event['Records']['Payload'].decode('utf-8')
                payloads = ''.join(r for r in records)
                try:
                    select_df = pd.read_csv(StringIO(payloads), error_bad_lines=False)
                    for row in select_df.iterrows():
                        print(row)
                except Exception as e:
                    print(e)
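One caveat: error_bad_lines was deprecated in pandas 1.3 in favor of on_bad_lines, so on a newer pandas the read would look like:

    # equivalent call for pandas >= 1.3, where error_bad_lines is deprecated
    select_df = pd.read_csv(StringIO(payloads), on_bad_lines='skip')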

Reading a .bz2 file from AWS S3 in Python:

    import boto3
    import bz2

    try:
        s3 = boto3.resource('s3')
        key = 'key_name.bz2'
        obj = s3.Object('bucket_name', key)
        nn = obj.get()['Body'].read()
        # bz2.decompress works directly on the compressed bytes
        content = bz2.decompress(nn).decode('utf-8')
        lines = content.split('\n')
        print(len(lines))
    except Exception as e:
        print(e)
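As an alternative sketch (assuming the same nn bytes as above), bz2.BZ2File gives a file-like interface that mirrors the gzip.GzipFile pattern:

    from io import BytesIO
    import bz2

    # BZ2File accepts a file object, so the compressed bytes can be wrapped in BytesIO
    with bz2.BZ2File(BytesIO(nn)) as f:
        content = f.read().decode('utf-8')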

Just as with ordinary variables, data can be kept as bytes in an in-memory buffer when we use the io module's BytesIO class for input/output operations.

Here is an example program to demonstrate this:

    import io

    stream_str = io.BytesIO(b"JournalDev Python: \x00\x01")
    print(stream_str.getvalue())

The getvalue() method returns the entire contents of the buffer as a bytes object.
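To make the difference from read() concrete, a small illustrative example:

    import io

    buf = io.BytesIO(b"abc")
    print(buf.read())      # b'abc' - read() consumes from the current position
    print(buf.read())      # b'' - the position is now at the end of the buffer
    print(buf.getvalue())  # b'abc' - getvalue() ignores the stream position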

So @Jean-François Fabre's answer is correct, and you should use

 gzip = BytesIO(n) 

For more information, read the following document:

https://docs.python.org/3/library/io.html


I tried the code above, but I am still getting the errors below:

  "errorMessage": "'_io.BytesIO' object has no attribute 'GzipFile'", "stackTrace": [ " File \"/var/task/lambda_function.py\", line 20, in lambda_handler\n raise e\n", " File \"/var/task/lambda_function.py\", line 14, in lambda_handler\n gzipfile = gzip.GzipFile(fileobj=gzip)\n" 

Below is my code (Python 3.7):

    import json
    import boto3
    from io import BytesIO
    import gzip

    def lambda_handler(event, context):
        try:
            s3 = boto3.resource('s3')
            key = 'test.gz'
            obj = s3.Object('athenaamit', key)
            n = obj.get()['Body'].read()
            #print(n)
            gzip = BytesIO(n)
            gzipfile = gzip.GzipFile(fileobj=gzip)
            content = gzipfile.read()
            print(content)
            return 'dddd'
        except Exception as e:
            print(e)
            raise e
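The traceback points at the real problem: gzip = BytesIO(n) rebinds the name gzip, so the next line calls GzipFile on a BytesIO object instead of the gzip module. Renaming the variable, as the first answer suggests, fixes it:

    # rename the buffer so it no longer shadows the gzip module
    buffer = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=buffer)
    content = gzipfile.read()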

Currently the file can be read as:

    role = 'role name'
    bucket = 'bucket name'
    data_key = 'data key'
    data_location = 's3://{}/{}'.format(bucket, data_key)

    # pandas reads s3:// URLs through the optional s3fs package
    data = pd.read_csv(data_location, compression='gzip',
                       header=0, sep=',', quotechar='"')
