I am looking for ways to read data from multiple partitioned directories in S3 using Python.
data_folder/time_in_numeric_format=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/time_in_numeric_format=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
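To make the layout concrete: each `key=value` directory name carries a partition value that a partition-aware reader would be expected to expose as a column. The snippet below is only a minimal illustration of that convention. `parse_partitions` is a hypothetical helper written for this post, not a pyarrow or fastparquet function.

```python
def parse_partitions(path):
    """Return the key=value directory segments of a path as a dict of strings."""
    partitions = {}
    # The last segment is the file name, so only the directories are inspected.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


example = (
    "data_folder/time_in_numeric_format=1/cur_date=20-12-2012/"
    "abcdsd0324324.snappy.parquet"
)
print(parse_partitions(example))
# {'time_in_numeric_format': '1', 'cur_date': '20-12-2012'}
```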
pyarrow's ParquetDataset module has the ability to read from partitions. So I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It threw the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
    self.metadata_path) = _make_manifest(path_or_paths, self.fs)
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
    .format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Based on the pyarrow documentation, I tried using s3fs as a file system, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
Which causes the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
    self.metadata_path) = _make_manifest(path_or_paths, self.fs)
  File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in _make_manifest
    if is_string(path_or_paths) and fs.isdir(path_or_paths):
AttributeError: module 's3fs' has no attribute 'isdir'
I am limited to using an ECS cluster, so Spark/PySpark is not an option.
Is there an easy way to read parquet files in Python from such partitioned directories in S3? I believe listing all the directories and then reading them one by one is not good practice, as suggested in this link. I need to convert the data into a pandas DataFrame for further processing, so I prefer options based on fastparquet or pyarrow. I am also open to other options in Python.
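For reference, the AttributeError in the second traceback indicates that the `s3fs` module itself was passed as `filesystem`, while `ParquetDataset` calls methods such as `isdir` on that argument. Below is a minimal sketch of what might work instead, passing a filesystem instance. It assumes that `s3fs.S3FileSystem()` can pick up AWS credentials from the environment. `read_partitioned_parquet` is a hypothetical helper name, and the bucket path is the same placeholder as above, written in the `bucket/key` form that s3fs accepts.

```python
def read_partitioned_parquet(path, filesystem=None):
    """Read a (possibly partitioned) parquet directory into a pandas DataFrame.

    `filesystem` should be a filesystem *instance* such as s3fs.S3FileSystem(),
    not the s3fs module. With filesystem=None, pyarrow treats `path` as local.
    """
    # Imported inside the function so the helper can be defined
    # even where pyarrow is not installed.
    import pyarrow.parquet as pq

    dataset = pq.ParquetDataset(path, filesystem=filesystem)
    # read() returns a pyarrow Table; to_pandas() converts it to a DataFrame.
    return dataset.read().to_pandas()


# Usage (assumption: AWS credentials are available to s3fs):
#   import s3fs
#   fs = s3fs.S3FileSystem()
#   df = read_partitioned_parquet("my_bucker/path/to/data_folder", filesystem=fs)
```

Whether the partition keys come back as columns, and with which dtypes, may depend on the pyarrow version, so the result is worth checking on a small subset first.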