I need to read Parquet data from AWS S3. If I use the AWS SDK for this, I can get the input stream as follows:
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();
But the Apache Parquet reader only accepts a local file, like this:
ParquetReader<Group> reader =
        ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
                .withConf(conf)
                .build();
reader.read();
So I do not know how to read Parquet from an input stream. For comparison, for CSV files there is a CSVParser that can read directly from an input stream.
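To show what I mean, here is a sketch of the CSV case (assuming Apache Commons CSV; the column index is just an example):

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Wrap the S3 object content in a Reader and feed it straight to the parser
try (CSVParser parser = CSVFormat.DEFAULT.parse(
        new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
    for (CSVRecord record : parser) {
        System.out.println(record.get(0)); // values are read without any local file
    }
}

I would like something equivalent for Parquet.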
I know of a solution that uses Spark for this purpose, like this:
SparkSession spark = SparkSession
        .builder()
        .getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");
But I can't use Spark.
Can someone suggest a way to read Parquet data from S3?
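One direction I was wondering about is whether the Hadoop Path can point directly at S3 instead of a local file. This is only a rough sketch and I have not verified it; it assumes the hadoop-aws connector is on the classpath, and the fs.s3a.* values and the bucket name are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.example.GroupReadSupport;

Configuration conf = new Configuration();
// placeholder credentials; normally these would come from the environment
conf.set("fs.s3a.access.key", "<access key>");
conf.set("fs.s3a.secret.key", "<secret key>");

// Point the same ParquetReader at an s3a:// path instead of a local file
ParquetReader<Group> reader = ParquetReader
        .builder(new GroupReadSupport(), new Path("s3a://bucketName/file.parquet"))
        .withConf(conf)
        .build();

Group group;
while ((group = reader.read()) != null) {
    // process each record here
}
reader.close();

Is something like this supposed to work, or is there a better way?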