How do I convert a 500 GB SQL Server table to Apache Parquet?

This may well be documented somewhere, but I'm confused about how to do it (there are so many Apache tools).

When I create a SQL Server table, I use a command like the following:

CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   column3 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);

How do I convert this table to Parquet? Is the result written to disk as a file? If the source data is hundreds of GB, roughly how long should I expect the conversion to take?

Is it possible to write the initial source data directly in the Parquet format?

2 answers

Apache Spark can be used for this:

1. Load your table from the database via JDBC (the example below uses a MySQL connection string, but SQL Server works the same way with its JDBC driver).
2. Save it as a Parquet file.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Read the table over JDBC, then write it out as Parquet (a directory of part files)
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE",
                     properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
df.write.parquet("YOUR_HDFS_FILE")
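
For a 500 GB table, a plain JDBC read goes through a single connection and a single task. A minimal sketch of a partitioned read, assuming the SQL Server JDBC driver (mssql-jdbc) is on Spark's classpath and that the table has a numeric column to split on — the column name id, the bounds, and the connection details below are placeholders, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read in parallel: Spark opens numPartitions JDBC connections, each pulling
# one range of the (assumed) numeric "id" column between lowerBound and upperBound.
df = spark.read.jdbc(
    url="jdbc:sqlserver://YOUR_HOST:1433;databaseName=YOUR_DB",
    table="table_name",
    column="id",              # numeric/indexed column to partition on (assumption)
    lowerBound=1,             # approximate MIN(id)
    upperBound=100_000_000,   # approximate MAX(id)
    numPartitions=32,         # number of parallel JDBC connections
    properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"},
)

# Parquet output is a directory of part files written as partitions finish,
# so the whole table never has to fit in memory at once.
df.write.parquet("YOUR_HDFS_FILE")

With a read like this, the wall-clock time is bounded mostly by how fast SQL Server and the network can stream the rows rather than by Parquet encoding, and the result is written to disk (HDFS, S3, or a local path) as a set of files, which also answers the "is this written to disk" part of the question.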

Another option is Apache Sqoop (the name comes from "SQL-to-Hadoop"):

Sqoop is a tool designed to transfer bulk data between relational databases such as MySQL or Oracle and Hadoop (HDFS).
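
As a sketch, a Sqoop import can write Parquet directly with the --as-parquetfile flag; the connection string, credentials, target directory, and mapper count below are placeholders, and this assumes the SQL Server JDBC driver jar has been added to Sqoop's lib directory:

sqoop import \
  --connect "jdbc:sqlserver://YOUR_HOST:1433;databaseName=YOUR_DB" \
  --username YOUR_USER \
  --password YOUR_PASSWORD \
  --table table_name \
  --as-parquetfile \
  --target-dir /user/YOUR_USER/table_name_parquet \
  --num-mappers 8

--num-mappers controls how many parallel map tasks (and therefore JDBC connections) pull data, which matters at 500 GB; the split column defaults to the primary key and can be overridden with --split-by.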


Source: https://habr.com/ru/post/1665951/

