How do I convert a 500 GB SQL Server table to Apache Parquet?

This may well be documented somewhere, but I'm confused about how to do it (there are so many Apache tools).

When I create a SQL Server table, I use a command like the following:

CREATE TABLE table_name(
   column1 datatype,
   column2 datatype,
   column3 datatype,
   .....
   columnN datatype,
   PRIMARY KEY( one or more columns )
);

How do I convert this table to Parquet? Is the result written to disk as a file? If the source data is hundreds of GB, roughly how long should I expect the conversion to take?

Is it possible to write the initial source data directly in the Parquet format?

2 answers

Apache Spark can be used for this:

1. Load your table from the database via JDBC (the example below uses a MySQL connection string, but SQL Server works the same way with its JDBC driver).
2. Save it as a Parquet file.

Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Read the table over JDBC, then write it out as Parquet (a directory of part files)
df = spark.read.jdbc("YOUR_MYSQL_JDBC_CONN_STRING", "YOUR_TABLE",
                     properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"})
df.write.parquet("YOUR_HDFS_FILE")
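
For a 500 GB table, a plain JDBC read goes through a single connection and a single task. A minimal sketch of a partitioned read, assuming the SQL Server JDBC driver (mssql-jdbc) is on Spark's classpath and that the table has a numeric column to split on — the column name id, the bounds, and the connection details below are placeholders, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read in parallel: Spark opens numPartitions JDBC connections, each pulling
# one range of the (assumed) numeric "id" column between lowerBound and upperBound.
df = spark.read.jdbc(
    url="jdbc:sqlserver://YOUR_HOST:1433;databaseName=YOUR_DB",
    table="table_name",
    column="id",              # numeric/indexed column to partition on (assumption)
    lowerBound=1,             # approximate MIN(id)
    upperBound=100_000_000,   # approximate MAX(id)
    numPartitions=32,         # number of parallel JDBC connections
    properties={"user": "YOUR_USER", "password": "YOUR_PASSWORD"},
)

# Parquet output is a directory of part files written as partitions finish,
# so the whole table never has to fit in memory at once.
df.write.parquet("YOUR_HDFS_FILE")

With a read like this, the wall-clock time is bounded mostly by how fast SQL Server and the network can stream the rows rather than by Parquet encoding, and the result is written to disk (HDFS, S3, or a local path) as a set of files, which also answers the "is this written to disk" part of the question.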

Another option is Apache Sqoop (the name comes from "SQL-to-Hadoop"):

Sqoop is a tool designed to transfer bulk data between relational databases such as MySQL or Oracle and Hadoop (HDFS).
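
As a sketch, a Sqoop import can write Parquet directly with the --as-parquetfile flag; the connection string, credentials, target directory, and mapper count below are placeholders, and this assumes the SQL Server JDBC driver jar has been added to Sqoop's lib directory:

sqoop import \
  --connect "jdbc:sqlserver://YOUR_HOST:1433;databaseName=YOUR_DB" \
  --username YOUR_USER \
  --password YOUR_PASSWORD \
  --table table_name \
  --as-parquetfile \
  --target-dir /user/YOUR_USER/table_name_parquet \
  --num-mappers 8

--num-mappers controls how many parallel map tasks (and therefore JDBC connections) pull data, which matters at 500 GB; the split column defaults to the primary key and can be overridden with --split-by.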


Source: https://habr.com/ru/post/1665951/

