Transfer 50 TB of data from a local Hadoop cluster to Google Cloud Storage

I am trying to migrate the existing data (JSON files) in my Hadoop cluster to Google Cloud Storage.

I have looked at GSUtil, and it seems to be the recommended option for moving large datasets to GCS; it looks like it can handle huge amounts of data. However, it appears that GSUtil can only move data from a local machine to GCS (or between S3 and GCS); it cannot move data from a local Hadoop cluster directly.

  • What is the recommended way to move data from a local Hadoop cluster to GCS?

  • In the case of GSUtil, can it directly transfer data from the local Hadoop cluster (HDFS) to GCS, or do the files first need to be copied to a machine running GSUtil and then transferred to GCS?

  • What are the advantages and disadvantages of using the Google client-side library (Java API) versus GSUtil?

Many thanks,

+5
2 answers

Question 1: The recommended way to move data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine virtual machines, but you can also download the GCS connector directly: gcs-connector-1.2.8-hadoop1.jar if you are using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.
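For example, a rough sketch of fetching the Hadoop 1.x jar from the command line; the storage.googleapis.com/hadoop-lib path is an assumption about where Google has published connector builds, so if it does not resolve, download the jar from the connector page linked above instead:

 # Assumption: the connector jar is mirrored in Google's public hadoop-lib bucket.
 # If this URL is unavailable, download the jar manually from the connector page.
 wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.8-hadoop1.jar -P ~/Downloads/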

Simply copy the jarfile into your hadoop/lib directory, or into $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

 cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/ 

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you are running 0.20.x:

 export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar 

Then, you will most likely want to use service-account "keyfile" authentication, since you are on an on-premise Hadoop cluster. Go to cloud.google.com/console, find APIs & auth in the left-hand menu, and click Credentials. If you don't yet have one, click Create new Client ID, select Service account before clicking Create Client ID, and, since the connector currently requires a ".p12"-type keypair, click Generate new P12 key and keep track of the downloaded .p12 file. It may be convenient to rename it before placing it in a directory that is more easily accessible from Hadoop, for example:

 cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12 
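If you prefer the command line, a minimal sketch with a recent gcloud CLI is below; this is a newer alternative to the console flow described above, not something from the original answer, and the service-account name, project ID, and key path are placeholders:

 # Assumption: gcloud is installed and authenticated against your project.
 # "hadoop-gcs" and the project ID below are placeholder names for this sketch.
 gcloud iam service-accounts create hadoop-gcs --display-name="Hadoop GCS access"
 gcloud iam service-accounts keys create /path/to/hadoop/conf/gcskey.p12 \
     --iam-account=hadoop-gcs@your-ascii-google-project-id.iam.gserviceaccount.com \
     --key-file-type=p12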

Then add the following entries to your Hadoop core-site.xml file:

 <property>
   <name>fs.gs.impl</name>
   <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
 </property>
 <property>
   <name>fs.gs.project.id</name>
   <value>your-ascii-google-project-id</value>
 </property>
 <property>
   <name>fs.gs.system.bucket</name>
   <value>some-bucket-your-project-owns</value>
 </property>
 <property>
   <name>fs.gs.working.dir</name>
   <value>/</value>
 </property>
 <property>
   <name>fs.gs.auth.service.account.enable</name>
   <value>true</value>
 </property>
 <property>
   <name>fs.gs.auth.service.account.email</name>
   <value>your-service-account-email@developer.gserviceaccount.com</value>
 </property>
 <property>
   <name>fs.gs.auth.service.account.keyfile</name>
   <value>/path/to/hadoop/conf/gcskey.p12</value>
 </property>

Generally, the fs.gs.system.bucket will not be used except in some cases for mapred temp files, so you may just want to create a new one-off bucket for it. With those settings in place on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all of the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
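Spelled out as commands, that check and first copy look roughly like this (the hostname, port, and bucket names are the same placeholders used above):

 # Sanity check: list a bucket through the connector using the new gs:// scheme.
 hadoop fs -ls gs://the-bucket-you-want-to-list

 # Simple single-client copy from HDFS into GCS; good enough for a first test.
 hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket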

If you want to speed things up with Hadoop's distcp, sync lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all of your Hadoop nodes and everything should work as expected. Note that there is no need to restart the datanodes or namenodes.
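A distcp run might then look something like the following; the namenode host and port, the paths, and the -m mapper count are placeholders and illustrative values, not settings from the answer:

 # Run the copy as a distributed MapReduce job instead of a single client process.
 # -m limits the number of parallel map tasks; tune it to your cluster and bandwidth.
 hadoop distcp -m 50 \
     hdfs://yourhost:yourport/allyourdata \
     gs://your-bucket/allyourdata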

Question 2: While the GCS connector for Hadoop can copy directly from HDFS without needing an extra disk buffer, GSUtil cannot, because it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 files.
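So a GSUtil-only path would mean staging the data on a local filesystem first, roughly along these lines (the paths and bucket names are placeholders, and the staging disk has to be large enough to hold the data):

 # Step 1: pull the files out of HDFS onto the local disk of a machine that has GSUtil.
 hadoop fs -copyToLocal hdfs://yourhost:yourport/allyourdata /local/staging/allyourdata

 # Step 2: upload the staged files to GCS; -m enables parallel uploads, -r recurses.
 gsutil -m cp -r /local/staging/allyourdata gs://your-bucket/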

Question 3: The advantage of using the Java API is flexibility: you can choose how to handle errors, retries, buffer sizes, and so on, but it takes more work and planning. Using GSUtil is good for quick use cases, and you inherit a lot of error handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it is all open source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java

+13

It looks like the property names have changed in recent versions of the connector; the fs.gs.auth.service.account.* keys shown above are now google.cloud.auth.service.account.*:

 // hadoopConfiguration is an org.apache.hadoop.conf.Configuration instance.
 String serviceAccount = "service-account@test.gserviceaccount.com";
 String keyfile = "/path/to/local/keyfile.p12";

 // Configuration.set() takes String values, so use setBoolean() for the enable flag.
 hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true);
 hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
 hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);

+2
