Can we load a Parquet file directly into Hive?

I know we can load a Parquet file using Spark SQL and using Impala, but I wonder if we can do the same with Hive. I have read a lot of articles, but I am still confused.

Simply put, I have a Parquet file - say, users.parquet. Now I am stuck on how to load / insert / import the data from users.parquet into Hive (obviously into a table).

Please advise or point me in the right direction if I am missing something obvious.

Creating a hive table using parquet file metadata

https://phdata.io/examples-using-textfile-and-parquet-with-hive-and-impala/

4 answers

Get the schema of the Parquet file using parquet-tools; for more information, check the link http://kitesdk.org/docs/0.17.1/labs/4-using-parquet-tools-solution.html

Then build a table using the schema at the top of the file; for more information, see Create Hive table to read parquet files from parquet/avro schema.
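To make that concrete, here is a minimal sketch of the workflow, assuming parquet-tools is installed and using hypothetical paths and columns (your schema will differ):

     # 1) Dump the schema embedded in the Parquet file.
     parquet-tools schema /tmp/users.parquet

     # 2) Translate that schema into Hive DDL by hand and create an external
     #    table pointing at the HDFS directory that holds the Parquet file(s).
     hdfs dfs -mkdir -p /data/users
     hdfs dfs -put /tmp/users.parquet /data/users/
     hive -e "
       CREATE EXTERNAL TABLE users (
         id    BIGINT,
         name  STRING,
         email STRING
       )
       STORED AS PARQUET
       LOCATION '/data/users';
     "

After that, a quick select * from users limit 5; in Hive should read rows straight from the Parquet file.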


Getting the schema is crucial, since you will need to create a table with the matching schema in Hive first and then point it at the Parquet files.

I had a similar problem when the data was in one virtual machine and I had to move it to another. Here is my walkthrough:

  • Find out about the original Parquet files (location and schema): describe formatted users; and show create table users; The latter will immediately give you the schema and also point you to the HDFS location, hdfs://hostname:port/apps/hive/warehouse/users

  • Find out about the table's partitioning: show partitions users;

  • Copy the table's Parquet files from HDFS to a local directory

     hdfs dfs -copyToLocal /apps/hive/warehouse/users 
  • Move them to another cluster / virtual machine or wherever you want them to go

  • Create the users table in your target Hive using the same schema

     CREATE TABLE users ( name string, ... ) PARTITIONED BY (...) STORED AS PARQUET; 
  • Now move the Parquet files to the appropriate folder (if necessary, find out about the location of the table you just created)

     hdfs dfs -ls /apps/hive/warehouse/users/
     hdfs dfs -copyFromLocal ../temp/* /apps/hive/warehouse/
  • For each partition, you need to point Hive at the corresponding subdirectory: alter table users add partition (sign_up_date='19991231') location '/apps/hive/warehouse/users/sign_up_date=19991231'; (you can do this with a bash script; a sketch follows this list)
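A rough sketch of such a bash script, assuming the sign_up_date=YYYYMMDD partition layout used in this walkthrough (table name and warehouse path are the example's, adjust them to yours):

     # Register every partition directory found under the table's warehouse path.
     TABLE_DIR=/apps/hive/warehouse/users
     for part in $(hdfs dfs -ls "$TABLE_DIR" | grep 'sign_up_date=' | awk -F/ '{print $NF}'); do
       value=${part#sign_up_date=}
       hive -e "ALTER TABLE users ADD IF NOT EXISTS PARTITION (sign_up_date='${value}')
                LOCATION '${TABLE_DIR}/${part}';"
     done

On recent Hive versions you can often skip the loop entirely and let Hive discover directories that follow the key=value naming convention with msck repair table users;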

It worked for me, hope it helps.


I don't know if this is a bit "hacky", but I use Zeppelin (it ships with Ambari). You can simply do the following in combination with spark2:

     %spark2
     import org.apache.spark.sql.SaveMode

     var df = spark.read.parquet("hdfs:///my_parquet_files/*.parquet")
     df.write.mode(SaveMode.Overwrite).saveAsTable("imported_table")

The advantage of this method is that you can also import many Parquet files, even if they have different schemas.
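If you are not using Zeppelin, roughly the same idea can be run from a shell with the spark-sql CLI; this is only a sketch, assuming Spark is built with Hive support and reusing the hdfs:///my_parquet_files/ path from the snippet above:

     # Create a Hive table directly from the Parquet files (path and table name are illustrative).
     spark-sql -e 'CREATE TABLE imported_table USING PARQUET AS SELECT * FROM parquet.`hdfs:///my_parquet_files/`'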


You can try this... Export/Import works for all file types in Hive, including Parquet. This is the general concept; you can tweak it a little based on your requirements, for example loading from a local machine (or) another cluster.

Note: when performing the steps individually, you can hard-code the values instead of using the $ variables, and you can also pass the "HDFS path", "schema" and "table name" as parameters when running it from a script. That way you can export / import an unlimited number of tables just by passing parameters.

  • Step 1: hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';" # -- Run it from HDFS.
  • Step 2: # -- The export contains both data and metadata. Zip it and scp it to the target cluster.
  • Step 3: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The first import throws an error, since the table does not exist, but it automatically creates the table.
  • Step 4: hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" # -- The second import imports the data without any error, since the table is now available.
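A rough sketch of how such a parameterized script might look (it lists the source-cluster and target-cluster steps in one file purely for illustration; in practice Steps 1-2 run on the source and Steps 3-4 on the target):

     #!/bin/bash
     # Usage: ./export_import.sh <schema> <table> <hdfs_data_path>
     schema_file1=$1
     tbl_file1=$2
     HDFS_DATA_PATH=$3

     # Source cluster: export the table (data + metadata) into HDFS.
     hive -S -e "export table $schema_file1.$tbl_file1 to '$HDFS_DATA_PATH/$tbl_file1';"

     # ...copy $HDFS_DATA_PATH/$tbl_file1 over to the target cluster (zip + scp, distcp, ...)...

     # Target cluster: run the import twice; per the steps above, the first run
     # creates the table (and may error out), the second one loads the data cleanly.
     hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';" || true
     hive -S -e "import table $schema_file1.$tbl_file1 from '$HDFS_DATA_PATH/$tbl_file1';"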

thanks

Kumar

