Getting NULL after creating an external table in Hive using the parquet file as storage

I created an external table in Hive using a Parquet file as storage:

 hive> CREATE EXTERNAL TABLE test_data (
         c1 string, c2 int, c3 string, c4 string, c5 string, c6 float,
         c7 string, c8 string, c9 string, c10 string, c11 string, c12 string)
       ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
       STORED AS
         INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
         OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
       LOCATION '/path/test_data/';

Selecting from this table returns NULL for every row and column:

 hive> SELECT * FROM test_data;
 OK
 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
 ...
 NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
 Time taken: 0.191 seconds, Fetched: 34 row(s)

The Parquet file was produced with Pig by converting a tab-delimited file, using the following sequence:

 grunt> A = LOAD '/path/test.data' USING PigStorage('\t') AS (
          c1: chararray, c2: int, c3: chararray, c4: chararray,
          c5: chararray, c6: float, c7: chararray, c8: chararray,
          c9: chararray, c10: chararray, c11: chararray, c12: chararray );
 grunt> STORE A INTO '/path/test_data' USING parquet.pig.ParquetStorer;

To verify that the Parquet file contains valid data, I read it back:

 grunt> B = LOAD '/path/test_data' USING parquet.pig.ParquetLoader;
 grunt> DUMP B;
 (19,14370,rs6054257,G,A,29.0,PASS,NS=3;DP=14;AF=0.5;DB;H2,GT:GQ:DP:HQ,0|0:48:1:51,51,1|0:48:8:51,51,1/1:43:5:.,.)
 (20,17330,.,T,A,3.0,q10,NS=3;DP=11;AF=0.017,GT:GQ:DP:HQ,0|0:49:3:58,50,0|1:3:5:65,3,0/0:41:3)
 (20,1110696,rs6040355,A,G,T,67.0,PASS,NS=2;DP=10;AF=0.333,0.667;AA=T;DB,GT:GQ:DP:HQ,1|2:21:6:23,27,2|1:2:0:18,2,2/2:35:4)
 (20,1230237,.,T,.,47.0,PASS,NS=3;DP=13;AA=T,GT:GQ:DP:HQ,0|0:54:7:56,60,0|0:48:4:51,51,0/0:61:2)
 (20,1234567,microsat1,GTC,G,GTCTC,50.0,PASS,NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/2:17:2,1/1:40:3)
 (20,2234567,.,C,[13:123457[ACGC,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,1/1:40:3)
 (20,2234568,.,C,.TC,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,1/1:40:3)
 (20,2234569,.,C,CT.,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,1/1:40:3)
 (20,3234569,.,C,<INV>,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,1/1:40:3)
 (20,4234569,.,N,.[13:123457[,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,./.:40:3)
 (20,5234569,.,N,[13:123457[.,50.0,PASS,SVTYPE=BND;NS=3;DP=9;AA=G,GT:GQ:DP,0/1:35:4,0/1:17:2,1/1:40:3)
 (Y,17330,.,T,A,3.0,q10,NS=3;DP=11;AF=0.017,GT:GL,0:0,49,0:0,3,1:41,0)

What am I doing wrong?

+4
4 answers

In my case, it turned out that Hive is sensitive to column names.

Because my Parquet file was written from a Spark DataFrame, I had to use exactly the same column names in the Hive table as in the original DataFrame.

When I used generic column names such as c1, I got NULL for every value in that column.
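For illustration (the column names here are hypothetical), if the DataFrame was saved with columns named chrom and pos, the Hive DDL must repeat those exact names, because the Parquet SerDe matches columns by name rather than by position:

 -- Sketch only: chrom/pos stand in for whatever names the file actually has.
 CREATE EXTERNAL TABLE test_data (chrom string, pos int)
 ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
 STORED AS
   INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
   OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
 LOCATION '/path/test_data/';

Declaring the columns as c1, c2, ... instead fails silently: no Parquet column with those names exists, so every lookup resolves to NULL.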

+6

In my case, the HDFS file was comma-delimited, but I had not specified the row format delimiters in the CREATE command:

 CREATE EXTERNAL TABLE TABLENAME(COL1 INT, COL2 STRING....)
 LOCATION '/user/....';

This created a table with the right number of rows, but every value was NULL. I changed the command as follows and it worked like a charm:

 CREATE EXTERNAL TABLE TABLENAME (COL1 INT, COL2 STRING....)
 ROW FORMAT DELIMITED
   FIELDS TERMINATED BY ','
   LINES TERMINATED BY '\n'
 LOCATION '/user/.....';
+4

I also ran into similar problems.

  • The root cause of this problem is a schema mismatch. First check the schema of the Parquet file you are using. If it was produced by another process, the file carries its own metadata; open it and look at the column names and types. Creating the table to match that schema will help a lot!
  • Also check the compression format (Snappy, GZIP) and declare it, e.g. TBLPROPERTIES ('PARQUET.COMPRESS'='GZIP').
  • I ran into a similar problem when I had to use a Parquet file saved from a Spark DataFrame. The column names were different from the table I created, and I got all NULLs.
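Putting those points together, a sketch of a DDL that both follows the file's own schema and declares the compression property (the column list shown is a placeholder; it must mirror whatever names and types the file's metadata reports):

 -- Sketch only: replace the column list with the one from the file's footer.
 CREATE EXTERNAL TABLE test_data (
   c1 string, c2 int /* ...remaining columns per the file's schema... */)
 ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
 STORED AS
   INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
   OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
 LOCATION '/path/test_data/'
 TBLPROPERTIES ('PARQUET.COMPRESS'='GZIP');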
+2

Try

 CREATE EXTERNAL TABLE test_data (
   c1 string, c2 int, c3 string, c4 string, c5 string, c6 float,
   c7 string, c8 string, c9 string, c10 string, c11 string, c12 string)
 ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
 WITH SERDEPROPERTIES ("CaseSensitive" = "c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12")
 STORED AS
   INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
   OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
 LOCATION '/path/test_data/';

0

Source: https://habr.com/ru/post/1496967/
