Word Count in Hive

I'm trying to learn Hive. Surprisingly, I can't find an example of how to write a simple word count job. Is the following correct?

Say I have an input file input.tsv:

    hello, world
    this is an example input file

I am creating a splitter in Python to turn each line into words:

    import sys

    # Emit one word per output line for every input line read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(word)

And then I have the following in my Hive script:

    CREATE TABLE input (line STRING);
    LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

    -- temporary table to hold words...
    CREATE TABLE words (word STRING);

    add file splitter.py;

    INSERT OVERWRITE TABLE words
      SELECT TRANSFORM(line) USING 'python splitter.py' AS word
      FROM input;

    SELECT word, count(*) AS count FROM words GROUP BY word;

I'm not sure whether I'm missing something or whether it really is this complicated. (In particular, do I need the temporary words table, and do I need to write an external splitter function?)

+6
3 answers

If you want something simple, take a look at the following:

    SELECT word, COUNT(*)
    FROM input
    LATERAL VIEW explode(split(line, ' ')) lTable AS word
    GROUP BY word;

I use a lateral view to enable the table-valued function (explode), which takes the list that comes out of the split function and outputs a new row for each value. In practice, I use a UDF that wraps the IBM ICU4J word breaker. I generally avoid transform scripts and use UDFs for everything. You do not need a temporary words table.
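
To make the row-level behaviour concrete, here is a plain-Python analogy of what the query computes. It is only a sketch: it does not run Hive, the sample rows are assumed from the question's input.tsv (taken here as two lines), and the helper names split_row and explode are made up for illustration.

```python
from collections import Counter

# Assumed contents of the `input` table: one string per row.
rows = ["hello, world", "this is an example input file"]


def split_row(line, sep=" "):
    """Stand-in for Hive's split(): one row -> one list of tokens."""
    return line.split(sep)


def explode(lists):
    """Stand-in for LATERAL VIEW explode(): one output row per list element."""
    for tokens in lists:
        for token in tokens:
            yield token


# GROUP BY word with COUNT(*) becomes a Counter over the exploded rows.
word_counts = Counter(explode(split_row(line) for line in rows))

for word, count in sorted(word_counts.items()):
    print(word, count)
```

Note that splitting on a space leaves punctuation attached, so "hello," is counted as its own word. That is one reason a proper word breaker can be worth the effort.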

+12
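    CREATE TABLE docs (line STRING);
    LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

    -- The backslash is doubled because Hive unescapes string literals:
    -- '\s' would match the letter s, '\\s' matches whitespace.
    CREATE TABLE word_counts AS
    SELECT word, count(1) AS count
    FROM (SELECT explode(split(line, '\\s+')) AS word FROM docs) w
    GROUP BY word
    ORDER BY word;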
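
One detail worth checking in a query like this is how the whitespace pattern treats runs of spaces. The sketch below uses Python's re module as a stand-in for the Java regular expressions that Hive's split relies on; that the two behave the same for these simple patterns is an assumption, and the sample line is invented.

```python
import re

line = "hello,  world"  # note the two consecutive spaces

# Pattern matching exactly one whitespace character: the gap between the
# two spaces becomes an empty token.
single = re.split(r"\s", line)

# Pattern matching a run of whitespace: no empty token appears.
run = re.split(r"\s+", line)

print(single)
print(run)
```

In a word count, empty tokens like that may end up as a group of their own unless they are filtered out.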
+2

You can use Hive's built-in UDFs as follows:

1) Step 1: Create a temporary table with a single column named sentence, of an array data type

    create table temp as
    select sentence
    from docs
    lateral view explode(sentences(lcase(line))) ltable as sentence

2) Step 2: Select the words from the temp table, again exploding the sentence column

    select words, count(words) CntWords
    from (select explode(sentence) words from temp) i
    group by words
    order by CntWords desc
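
The two steps can be pictured as flattening a nested list twice. The Python sketch below is an analogy only: the nested list is written by hand to stand in for the kind of array-of-arrays value that sentences() returns, it is not actual Hive output, and the variable names are illustrative.

```python
from collections import Counter

# Hand-written stand-in for sentences(lcase(line)) on one row:
# an array of sentences, each an array of lower-cased words.
sentences_per_row = [
    [["hello", "world"], ["this", "is", "an", "example"]],
]

# Step 1: explode once, so each row of `temp` holds one sentence (a word list).
temp = [sentence for row in sentences_per_row for sentence in row]

# Step 2: explode again to get one word per row, then count per word,
# ordered by count descending like ORDER BY CntWords DESC.
words = [word for sentence in temp for word in sentence]
cnt_words = Counter(words).most_common()

for word, count in cnt_words:
    print(word, count)
```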
+1

Source: https://habr.com/ru/post/912568/

