I have the following dataset for a movie database:
Ratings: UserID, MovieID, Rating
Movies: MovieID, Genre
Users: UserID, Gender, Age
I wrote a Pig script to find the female users aged 20 to 30 who rated the movie with the highest average rating among Action and War movies. Below is the code I have so far:
users_input = load '/users.dat' USING PigStorage('\u003B') as (UserID: long, gender: chararray, age: int, occupation: int, zip: long);
movies_input = load '/movies.dat' USING PigStorage('\u003B') as (MovieID: long, title: chararray, genre: chararray);
ratings_input = load '/ratings.dat' USING PigStorage('\u003B') as (UserID: long, MovieID: long, rating: int, timestamp: chararray);
movie_filter = filter movies_input by (genre matches '.*Action.*') OR (genre matches '.*War.*');
temp = COGROUP movie_filter by MovieID, ratings_input by MovieID;
temp1 = FILTER temp BY COUNT(movie_filter) > 0;
temp2 = FOREACH temp1 GENERATE group, AVG(ratings_input.rating) AS ratings;
temp3 = ORDER temp2 BY ratings DESC;
temp4 = LIMIT temp3 1;
temp5 = FOREACH temp4 GENERATE ratings;
temp6 = FILTER temp3 BY (temp5.ratings == ratings);
female_users = filter users_input by gender == 'F';
age_users = filter female_users by age >=20 AND age <=30;
age_use = FOREACH age_users GENERATE UserID;
MovID = FOREACH temp6 GENERATE group;
all_users_records = FILTER ratings_input BY (MovID.group == MovieID);
all_users = FOREACH all_users_records GENERATE UserID;
female_aged_records = FILTER all_users BY (UserID == age_use.UserID);
female_aged_users = FOREACH female_aged_records GENERATE UserID;
store all_users into '/output_pig' using PigStorage();
When I run this, I get the error: "Scalar has more than one line in the output. 1st: (11), 2nd: (24)"
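My guess at the cause: expressions such as MovID.group and age_use.UserID make Pig treat a whole relation as a scalar, which only works when that relation has exactly one row. temp6 can hold several movies tied for the top average (the (11) and (24) in the message may be two such MovieIDs, though that is an assumption), and age_use certainly has many rows. Below is a minimal, untested sketch of the same steps written with JOIN instead. It reuses the aliases defined above; top_ratings and female_aged_ids are names introduced only for this sketch.

```pig
-- Keep only the ratings of the top-rated movie(s); a JOIN tolerates ties,
-- whereas the scalar projection MovID.group requires exactly one row.
-- $0 is the single column of MovID (the field named 'group').
top_ratings = JOIN ratings_input BY MovieID, MovID BY $0;

-- Keep only raters who are female and aged 20 to 30.
female_aged_records = JOIN top_ratings BY ratings_input::UserID, age_use BY UserID;

-- Project the user id and drop duplicates (a user may have rated several tied movies).
female_aged_ids = FOREACH female_aged_records GENERATE age_use::UserID AS UserID;
female_aged_users = DISTINCT female_aged_ids;

-- Store the final relation rather than all_users.
store female_aged_users into '/output_pig' using PigStorage();
```

These statements would replace everything from all_users_records onward. temp5.ratings should be safe to leave as a scalar, since LIMIT 1 guarantees at most one row there.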
Can anyone help me out? Thanks in advance.