当前位置: 动力学知识库 > 问答 > 编程问答 >

apache pig - Count distinct values in a group using pig

问题描述:

My problem in a general sense is that I'd like to group my data and then count the uniq values for a field.

Specifically, for the data below, I want to group by 'category' and 'year' and then count the uniq values for 'food'.

category,id,mydate,mystore,food

catA,myid_1,2014-03-11 13:13:13,store1,apple

catA,myid_2,2014-03-11 12:12:12,store1,milk

catA,myid_3,2014-08-11 10:13:13,store1,apple

catA,myid_4,2014-09-11 09:12:12,store1,milk

catA,myid_5,2015-09-01 10:10:10,store1,milk

catB,myid_6,2014-03-12 03:03:03,store2,milk

catB,myid_7,2014-03-12 05:55:55,store2,apple

This is as far as I can get, which is just picking out the values and using some of the neat pig date functions:

a = load '$input' using PigStorage(',') as (category:chararray,id:chararray,mydate:chararray,mystore:chararray,food:chararray);

b = foreach a generate category, id, ToDate(mydate,'yyyy-MM-dd HH:mm:ss') as myDt:DateTime, mystore,food;

c = foreach b generate category, GetYear(myDt) as year:int, mystore,food;

dump c;

The output from the alias 'c' is:

(catA,2014,store1,apple)

(catA,2014,store1,milk)

(catA,2014,store1,apple)

(catA,2014,store1,milk)

(catA,2015,store1,milk)

(catB,2014,store2,milk)

(catB,2014,store2,apple)

I want in the end:

catA, 2014, {(apple, 2), (milk, 2)}

catA, 2015, {(milk, 1)}

catB, 2014, {(apple, 1), (milk, 1)}

I've seen some example of generating value counts, but grouping by category and year is tripping me up.

网友答案:

Input:

category,id,mydate,mystore,food

catA,myid_1,2014-03-11 13:13:13,store1,apple
catA,myid_2,2014-03-11 12:12:12,store1,milk
catA,myid_3,2014-08-11 10:13:13,store1,apple
catA,myid_4,2014-09-11 09:12:12,store1,milk
catA,myid_5,2015-09-01 10:10:10,store1,milk
catB,myid_6,2014-03-12 03:03:03,store2,milk
catB,myid_7,2014-03-12 05:55:55,store2,apple

Yes, You can use nested FOREACH after your grouping, In that nested FOREACH you can apply Distinct for foods and then you can count that .

The below code will help you

Pig Script:

list = LOAD 'user/cloudera/apple.txt' USING PigStorage(',') AS(category:chararray,id:chararray,mydate:chararray,my_store:chararray,food:chararray);

list_each = FOREACH list GENERATE category,SUBSTRING(mydate,0,4) as my_year, my_store, food;

list_grp = GROUP list_each BY (category,my_year);

list_nested_each = FOREACH list_grp

                            {
                               list_inner_each = FOREACH list_each GENERATE food;
                               list_inner_dist = DISTINCT list_inner_each;

                             GENERATE flatten(group) as (catgeory,my_year), COUNT(list_inner_dist) as no_of_uniq_foods;

                            };

dump list_nested_each;

Output:

(catA,2014,2)
(catA,2015,1)
(catB,2014,2)
网友答案:

Appending to the code in the question:

d = group c by (category, year, food);
e = foreach d generate FLATTEN(group), COUNT(c) as count;

will produce:

(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)

The key is to group by 'food' as well. Interesting. Any other insight is welcomed.

分享给朋友:
您可能感兴趣的文章:
随机阅读: