SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Running Count Distinct by Group

Accepted Solution Solved
Reply
Contributor
Posts: 20
Accepted Solution

Running Count Distinct by Group

Hi All,

 

I am working on a panel data. Multiple individuals view different videos. Does any one know how I can create running distinct count of videos viewed within each group (peron_id) ? At each pont of time, I would like to see how many distinct videos the person has watched. Obs_id is created based on time order of viewing events. My desired table looks like below:

 

person_id   video_id  running_count    obs_id

1002           501          1                        1

1002           502          2                        2

1002           502          2                        3

1002           502          2                        4

1006           205          1                        1

1006           205          1

1006           209          2

 

So far, I have tried this:

proc sql;
create table final3 as
SELECT person_id,
video_id,
count(distinct video_id) OVER
(ORDER BY person_id)
AS running_total
FROM final2;

 

but error message pops up for the use of "OVER" here. 

Any ideas how I can proceed?

 


Accepted Solutions
Solution
‎04-01-2016 04:08 PM
Super User
Posts: 19,770

Re: Running Count Distinct by Group

Posted in reply to fereshteh

Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works. 

 

data have;
input person_id video_id $;
cards;
1002           501 
1002           502 
1002           502 
1002           502 
1006           209 
1006           205 
1006           209 
;
run;

data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;

retain videos_watched count;

if first.person_id then do;
call missing(videos_watched);
count=0;
end;

if first.video_id then do;
	if index(videos_watched, trim(video_id))=0 then do;
		count+1;
		videos_watched=catx(', ', videos_watched, video_id);
	end;
end;

run;

proc print; run;

View solution in original post


All Replies
Super User
Posts: 7,758

Re: Running Count Distinct by Group

[ Edited ]
Posted in reply to fereshteh
proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;

proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 20

Re: Running Count Distinct by Group

Posted in reply to KurtBremser
Tnx but this does not work. I tried this code and it gives me the number of observations per person.
Super User
Posts: 7,758

Re: Running Count Distinct by Group

Posted in reply to fereshteh

No:

data have;
input person_id video_id;
cards;
1002           501          1                        1
1002           502          2                        2
1002           502          2                        3
1002           502          2                        4
1006           205          1                        1
1006           205          1
1006           209          2
;
run;

proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;

proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;

proc print;
run;

Gives this output, which only counts distinct video ID's per person:

                                                             person_
                                                      Obs       id      _FREQ_

                                                       1       1002        2  
                                                       2       1006        2  

The count of observations per person would be 4 for 1002 and 3 for 1006

 

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 20

Re: Running Count Distinct by Group

Posted in reply to KurtBremser

This neglects the "at each point of time" and counts distincts videos watched at the latest time for each person. But I was trying to get distinct counts at each point of time. Maybe I was not clear. In my description I wrote point of time is replcaed by Obs_id. For example, at time=1 (obs_id=1) for the first person, distinct videos watched is one. 

Super User
Posts: 5,424

Re: Running Count Distinct by Group

Posted in reply to fereshteh

That's not valid SAS syntax.

And running counts is not really a SQL thing. SQL is column based. If the data is sorted, try a data step instead.

Data never sleeps
Contributor
Posts: 20

Re: Running Count Distinct by Group

I do not get your point. Can you be more clear on how?
Super User
Super User
Posts: 7,942

Re: Running Count Distinct by Group

Posted in reply to fereshteh

Please post test data in the form of a datastep.

Try something like (not tested):

proc sort data=have;
  by person_id video_id;
run;
data want;
  set have;
  by person_id video_id;
  retain running_count;
  if first.video_id then running_count=1;
  else running_count=running_count+1;
run;

 

Contributor
Posts: 20

Re: Running Count Distinct by Group

Ok. Unfortunately, I cannot sort by video_id. That's not what I want to create. I should sort first by person_id then by time (obs_id). I already wrote sth similar to your code but that's another variable of interest.

Super User
Super User
Posts: 7,942

Re: Running Count Distinct by Group

Posted in reply to fereshteh

Ok, thats not clear from your data - labelling a variable obs_id is not a good idea - lable it something pertaining to the data it contains.  The same principal is there regarless of what variables are used:

1) Sort the data so the data appears firt in the highest group, then the next group etc. for each group, and then data point count order.

2) Step through data retaining a count at each record, resetting when you get to the first occurence of a new group.

proc sort data=have;
  by person_id obs_id video_id;
run;
data want;
  set have;
  by person_id obs_id video_id;
  retain running_count;
  if first.video_id then running_count=1;
  else running_count=running_count+1;
run;

Or the same thing can be achieved in SQL with group by person_id, obs_id, video_id.  It is really the grouping that matters.

Contributor
Posts: 20

Re: Running Count Distinct by Group

Oh this gives me a lot of ones for the running count because at each row when it approaches a new video_id it resets the running count to zero. I will post a sample data through data step.

Super User
Posts: 19,770

Re: Running Count Distinct by Group

Posted in reply to fereshteh

Why can't you sort by video_id?

 

Sort it to calculate the count and then resort to what you want.

Contributor
Posts: 20

Re: Running Count Distinct by Group

because that does not give me the variable I would like to create. I would like to know counts of distinct video views AT EACH POINT OF TIME for each person. Data should be sorted by: person_id viewing_date video_id
Then I would like to count distinct views up to each viewing date (=time=obs_id)
Super User
Posts: 19,770

Re: Running Count Distinct by Group

Posted in reply to fereshteh

How big is your dataset, how many potential videos are watched, do you have a max? At each point you're going to have search backwards for the video...I would create a string that stored the video id's retain it across the id. If the id is found then don't increment, otherwise increment and add to list.  Brute force but would work....I'll post some code in a bit. 

Contributor
Posts: 20

Re: Running Count Distinct by Group

I have around 1.7 million rows in the panel. each person watches 20 videos say, on average. tnx!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 29 replies
  • 1568 views
  • 12 likes
  • 6 in conversation