04-01-2016 04:51 AM
I am working on a panel data. Multiple individuals view different videos. Does any one know how I can create running distinct count of videos viewed within each group (peron_id) ? At each pont of time, I would like to see how many distinct videos the person has watched. Obs_id is created based on time order of viewing events. My desired table looks like below:
person_id video_id running_count obs_id
1002 501 1 1
1002 502 2 2
1002 502 2 3
1002 502 2 4
1006 205 1 1
1006 205 1
1006 209 2
So far, I have tried this:
create table final3 as
count(distinct video_id) OVER
(ORDER BY person_id)
but error message pops up for the use of "OVER" here.
Any ideas how I can proceed?
04-01-2016 03:43 PM
Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works.
data have; input person_id video_id $; cards; 1002 501 1002 502 1002 502 1002 502 1006 209 1006 205 1006 209 ; run; data want; length videos_watched $500.; set have; by person_Id video_id notsorted; retain videos_watched count; if first.person_id then do; call missing(videos_watched); count=0; end; if first.video_id then do; if index(videos_watched, trim(video_id))=0 then do; count+1; videos_watched=catx(', ', videos_watched, video_id); end; end; run; proc print; run;
04-01-2016 05:07 AM - edited 04-01-2016 05:08 AM
proc sort data=have (keep=person_id video_id) nodupkey; by person_id video_id; run; proc summary data=have; by person_id; output out=want (keep=person_id _freq_); run;
04-01-2016 05:18 AM
data have; input person_id video_id; cards; 1002 501 1 1 1002 502 2 2 1002 502 2 3 1002 502 2 4 1006 205 1 1 1006 205 1 1006 209 2 ; run; proc sort data=have (keep=person_id video_id) nodupkey; by person_id video_id; run; proc summary data=have; by person_id; output out=want (keep=person_id _freq_); run; proc print; run;
Gives this output, which only counts distinct video ID's per person:
person_ Obs id _FREQ_ 1 1002 2 2 1006 2
The count of observations per person would be 4 for 1002 and 3 for 1006
04-01-2016 05:24 AM
This neglects the "at each point of time" and counts distincts videos watched at the latest time for each person. But I was trying to get distinct counts at each point of time. Maybe I was not clear. In my description I wrote point of time is replcaed by Obs_id. For example, at time=1 (obs_id=1) for the first person, distinct videos watched is one.
04-01-2016 05:11 AM
That's not valid SAS syntax.
And running counts is not really a SQL thing. SQL is column based. If the data is sorted, try a data step instead.
04-01-2016 05:20 AM
Please post test data in the form of a datastep.
Try something like (not tested):
proc sort data=have; by person_id video_id; run; data want; set have; by person_id video_id; retain running_count; if first.video_id then running_count=1; else running_count=running_count+1; run;
04-01-2016 05:35 AM
Ok. Unfortunately, I cannot sort by video_id. That's not what I want to create. I should sort first by person_id then by time (obs_id). I already wrote sth similar to your code but that's another variable of interest.
04-01-2016 05:45 AM
Ok, thats not clear from your data - labelling a variable obs_id is not a good idea - lable it something pertaining to the data it contains. The same principal is there regarless of what variables are used:
1) Sort the data so the data appears firt in the highest group, then the next group etc. for each group, and then data point count order.
2) Step through data retaining a count at each record, resetting when you get to the first occurence of a new group.
proc sort data=have; by person_id obs_id video_id; run; data want; set have; by person_id obs_id video_id; retain running_count; if first.video_id then running_count=1; else running_count=running_count+1; run;
Or the same thing can be achieved in SQL with group by person_id, obs_id, video_id. It is really the grouping that matters.
04-01-2016 01:10 PM
Oh this gives me a lot of ones for the running count because at each row when it approaches a new video_id it resets the running count to zero. I will post a sample data through data step.
04-01-2016 01:17 PM
04-01-2016 01:24 PM
How big is your dataset, how many potential videos are watched, do you have a max? At each point you're going to have search backwards for the video...I would create a string that stored the video id's retain it across the id. If the id is found then don't increment, otherwise increment and add to list. Brute force but would work....I'll post some code in a bit.
Need further help from the community? Please ask a new question.