Hi All,
I am working on a panel data. Multiple individuals view different videos. Does any one know how I can create running distinct count of videos viewed within each group (peron_id) ? At each pont of time, I would like to see how many distinct videos the person has watched. Obs_id is created based on time order of viewing events. My desired table looks like below:
person_id video_id running_count obs_id
1002 501 1 1
1002 502 2 2
1002 502 2 3
1002 502 2 4
1006 205 1 1
1006 205 1
1006 209 2
So far, I have tried this:
proc sql;
create table final3 as
SELECT person_id,
video_id,
count(distinct video_id) OVER
(ORDER BY person_id)
AS running_total
FROM final2;
but error message pops up for the use of "OVER" here.
Any ideas how I can proceed?
Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works.
data have;
input person_id video_id $;
cards;
1002 501
1002 502
1002 502
1002 502
1006 209
1006 205
1006 209
;
run;
data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;
retain videos_watched count;
if first.person_id then do;
call missing(videos_watched);
count=0;
end;
if first.video_id then do;
if index(videos_watched, trim(video_id))=0 then do;
count+1;
videos_watched=catx(', ', videos_watched, video_id);
end;
end;
run;
proc print; run;
proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;
proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;
No:
data have;
input person_id video_id;
cards;
1002 501 1 1
1002 502 2 2
1002 502 2 3
1002 502 2 4
1006 205 1 1
1006 205 1
1006 209 2
;
run;
proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;
proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;
proc print;
run;
Gives this output, which only counts distinct video ID's per person:
person_ Obs id _FREQ_ 1 1002 2 2 1006 2
The count of observations per person would be 4 for 1002 and 3 for 1006
This neglects the "at each point of time" and counts distincts videos watched at the latest time for each person. But I was trying to get distinct counts at each point of time. Maybe I was not clear. In my description I wrote point of time is replcaed by Obs_id. For example, at time=1 (obs_id=1) for the first person, distinct videos watched is one.
That's not valid SAS syntax.
And running counts is not really a SQL thing. SQL is column based. If the data is sorted, try a data step instead.
Please post test data in the form of a datastep.
Try something like (not tested):
proc sort data=have; by person_id video_id; run; data want; set have; by person_id video_id; retain running_count; if first.video_id then running_count=1; else running_count=running_count+1; run;
Ok. Unfortunately, I cannot sort by video_id. That's not what I want to create. I should sort first by person_id then by time (obs_id). I already wrote sth similar to your code but that's another variable of interest.
Ok, thats not clear from your data - labelling a variable obs_id is not a good idea - lable it something pertaining to the data it contains. The same principal is there regarless of what variables are used:
1) Sort the data so the data appears firt in the highest group, then the next group etc. for each group, and then data point count order.
2) Step through data retaining a count at each record, resetting when you get to the first occurence of a new group.
proc sort data=have; by person_id obs_id video_id; run; data want; set have; by person_id obs_id video_id; retain running_count; if first.video_id then running_count=1; else running_count=running_count+1; run;
Or the same thing can be achieved in SQL with group by person_id, obs_id, video_id. It is really the grouping that matters.
Oh this gives me a lot of ones for the running count because at each row when it approaches a new video_id it resets the running count to zero. I will post a sample data through data step.
Why can't you sort by video_id?
Sort it to calculate the count and then resort to what you want.
How big is your dataset, how many potential videos are watched, do you have a max? At each point you're going to have search backwards for the video...I would create a string that stored the video id's retain it across the id. If the id is found then don't increment, otherwise increment and add to list. Brute force but would work....I'll post some code in a bit.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.