Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works.
data have;
input person_id video_id $;
cards;
1002 501
1002 502
1002 502
1002 502
1006 209
1006 205
1006 209
;
run;
data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;
retain videos_watched count;
if first.person_id then do;
call missing(videos_watched);
count=0;
end;
if first.video_id then do;
if index(videos_watched, trim(video_id))=0 then do;
count+1;
videos_watched=catx(', ', videos_watched, video_id);
end;
end;
run;
proc print; run;
This has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:
person_id video_id vidviewperson_cum distinctvideos_watched
1002 501 1 1
1002 502 2 1
1002 502 3 2
1002 502 4 3
1006 205 1 1
1006 205 2 2
1006 209 3 1
here is my final code and it works on 1.7 m observations:
data have;
input person_id video_id obs_id vidviewperson_cum $;
cards;
1002 501 1
1002 502 2
1002 502 3
1002 502 4
1006 205 1
1006 205 2
1006 209 3
;
run;
proc sort data=have;by person_id video_id obs_id ;run;
data want1;
set have;
vidviewperson_cum + 1;
by person_id video_id;
if first.video_id then vidviewperson_cum = 1;
run;
proc sort data=want1;by person_id obs_id ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;
This approved answer has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:
person_id video_id obs_id vidviewperson_cum distinctvideos_watched
1002 501 1 1 1
1002 502 2 1 2
1002 502 3 2 2
1002 502 4 3 2
1006 205 1 1 1
1006 205 2 2 1
1006 209 3 1 2
here is my final code and it works on 1.7 m observations:
data have;
input person_id video_id obs_id $;
cards;
1002 501 1
1002 502 2
1002 502 3
1002 502 4
1006 205 1
1006 205 2
1006 209 3
;
run;
proc sort data=have;by person_id video_id obs_id ;run;
data want1;
set have;
vidviewperson_cum + 1;
by person_id video_id;
if first.video_id then vidviewperson_cum = 1;
run;
proc sort data=want1;by person_id obs_id ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;
This is why I mentioned that your test data is important. Interestingly enough, your code does not work on the test data I created, where a person watches a video A, then video B, then rewatches video A.
But if it works for you, that's all that matters.
Just a note that you had mentioned that you couldn't sort by video_id, but you ended up doing that in your solution anyways....
It might be best to start over here...post some sample data, that reflects a few of the weird situations you may run into and I can try again. Situations that may come up include watching Video A, Video B, Video A, Video A. Watching only 1 video, Watching unique video's, multiple repeats aross different times Video A, Video B, Video A, Video B etc.
Unless your solution genuinely works for you then it doesn't matter 🙂
The get a running count of distinct videos viewed you need to have an indication of whether this record indicates a new video or not.
One way to do that is sort the data properly and create a flag variable with that information. Then you re sort the data and use the flag to generate the running count.
Here is your data with a small change on the 4th record so show that multiple videos could be viewed multiple times.
data have;
input person_id video_id want_running_count ;
if person_id ne lag(person_id) then obs_id=0;
obs_id + 1;
cards;
1002 501 1
1002 502 2
1002 502 2
1002 501 2
1006 205 1
1006 205 1
1006 209 2
;;;;
So first sort and process to make a flag to indicate when the video is first watched.
proc sort data=have out=want ;
by person_id video_id obs_id ;
run;
data want ;
set want ;
by person_id video_id obs_id;
first_view = first.video_id ;
run;
Then put back into the orignal order and create the running count.
proc sort;
by person_id obs_id;
run;
data want ;
set want;
by person_id obs_id;
if first.person_id then running_count=0;
running_count + first_view ;
run;
@Tom This ignores the at each point in time. So if you have a case that's the one in my data, (Video 205, 209, 205), it won't work correctly. But the answer given also doesn't match this criteria, which is why I think the OP should re-think their problem and figure out what they want first.
EDIT: I'm wrong, this solution and OP's should be fine.
@Reeza my code works on my data. I just double checked it for 1500 rows. Here's a real sequence observed:
data have;input person_id video_id obs_id want_running_count ; cards;
1079 132 1 1
1079 59 2 2
1079 135 3 3
1079 143 4 4
1079 303 5 5
1079 143 6 5
1079 133 7 6
1079 135 8 6
1079 265 9 7
1079 143 10 7
1079 464 11 8
1079 464 12 8
1079 759 13 9
1079 464 14 9
1079 891 15 10
;;
You could do the same thing with PROC SQL. That would avoid the explicit sorts, but probably not the actual sorts as PROC SQL will just do them for you.
proc sql ;
create table want as
select a.person_id,a.obs_id,a.video_id,sum(b.first_view) as running_total
from have a inner join
( select person_id,video_id,obs_id,obs_id=min(obs_id) as first_view
from have group by 1,2
) b
on a.person_id = b.person_id and a.obs_id >= b.obs_id
group by 1,2,3
order by 1,2,3
;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.