BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Reeza
Super User

Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works. 

 

data have;
input person_id video_id $;
cards;
1002           501 
1002           502 
1002           502 
1002           502 
1006           209 
1006           205 
1006           209 
;
run;

data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;

retain videos_watched count;

if first.person_id then do;
call missing(videos_watched);
count=0;
end;

if first.video_id then do;
	if index(videos_watched, trim(video_id))=0 then do;
		count+1;
		videos_watched=catx(', ', videos_watched, video_id);
	end;
end;

run;

proc print; run;
fereshteh
Obsidian | Level 7

This has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:

person_id   video_id      vidviewperson_cum    distinctvideos_watched

1002           501           1                    1
   
1002           502           2                    1
 
1002           502           3                    2

1002           502           4                    3

1006           205           1                    1

1006           205           2                    2
 
1006           209           3                    1

 

here is my final code and it works on 1.7 m observations:

data have;
input person_id video_id obs_id vidviewperson_cum $;
cards;
1002           501           1                

1002           502           2                 

1002           502           3                 

1002           502           4                 

1006           205           1                

1006           205           2               
 
1006           209           3                

;
run;


proc sort data=have;by person_id  video_id  obs_id ;run;
data want1;
  set have;
  vidviewperson_cum + 1;
  by person_id video_id;
  if first.video_id then  vidviewperson_cum = 1;
run;

proc sort data=want1;by person_id  obs_id  ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;

 
fereshteh
Obsidian | Level 7

This approved answer has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:

person_id       video_id   obs_id   vidviewperson_cum     distinctvideos_watched

1002               501           1                    1                                 1
1002               502           2                    1                                 2
1002               502           3                    2                                 2

1002               502           4                    3                                 2

1006               205           1                    1                                 1

1006               205           2                    2                                 1
1006               209           3                    1                                 2

 

here is my final code and it works on 1.7 m observations:

data have;
input person_id video_id obs_id $;
cards;
1002  501     1                
1002  502     2                 
1002  502     3                 
1002  502     4                 
1006  205     1                
1006  205     2               
1006  209     3                

;
run;
 
proc sort data=have;by person_id  video_id  obs_id ;run;
data want1;
  set have;
  vidviewperson_cum + 1;
  by person_id video_id;
  if first.video_id then  vidviewperson_cum = 1;
run;

proc sort data=want1;by person_id  obs_id  ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;

 

Reeza
Super User

This is why I mentioned that your test data is important. Interestingly enough, your code does not work on the test data I created, where a person watches a video A, then video B, then rewatches video A. 

 

But if it works for you, that's all that matters. 

 

 

Just a note that you had mentioned that you couldn't sort by video_id, but you ended up doing that in your solution anyways....

fereshteh
Obsidian | Level 7
Tnx I should not sort by video_id the way it was suggested. I mean there is no way to sort and create the variable directly. I did it because I created an auxiliary variable. I will try my code again on your data soon.
fereshteh
Obsidian | Level 7
Oh I ran your code again after adding obs_id column and it gave me different result. In your sample data there is no column called obs_id, but I do not understand why if we add another column it gets mixed up.
Reeza
Super User

It might be best to start over here...post some sample data, that reflects a few of the weird situations you may run into and I can try again. Situations that may come up include watching Video A, Video B, Video A, Video A. Watching only 1 video, Watching unique video's, multiple repeats aross different times Video A, Video B, Video A, Video B etc. 

 

Unless your solution genuinely works for you then it doesn't matter 🙂

fereshteh
Obsidian | Level 7
It does not solve my problem but I use this to create another variable
Tom
Super User Tom
Super User

The get a running count of distinct videos viewed you need to have an indication of whether this record indicates a new video or not.

One way to do that is sort the data properly and create a flag variable with that information. Then you re sort the data and use the flag to generate the running count.

Here is your data with a small change on the 4th record so show that multiple videos could be viewed multiple times.

data have;
 input person_id video_id want_running_count ;
 if person_id ne lag(person_id) then obs_id=0;
 obs_id + 1;
cards;
1002 501 1
1002 502 2
1002 502 2
1002 501 2
1006 205 1
1006 205 1
1006 209 2
;;;;

 So first sort and process to make a flag to indicate when the video is first watched.

proc sort data=have out=want ;
  by person_id video_id obs_id ;
run;
data want ;
 set want ;
 by person_id video_id obs_id;
 first_view = first.video_id ;
run;

Then put back into the orignal order and create the running count.

proc sort;
  by person_id obs_id;
run;
data want ;
 set want;
 by person_id obs_id;
 if first.person_id then running_count=0;
 running_count + first_view ;
run;

Capture.PNG

Reeza
Super User

@Tom This ignores the at each point in time. So if you have a case that's the one in my data, (Video 205, 209, 205), it won't work correctly. But the answer given also doesn't match this criteria, which is why I think the OP should re-think their problem and figure out what they want first. 

 

EDIT: I'm wrong, this solution and OP's should be fine. 

 

 

fereshteh
Obsidian | Level 7
@ Reeza I am checking my code again. Last night it looked as if it was ok on the first person_ids. But my code on your sample data does not work. So let me check it again and maybe I post portion of real sequence observed in data.
fereshteh
Obsidian | Level 7

@Reeza my code works on my data. I just double checked it for 1500 rows. Here's a real sequence observed:

 

data have;input person_id video_id obs_id want_running_count ; cards;
1079 132 1 1
1079 59 2 2
1079 135 3 3
1079 143 4 4
1079 303 5 5
1079 143 6 5
1079 133 7 6
1079 135 8 6
1079 265 9 7
1079 143 10 7
1079 464 11 8
1079 464 12 8
1079 759 13 9
1079 464 14 9
1079 891 15 10
;;

 

fereshteh
Obsidian | Level 7
This is kind of similar to my idea for solving this problem and it seems to work. Thank you.
Tom
Super User Tom
Super User

You could do the same thing with PROC SQL.  That would avoid the explicit sorts, but probably not the actual sorts as PROC SQL will just do them for you.

 

proc sql ;
create table want as
select a.person_id,a.obs_id,a.video_id,sum(b.first_view) as running_total
from have a inner join 
( select person_id,video_id,obs_id,obs_id=min(obs_id) as first_view
  from have group by 1,2
) b
on a.person_id = b.person_id and a.obs_id >= b.obs_id
group by 1,2,3
order by 1,2,3
;
fereshteh
Obsidian | Level 7
Cool ! I need to learn using sub-querries within a query

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 29 replies
  • 9332 views
  • 12 likes
  • 6 in conversation