Solved: Re: Running Count Distinct by Group - Page 2

Reeza · Posted 04-01-2016 03:43 PM

Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works.

data have;
input person_id video_id $;
cards;
1002           501 
1002           502 
1002           502 
1002           502 
1006           209 
1006           205 
1006           209 
;
run;

data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;

retain videos_watched count;

if first.person_id then do;
call missing(videos_watched);
count=0;
end;

if first.video_id then do;
	if index(videos_watched, trim(video_id))=0 then do;
		count+1;
		videos_watched=catx(', ', videos_watched, video_id);
	end;
end;

run;

proc print; run;

fereshteh · Posted 04-01-2016 05:41 PM

This has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:

person_id   video_id      vidviewperson_cum    distinctvideos_watched

1002           501           1                    1

1002           502           2                    1

1002           502           3                    2

1002           502           4                    3

1006           205           1                    1

1006           205           2                    2

1006           209           3                    1

here is my final code and it works on 1.7 m observations:

data have;
input person_id video_id obs_id vidviewperson_cum $;
cards;
1002           501           1                

1002           502           2                 

1002           502           3                 

1002           502           4                 

1006           205           1                

1006           205           2               
 
1006           209           3                

;
run;


proc sort data=have;by person_id  video_id  obs_id ;run;
data want1;
  set have;
  vidviewperson_cum + 1;
  by person_id video_id;
  if first.video_id then  vidviewperson_cum = 1;
run;

proc sort data=want1;by person_id  obs_id  ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;

fereshteh · Posted 04-01-2016 05:50 PM

This approved answer has a problem on the big dataset. For unknown reason it gives the same obs_id at each row instaed of creating a new varaible. But it works on the samle sample data. Weired. I finaly managed to do what I wanted to do by creating another variable which is called "vidviewperson_cum " : this variable counts the number of views that the person has had so far from the specific video. for example in last column fourth row this number is 3,which means at obs_id=4 for person_id =1002, this video 502 has been viewd three times. Ideally, I want to create the last column below from teh first three columns:

person_id video_id obs_id vidviewperson_cum distinctvideos_watched

1002     501           1                    1                                 1
1002               502           2                    1                                 2
1002               502           3                  2                                 2

1002 502 4 3 2

1006 205 1 1 1

1006 205 2 2 1
1006 209 3 1 2

here is my final code and it works on 1.7 m observations:

data have;
input person_id video_id obs_id $;
cards;
1002  501     1                
1002  502     2                 
1002  502     3                 
1002  502     4                 
1006  205     1                
1006  205     2               
1006  209     3                

;
run;
 
proc sort data=have;by person_id  video_id  obs_id ;run;
data want1;
  set have;
  vidviewperson_cum + 1;
  by person_id video_id;
  if first.video_id then  vidviewperson_cum = 1;
run;

proc sort data=want1;by person_id  obs_id  ;run;
data want2;set want1;by person_id obs_id ;
retain videos_watched 0;
if vidviewperson_cum=1 then videos_watched=videos_watched+1;
if first.person_id then distinctvideos_watched=1;run;

Reeza · Posted 04-01-2016 06:13 PM

This is why I mentioned that your test data is important. Interestingly enough, your code does not work on the test data I created, where a person watches a video A, then video B, then rewatches video A.

But if it works for you, that's all that matters.

Just a note that you had mentioned that you couldn't sort by video_id, but you ended up doing that in your solution anyways....

fereshteh · Posted 04-01-2016 06:24 PM

Tnx I should not sort by video_id the way it was suggested. I mean there is no way to sort and create the variable directly. I did it because I created an auxiliary variable. I will try my code again on your data soon.

fereshteh · Posted 04-02-2016 04:42 PM

Oh I ran your code again after adding obs_id column and it gave me different result. In your sample data there is no column called obs_id, but I do not understand why if we add another column it gets mixed up.

Reeza · Posted 04-02-2016 05:19 PM

It might be best to start over here...post some sample data, that reflects a few of the weird situations you may run into and I can try again. Situations that may come up include watching Video A, Video B, Video A, Video A. Watching only 1 video, Watching unique video's, multiple repeats aross different times Video A, Video B, Video A, Video B etc.

Unless your solution genuinely works for you then it doesn't matter 🙂

fereshteh · Posted 04-01-2016 04:14 PM

It does not solve my problem but I use this to create another variable

Tom · Posted 04-02-2016 08:29 PM

The get a running count of distinct videos viewed you need to have an indication of whether this record indicates a new video or not.

One way to do that is sort the data properly and create a flag variable with that information. Then you re sort the data and use the flag to generate the running count.

Here is your data with a small change on the 4th record so show that multiple videos could be viewed multiple times.

data have;
 input person_id video_id want_running_count ;
 if person_id ne lag(person_id) then obs_id=0;
 obs_id + 1;
cards;
1002 501 1
1002 502 2
1002 502 2
1002 501 2
1006 205 1
1006 205 1
1006 209 2
;;;;

So first sort and process to make a flag to indicate when the video is first watched.

proc sort data=have out=want ;
  by person_id video_id obs_id ;
run;
data want ;
 set want ;
 by person_id video_id obs_id;
 first_view = first.video_id ;
run;

Then put back into the orignal order and create the running count.

proc sort;
  by person_id obs_id;
run;
data want ;
 set want;
 by person_id obs_id;
 if first.person_id then running_count=0;
 running_count + first_view ;
run;

Reeza · Posted 04-02-2016 08:34 PM

@Tom This ignores the at each point in time. So if you have a case that's the one in my data, (Video 205, 209, 205), it won't work correctly. But the answer given also doesn't match this criteria, which is why I think the OP should re-think their problem and figure out what they want first.

EDIT: I'm wrong, this solution and OP's should be fine.

fereshteh · Posted 04-02-2016 08:47 PM

@ Reeza I am checking my code again. Last night it looked as if it was ok on the first person_ids. But my code on your sample data does not work. So let me check it again and maybe I post portion of real sequence observed in data.

fereshteh · Posted 04-02-2016 09:38 PM

@Reeza my code works on my data. I just double checked it for 1500 rows. Here's a real sequence observed:

data have;input person_id video_id obs_id want_running_count ; cards;
1079 132 1 1 
1079 59 2 2 
1079 135 3 3 
1079 143 4 4 
1079 303 5 5 
1079 143 6 5 
1079 133 7 6 
1079 135 8 6 
1079 265 9 7 
1079 143 10 7 
1079 464 11 8 
1079 464 12 8 
1079 759 13 9 
1079 464 14 9 
1079 891 15 10
;;

fereshteh · Posted 04-02-2016 09:40 PM

This is kind of similar to my idea for solving this problem and it seems to work. Thank you.

Tom · Posted 04-03-2016 09:38 AM

You could do the same thing with PROC SQL. That would avoid the explicit sorts, but probably not the actual sorts as PROC SQL will just do them for you.

proc sql ;
create table want as
select a.person_id,a.obs_id,a.video_id,sum(b.first_view) as running_total
from have a inner join 
( select person_id,video_id,obs_id,obs_id=min(obs_id) as first_view
  from have group by 1,2
) b
on a.person_id = b.person_id and a.obs_id >= b.obs_id
group by 1,2,3
order by 1,2,3
;

fereshteh · Posted 04-03-2016 09:48 PM

Cool ! I need to learn using sub-querries within a query

Registration is open