BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fereshteh
Obsidian | Level 7

Hi All,

 

I am working on a panel data. Multiple individuals view different videos. Does any one know how I can create running distinct count of videos viewed within each group (peron_id) ? At each pont of time, I would like to see how many distinct videos the person has watched. Obs_id is created based on time order of viewing events. My desired table looks like below:

 

person_id   video_id  running_count    obs_id

1002           501          1                        1

1002           502          2                        2

1002           502          2                        3

1002           502          2                        4

1006           205          1                        1

1006           205          1

1006           209          2

 

So far, I have tried this:

proc sql;
create table final3 as
SELECT person_id,
video_id,
count(distinct video_id) OVER
(ORDER BY person_id)
AS running_total
FROM final2;

 

but error message pops up for the use of "OVER" here. 

Any ideas how I can proceed?

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Here's my solution. Your sample data isn't a good one because it doesn't identify the issues with your data. I changed the last ID so it would. I have no idea about efficiency but it should work. 1.7 million rows isn't very much anywas. I would drop the videos_watched variable at the end, once you've verified this works. 

 

data have;
input person_id video_id $;
cards;
1002           501 
1002           502 
1002           502 
1002           502 
1006           209 
1006           205 
1006           209 
;
run;

data want;
length videos_watched $500.;
set have;
by person_Id video_id notsorted;

retain videos_watched count;

if first.person_id then do;
call missing(videos_watched);
count=0;
end;

if first.video_id then do;
	if index(videos_watched, trim(video_id))=0 then do;
		count+1;
		videos_watched=catx(', ', videos_watched, video_id);
	end;
end;

run;

proc print; run;

View solution in original post

29 REPLIES 29
Kurt_Bremser
Super User
proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;

proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;
fereshteh
Obsidian | Level 7
Tnx but this does not work. I tried this code and it gives me the number of observations per person.
Kurt_Bremser
Super User

No:

data have;
input person_id video_id;
cards;
1002           501          1                        1
1002           502          2                        2
1002           502          2                        3
1002           502          2                        4
1006           205          1                        1
1006           205          1
1006           209          2
;
run;

proc sort data=have (keep=person_id video_id) nodupkey;
by person_id video_id;
run;

proc summary data=have;
by person_id;
output out=want (keep=person_id _freq_);
run;

proc print;
run;

Gives this output, which only counts distinct video ID's per person:

                                                             person_
                                                      Obs       id      _FREQ_

                                                       1       1002        2  
                                                       2       1006        2  

The count of observations per person would be 4 for 1002 and 3 for 1006

 

fereshteh
Obsidian | Level 7

This neglects the "at each point of time" and counts distincts videos watched at the latest time for each person. But I was trying to get distinct counts at each point of time. Maybe I was not clear. In my description I wrote point of time is replcaed by Obs_id. For example, at time=1 (obs_id=1) for the first person, distinct videos watched is one. 

LinusH
Tourmaline | Level 20

That's not valid SAS syntax.

And running counts is not really a SQL thing. SQL is column based. If the data is sorted, try a data step instead.

Data never sleeps
fereshteh
Obsidian | Level 7
I do not get your point. Can you be more clear on how?
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Please post test data in the form of a datastep.

Try something like (not tested):

proc sort data=have;
  by person_id video_id;
run;
data want;
  set have;
  by person_id video_id;
  retain running_count;
  if first.video_id then running_count=1;
  else running_count=running_count+1;
run;

 

fereshteh
Obsidian | Level 7

Ok. Unfortunately, I cannot sort by video_id. That's not what I want to create. I should sort first by person_id then by time (obs_id). I already wrote sth similar to your code but that's another variable of interest.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ok, thats not clear from your data - labelling a variable obs_id is not a good idea - lable it something pertaining to the data it contains.  The same principal is there regarless of what variables are used:

1) Sort the data so the data appears firt in the highest group, then the next group etc. for each group, and then data point count order.

2) Step through data retaining a count at each record, resetting when you get to the first occurence of a new group.

proc sort data=have;
  by person_id obs_id video_id;
run;
data want;
  set have;
  by person_id obs_id video_id;
  retain running_count;
  if first.video_id then running_count=1;
  else running_count=running_count+1;
run;

Or the same thing can be achieved in SQL with group by person_id, obs_id, video_id.  It is really the grouping that matters.

fereshteh
Obsidian | Level 7

Oh this gives me a lot of ones for the running count because at each row when it approaches a new video_id it resets the running count to zero. I will post a sample data through data step.

Reeza
Super User

Why can't you sort by video_id?

 

Sort it to calculate the count and then resort to what you want.

fereshteh
Obsidian | Level 7
because that does not give me the variable I would like to create. I would like to know counts of distinct video views AT EACH POINT OF TIME for each person. Data should be sorted by: person_id viewing_date video_id
Then I would like to count distinct views up to each viewing date (=time=obs_id)
Reeza
Super User

How big is your dataset, how many potential videos are watched, do you have a max? At each point you're going to have search backwards for the video...I would create a string that stored the video id's retain it across the id. If the id is found then don't increment, otherwise increment and add to list.  Brute force but would work....I'll post some code in a bit. 

fereshteh
Obsidian | Level 7
I have around 1.7 million rows in the panel. each person watches 20 videos say, on average. tnx!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 29 replies
  • 8245 views
  • 12 likes
  • 6 in conversation