Hi there,
I have data from 2005 to 2016. The counting unit of the data is event_id, each row is one event id and has demographic data associated with that event. One of the variables in the dataset is a unique person identifier we shall call it personid, a person may have many different events. My boss would like me to count unique people by sex. However, as the data is of interesting quality you might have say 7 different events for one personid and they might say FEMALE 5 times then MALE once and unknown once. I would like to, over the entire collection (forgetting about year), attach the most commonly occurring sex to the unique person identifier, then I guess join the associated event ids to the person identifier for each year. I will make another data set for unique people at this point that I can pull from.
Since there are like over 400,000 records/events, how do I go about finding the most occurring value for sex and attach it to the person identifier? (the more efficient the better). Using SAS 9.4
Thank you. An example of what my data looks like below:
event id PERSONID sex other demographic year
86590 767898 FEMALE xxx 2005
86591 767898 FEMALE xxx 2005
86592 767898 MALE xxx 2007
86593 767898 UNKNOWN xxx 2008
Thanks in advance for your help!
please try the proc sql
proc sql;
create table want as select count(personid) as count, eventid,personid,sex,year from have
group by eventid,personid,sex,year
order by eventid,personid,sex,year;
quit;
You need to create an intermediary table for use in a subquery (I tried to pack it into a single subquery, but that would either fail or become quite unwieldy)
data have;
input event_id $ personid $ sex $ other_dem $ year;
datalines;
86590 767898 FEMALE xxx 2005
86591 767898 FEMALE xxx 2005
86592 767898 MALE xxx 2007
86593 767898 UNKNOWN xxx 2008
;
proc sql;
create table sex as
select personid, sex, count(*) as count from have
group by personid, sex
;
create table want as
select
a.event_id,
a.personid,
(
select b.sex
from sex b
where b.personid = a.personid
group by personid
having count = max(count)
) as sex,
a.other_dem,
a.year
from have a
;
quit;
Note how I presented your example data in readily usable form by putting it into a data step with datalines. Please do so yourself in the future, as it makes helping you a lot easier for us.
You can also do it in a data step by using double do loops:
proc sort data=have;
by personid sex;
run;
data want;
maxcount = 0;
do until (last.personid);
count = 0;
do until (last.sex);
set have;
by personid sex;
count + 1;
end;
if count > maxcount
then do;
maxcount = count;
newsex = sex;
end;
end;
do until (last.personid);
set have;
by personid;
sex = newsex;
output;
end;
drop count maxcount newsex;
run;
proc sort data=want;
by event_id;
run;
What should happen in case of ties?
@Kurt_Bremser I believe your SQL falls over if there are ties.
Yep. I'm much more confident with the data step for this, as the preceding sort may be used to set a preference in case of ties.
With your original data, UNKNOWN filters itself out if it's part of a tie. Keep in mind that it is the last in the alphabetic sorting sequence (F-M-U), and if its count is not larger than any other count, F or M will take precedence.
For a tie of F and M, you can check that in and after the first do loop in the data step. Keep separate counters for F and M, and if they are equal after the do loop, set to unknown or missing.
Amazing. Thank you!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.