Re: How do I find the most commonly occurring sex, across different ev...

Annabeth · Posted 06-06-2019 02:35 AM

Hi there,

I have data from 2005 to 2016. The counting unit of the data is event_id, each row is one event id and has demographic data associated with that event. One of the variables in the dataset is a unique person identifier we shall call it personid, a person may have many different events. My boss would like me to count unique people by sex. However, as the data is of interesting quality you might have say 7 different events for one personid and they might say FEMALE 5 times then MALE once and unknown once. I would like to, over the entire collection (forgetting about year), attach the most commonly occurring sex to the unique person identifier, then I guess join the associated event ids to the person identifier for each year. I will make another data set for unique people at this point that I can pull from.

Since there are like over 400,000 records/events, how do I go about finding the most occurring value for sex and attach it to the person identifier? (the more efficient the better). Using SAS 9.4

Thank you. An example of what my data looks like below:

event id PERSONID sex other demographic year

86590 767898 FEMALE xxx 2005

86591 767898 FEMALE xxx 2005

86592 767898 MALE xxx 2007

86593 767898 UNKNOWN xxx 2008

Thanks in advance for your help!

Jagadishkatam · Posted 06-06-2019 04:00 AM

please try the proc sql

proc sql;
create table want as select count(personid) as count, eventid,personid,sex,year from have 
group by eventid,personid,sex,year
order by eventid,personid,sex,year;
quit;

Thanks,
Jag

Kurt_Bremser · Posted 06-06-2019 04:51 AM

You need to create an intermediary table for use in a subquery (I tried to pack it into a single subquery, but that would either fail or become quite unwieldy)

data have;
input event_id $ personid $ sex $ other_dem $ year;
datalines;
86590 767898 FEMALE  xxx 2005
86591 767898 FEMALE  xxx 2005
86592 767898 MALE    xxx 2007
86593 767898 UNKNOWN xxx 2008
;

proc sql;
create table sex as
select personid, sex, count(*) as count from have
group by personid, sex
;
create table want as
select
  a.event_id,
  a.personid,
  (
    select b.sex
    from sex b
    where b.personid = a.personid
    group by personid
    having count = max(count)
  ) as sex,
  a.other_dem,
  a.year
from have a
;
quit;

Note how I presented your example data in readily usable form by putting it into a data step with datalines. Please do so yourself in the future, as it makes helping you a lot easier for us.

You can also do it in a data step by using double do loops:

proc sort data=have;
by personid sex;
run;

data want;
maxcount = 0;
do until (last.personid);
  count = 0;
  do until (last.sex);
    set have;
    by personid sex;
    count + 1;
  end;
  if count > maxcount
  then do;
    maxcount = count;
    newsex = sex;
  end;
end;
do until (last.personid);
  set have;
  by personid;
  sex = newsex;
  output;
end;
drop count maxcount newsex;
run;

proc sort data=want;
by event_id;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 06-06-2019 05:11 AM

What should happen in case of ties?

@Kurt_Bremser I believe your SQL falls over if there are ties.

Kurt_Bremser · Posted 06-06-2019 06:08 AM

Yep. I'm much more confident with the data step for this, as the preceding sort may be used to set a preference in case of ties.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Annabeth · Posted 06-10-2019 08:49 PM

Hi there,

Thank you for all of your help! I have been on holiday so only just been able to use and the do loop is working well!
When you say that you could set a preference in the preceding sort in the case of ties - how would I do that? Sorry I am such a beginner! For instance, when there is FEMALE and UNKNOWN or MALE and UNKNOWN that are ties I would like it to not be unknown and if there is FEMALE or MALE, I would like it to be deleted!

How would you do that?

Thank you in advance.

Kurt_Bremser · Posted 06-11-2019 01:29 AM

With your original data, UNKNOWN filters itself out if it's part of a tie. Keep in mind that it is the last in the alphabetic sorting sequence (F-M-U), and if its count is not larger than any other count, F or M will take precedence.

For a tie of F and M, you can check that in and after the first do loop in the data step. Keep separate counters for F and M, and if they are equal after the do loop, set to unknown or missing.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Annabeth · Posted 06-11-2019 02:50 AM

Amazing. Thank you!

How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

Re: How do I find the most commonly occurring sex, across different events, for unique personid?

SAS Innovate 2025: Call for Content

Classroom Training Available!