BookmarkSubscribeRSS Feed
Annabeth
Calcite | Level 5

Hi there, 

 

I have data from 2005 to 2016. The counting unit of the data is event_id, each row is one event id and has demographic data associated with that event. One of the variables in the dataset is a unique person identifier we shall call it personid, a person may have many different events. My boss would like me to count unique people by sex. However, as the data is of interesting quality you might have say 7 different events for one personid and they might say FEMALE 5 times then MALE once and unknown once. I would like to, over the entire collection (forgetting about year), attach the most commonly occurring sex to the unique person identifier, then I guess join the associated event ids to the person identifier for each year. I will make another data set for unique people at this point that I can pull from. 

Since there are like over 400,000 records/events, how do I go about finding the most occurring value for sex and attach it to the person identifier? (the more efficient the better). Using SAS 9.4 

 

Thank you. An example of what my data looks like below:

 

event id              PERSONID             sex                   other demographic    year

86590                767898                    FEMALE          xxx                             2005

86591                767898                    FEMALE          xxx                             2005

86592                767898                    MALE              xxx                              2007

86593                767898                    UNKNOWN     xxx                              2008

 

Thanks in advance for your help! 

7 REPLIES 7
Jagadishkatam
Amethyst | Level 16

please try the proc sql

 

proc sql;
create table want as select count(personid) as count, eventid,personid,sex,year from have 
group by eventid,personid,sex,year
order by eventid,personid,sex,year;
quit;
Thanks,
Jag
Kurt_Bremser
Super User

You need to create an intermediary table for use in a subquery (I tried to pack it into a single subquery, but that would either fail or become quite unwieldy)

data have;
input event_id $ personid $ sex $ other_dem $ year;
datalines;
86590 767898 FEMALE  xxx 2005
86591 767898 FEMALE  xxx 2005
86592 767898 MALE    xxx 2007
86593 767898 UNKNOWN xxx 2008
;

proc sql;
create table sex as
select personid, sex, count(*) as count from have
group by personid, sex
;
create table want as
select
  a.event_id,
  a.personid,
  (
    select b.sex
    from sex b
    where b.personid = a.personid
    group by personid
    having count = max(count)
  ) as sex,
  a.other_dem,
  a.year
from have a
;
quit; 

Note how I presented your example data in readily usable form by putting it into a data step with datalines. Please do so yourself in the future, as it makes helping you a lot easier for us.

 

You can also do it in a data step by using double do loops:

proc sort data=have;
by personid sex;
run;

data want;
maxcount = 0;
do until (last.personid);
  count = 0;
  do until (last.sex);
    set have;
    by personid sex;
    count + 1;
  end;
  if count > maxcount
  then do;
    maxcount = count;
    newsex = sex;
  end;
end;
do until (last.personid);
  set have;
  by personid;
  sex = newsex;
  output;
end;
drop count maxcount newsex;
run;

proc sort data=want;
by event_id;
run;
Patrick
Opal | Level 21

What should happen in case of ties?

@Kurt_Bremser  I believe your SQL falls over if there are ties.

Annabeth
Calcite | Level 5
Hi there,

Thank you for all of your help! I have been on holiday so only just been able to use and the do loop is working well!
When you say that you could set a preference in the preceding sort in the case of ties - how would I do that? Sorry I am such a beginner! For instance, when there is FEMALE and UNKNOWN or MALE and UNKNOWN that are ties I would like it to not be unknown and if there is FEMALE or MALE, I would like it to be deleted!

How would you do that?

Thank you in advance.
Kurt_Bremser
Super User

With your original data, UNKNOWN filters itself out if it's part of a tie. Keep in mind that it is the last in the alphabetic sorting sequence (F-M-U), and if its count is not larger than any other count, F or M will take precedence.

 

For a tie of F and M, you can check that in and after the first do loop in the data step. Keep separate counters for F and M, and if they are equal after the do loop, set to unknown or missing.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2281 views
  • 1 like
  • 4 in conversation