Hello all,
In my data, I would like to sum columns based on multiple column criteria. Notmally, for example, if I had a single column to sum like ‘gender’ I would just sum it based on year-id and I could tell how many women I have per year per organization. However, I have 3 columns and I would like to how often each of their unique combinations occurs per year-id. So, I have variables named year-id, description (which is a text based description), and ethnicity (binary variable). There is not a unique id for each observation, something like “1997-101 Tall 1’ or ‘1998-102 Short 0’ could appear several times and I want to count how many unique combinations of each observation there are. Also, there are many different description types. If someone could give me some suggestions, I would very much appreciate it!
Thanks in advance.
I don't fully understand what you're asking for so below just as a thought options how you could generate an additional variable Group ID - just to make further coding simpler.
data have;
infile datalines truncover;
input (yearid description gender) ($) ethnicity otherVar $;
datalines;
1 aaa F 0 W
1 bbb M 1 X
1 aaa F 0 Y
2 aaa F 0 Z
;
run;
/* option 1 */
data option1;
set have;
groupID=put(md5(upcase(catx('|',yearid,description,gender,ethnicity))),hex32.);
run;
/* option 2 */
proc sort data=have out=option2;
by yearid description gender ethnicity;
run;
data option2;
set option2;
by yearid description gender ethnicity;
if first.ethnicity then groupID+1;
run;
MD5() used in option 1 is deterministic and will for the same source string always return the same hash value (digest value). This can be useful to combine data from different sources.
I've added the upcase() funtion to make the hash value generation case insensitive.
Perhaps you could show us a representative portion of your data, following these instructions, and then also show us the desired output.
Also, in one place you talk about a sum, but in another place you say "I want to count how many unique combinations of each observation there are". So we need additional clarification there as well.
PROC FREQ allows you to count combinations. If that's what you need, it's fairly straightforward. For example:
proc freq data=have;
tables yearid * description * gender / missing list;
run;
You might be searching for a slightly different set of counts, but the approach would likely remain the same.
I don't fully understand what you're asking for so below just as a thought options how you could generate an additional variable Group ID - just to make further coding simpler.
data have;
infile datalines truncover;
input (yearid description gender) ($) ethnicity otherVar $;
datalines;
1 aaa F 0 W
1 bbb M 1 X
1 aaa F 0 Y
2 aaa F 0 Z
;
run;
/* option 1 */
data option1;
set have;
groupID=put(md5(upcase(catx('|',yearid,description,gender,ethnicity))),hex32.);
run;
/* option 2 */
proc sort data=have out=option2;
by yearid description gender ethnicity;
run;
data option2;
set option2;
by yearid description gender ethnicity;
if first.ethnicity then groupID+1;
run;
MD5() used in option 1 is deterministic and will for the same source string always return the same hash value (digest value). This can be useful to combine data from different sources.
I've added the upcase() funtion to make the hash value generation case insensitive.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.