Hi all,
I am having a really tough time figuring this out for some reason! I have a bunch of observations, some with the same ID, that I need to collapse so there is just one observation per ID. I put an example below of what I currently have and what I want (I do have some other variables that are associated with each ID that I want to keep, like facility name, but didn't include those here). So, some facilities may only have data for one contaminant (ID=2) whilst others have data for all (ID=4). Thank you!
Current:
FacilityID Contaminant ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc
1 a 10 . . . .
1 c . . 8 . .
2 e . . . . 50
3 b . 2 . . .
3 d . . . 12 .
4 a 75 . . . .
4 b . 5 . . .
4 c . . 1 . .
4 d . . . 25 .
4 e . . . . 40
Want this way:
FacilityID Contaminant ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc
1 a 10 . 8 . .
2 e . . . . 50
3 b . 2 . 12 .
4 a 75 5 1 25 40
The following solution, first seen by @data_null__, may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.
data have;
input FacilityID$ Contaminant$ ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc ;
cards;
1 a 10 . . . .
1 c . . 8 . .
2 e . . . . 50
3 b . 2 . . .
3 d . . . 12 .
4 a 75 . . . .
4 b . 5 . . .
4 c . . 1 . .
4 d . . . 25 .
4 e . . . . 40
;
data want;
update have(obs=0) have;
by FacilityID;
run;
It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.
Assuming that contaminent always matches result:
data want; merge have (keep=facilityid conta_conc where=(conta_conc ne.)) have (keep=facilityid contb_conc where=(contb_conc ne.)) have (keep=facilityid contc_conc where=(contc_conc ne.)) have (keep=facilityid contd_conc where=(contd_conc ne.)); by facilityid; run; /* This code to find last non missing */ data want; set want; array vals{4} conta_conc contb_conc contc_conc contd_conc; do i=1 to 4; if vals{i} ne . and contaminent="" then contaminent=vname(vals{i}); end; run;
Note not tested (post test data in form of datastep to get tested code).
Thanks. That didn't seem to work for me as each facility is still listed with multiple observations. I created a small sample dataset and then put the code below that where I separated out each contaminant into a new variable, giving it the value from the 'value' variable. I may not need that, but that was one of the first steps I took in trying to work towards collapsing the observations based on FacilityID.
data contam;
input FacilityID FacilityName $ Contaminant $ Value;
datalines;
1 Harry a 5
1 Harry d 10
2 Samantha b 2
2 Samantha c 8
2 Samantha e 30
3 Abby a 2
4 Ned a 5
4 Ned b 8
4 Ned c 20
4 Ned d 30
4 Ned e 55
;
run;
data Contam;
set Contam;
if Contaminant='a' and Value>0 then a_conc=Value;
if Contaminant='b' and Value>0 then b_conc=Value;
if Contaminant='c' and Value>0 then c_conc=Value;
if Contaminant='d' and Value>0 then d_conc=Value;
if Contaminant='e' and Value>0 then e_conc=Value;
run;
Sorry, updated. Note, not tested, post test data as a datastep if you want tested code!
The following solution, first seen by @data_null__, may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.
data have;
input FacilityID$ Contaminant$ ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc ;
cards;
1 a 10 . . . .
1 c . . 8 . .
2 e . . . . 50
3 b . 2 . . .
3 d . . . 12 .
4 a 75 . . . .
4 b . 5 . . .
4 c . . 1 . .
4 d . . . 25 .
4 e . . . . 40
;
data want;
update have(obs=0) have;
by FacilityID;
run;
It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.
@Haikuo agree that contaminate is not needed in the flat data as it is coded into the the variable names already and the contaminate would probably be dropped.
However if it was needed it could be got using this modification.
data want;
if 0 then set have;
update have(obs=0 drop=Contaminant) have(drop=Contaminant);
by FacilityID;
if first.FacilityID then set have(keep=Contaminant) point=_n_;
run;
With a proc sql. I removed the Contaminant column. If you want to keep it, further processing is needed in order to keep 1 row per Id.
proc sql;
CREATE TABLE WANT AS
SELECT FacilityID,
sum(ContA_conc) AS ContA_conc,
sum(ContB_conc) AS ContB_conc,
sum(ContC_conc) AS ContC_conc,
sum(ContD_conc) AS ContD_conc,
sum(ContE_conc) AS ContE_conc
FROM have
GROUP BY FacilityID
HAVING 1
;
quit;
I will submit that including contaminant in the "collapsed" may not be correct as it would imply that was the only continimant.
Is there some reason that a struture of
FacilityId Contaminant Concentration
1 a 10
1 c 8
would not work?
Think about what happens if more contaminants get added to the data? Maintaining additional variables may be a lot of work down stream.
I was hoping to just generate a spreadsheet with one row per facility to quickly see which contaminants were tested for that facility and if there is a given value. I guess that could probably be summarised using proc report or something perhaps but wanted it in an easier to read format I could just export to Excel for those who don't use SAS. In this case, I don't think we'll be adding additional contaminants. Oh, and I remembered that I need to create a new variable based on the sum of some of those contaminants, so I also thought it would be easier if everything was 1 row per facility (although I could be wrong!).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.