BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wernie
Quartz | Level 8

Hi all,

 

I am having a really tough time figuring this out for some reason! I have a bunch of observations, some with the same ID, that I need to collapse so there is just one observation per ID. I put an example below of what I currently have and what I want (I do have some other variables that are associated with each ID that I want to keep, like facility name, but didn't include those here). So, some facilities may only have data for one contaminant (ID=2) whilst others have data for all (ID=4). Thank you!

 

Current:

 

FacilityID      Contaminant     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc   

1                      a                          10                    .                         .                          .                          .      

1                      c                           .                      .                          8                        .                           . 

2                      e                           .                       .                         .                         .                         50

3                      b                           .                      2                        .                         .                            .  

3                      d                           .                       .                          .                        12                          .  

4                      a                           75                   .                          .                           .                          . 

4                      b                            .                      5                        .                           .                           . 

4                      c                            .                      .                          1                         .                           .   

4                      d                            .                     .                            .                         25                        . 

4                      e                          .                       .                            .                           .                       40

 

Want this way:

 

FacilityID      Contaminant     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc

1                      a                          10                    .                          8                          .                          . 

2                      e                           .                       .                         .                         .                         50 

3                      b                           .                      2                        .                         12                          .

4                      a                           75                   5                        1                        25                       40 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

The following solution, first seen by @data_null__,  may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.  

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

 

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

View solution in original post

9 REPLIES 9
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Assuming that contaminent always matches result:

data want;
  merge have (keep=facilityid conta_conc where=(conta_conc ne.))
            have (keep=facilityid contb_conc where=(contb_conc ne.))
            have (keep=facilityid contc_conc where=(contc_conc ne.))
            have (keep=facilityid contd_conc where=(contd_conc ne.));
  by facilityid;
run;

/* This code to find last non missing */
data want;
 set want;
  array vals{4} conta_conc contb_conc contc_conc contd_conc;
  do i=1 to 4;
    if vals{i} ne . and contaminent="" then contaminent=vname(vals{i});
  end;
run;

Note not tested (post test data in form of datastep to get tested code).

wernie
Quartz | Level 8

Thanks. That didn't seem to work for me as each facility is still listed with multiple observations. I created a small sample dataset and then put the code below that where I separated out each contaminant into a new variable, giving it the value from the 'value' variable. I may not need that, but that was one of the first steps I took in trying to work towards collapsing the observations based on FacilityID.

 

data contam;
    input FacilityID FacilityName $ Contaminant $ Value;
    datalines;
            1            Harry         a           5
            1           Harry         d           10
            2           Samantha      b           2
            2           Samantha      c           8
            2           Samantha      e           30
            3           Abby          a           2
            4           Ned              a           5
            4           Ned           b           8
            4           Ned           c           20
            4           Ned           d           30
            4           Ned           e           55
;
run;

data Contam;
    set Contam;
    if Contaminant='a' and Value>0 then a_conc=Value;
    if Contaminant='b' and Value>0 then b_conc=Value;
    if Contaminant='c' and Value>0 then c_conc=Value;
    if Contaminant='d' and Value>0 then d_conc=Value;
    if Contaminant='e' and Value>0 then e_conc=Value;
run;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sorry, updated.  Note, not tested, post test data as a datastep if you want tested code!

Haikuo
Onyx | Level 15

The following solution, first seen by @data_null__,  may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.  

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

 

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

data_null__
Jade | Level 19

@Haikuo agree that contaminate is not needed in the flat data as it is coded into the the variable names already and the contaminate would probably be dropped. 

 

However if it was needed it could be got using this modification.

 

data want;
   if 0 then set have;
   update have(obs=0 drop=Contaminant) have(drop=Contaminant);
   by FacilityID;
   if first.FacilityID then set have(keep=Contaminant) point=_n_;
   run;
gamotte
Rhodochrosite | Level 12

With a proc sql. I removed the Contaminant column. If you want to keep it, further processing is needed in order to keep 1 row per Id.

 

proc sql;
CREATE TABLE WANT AS
SELECT FacilityID, 
       sum(ContA_conc) AS ContA_conc, 
       sum(ContB_conc) AS ContB_conc,
       sum(ContC_conc) AS ContC_conc,
       sum(ContD_conc) AS ContD_conc,
       sum(ContE_conc) AS ContE_conc
FROM have
GROUP BY FacilityID
HAVING 1
;
quit;
wernie
Quartz | Level 8

Thanks @gamotte and @Haikuo - both sets of code seem to work for structuring the data in the way that I wanted it! 🙂

ballardw
Super User

I will submit that including contaminant in the "collapsed" may not be correct as it would imply that was the only continimant.

 

Is there some reason that a struture of

FacilityId   Contaminant Concentration

1               a                     10

1               c                      8

 

would not work?

 

Think about what happens if more contaminants get added to the data? Maintaining additional variables may be a lot of work down stream.

wernie
Quartz | Level 8

I was hoping to just generate a spreadsheet with one row per facility to quickly see which contaminants were tested for that facility and if there is a given value. I guess that could probably be summarised using proc report or something perhaps but wanted it in an easier to read format I could just export to Excel for those who don't use SAS. In this case, I don't think we'll be adding additional contaminants. Oh, and I remembered that I need to create a new variable based on the sum of some of those contaminants, so I also thought it would be easier if everything was 1 row per facility (although I could be wrong!).

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2344 views
  • 6 likes
  • 6 in conversation