DATA Step, Macro, Functions and more

Help: Collapsing multiple observations with different variable values

Accepted Solution Solved
Reply
Contributor
Posts: 34
Accepted Solution

Help: Collapsing multiple observations with different variable values

[ Edited ]

Hi all,

 

I am having a really tough time figuring this out for some reason! I have a bunch of observations, some with the same ID, that I need to collapse so there is just one observation per ID. I put an example below of what I currently have and what I want (I do have some other variables that are associated with each ID that I want to keep, like facility name, but didn't include those here). So, some facilities may only have data for one contaminant (ID=2) whilst others have data for all (ID=4). Thank you!

 

Current:

 

FacilityID      Contaminant     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc   

1                      a                          10                    .                         .                          .                          .      

1                      c                           .                      .                          8                        .                           . 

2                      e                           .                       .                         .                         .                         50

3                      b                           .                      2                        .                         .                            .  

3                      d                           .                       .                          .                        12                          .  

4                      a                           75                   .                          .                           .                          . 

4                      b                            .                      5                        .                           .                           . 

4                      c                            .                      .                          1                         .                           .   

4                      d                            .                     .                            .                         25                        . 

4                      e                          .                       .                            .                           .                       40

 

Want this way:

 

FacilityID      Contaminant     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc

1                      a                          10                    .                          8                          .                          . 

2                      e                           .                       .                         .                         .                         50 

3                      b                           .                      2                        .                         12                          .

4                      a                           75                   5                        1                        25                       40 

 


Accepted Solutions
Solution
‎01-17-2017 10:49 AM
Respected Advisor
Posts: 3,124

Re: Help: Collapsing multiple observations with different variable values

The following solution, first seen by @data_null__,  may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.  

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

 

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

View solution in original post


All Replies
Super User
Super User
Posts: 7,407

Re: Help: Collapsing multiple observations with different variable values

[ Edited ]

Assuming that contaminent always matches result:

data want;
  merge have (keep=facilityid conta_conc where=(conta_conc ne.))
            have (keep=facilityid contb_conc where=(contb_conc ne.))
            have (keep=facilityid contc_conc where=(contc_conc ne.))
            have (keep=facilityid contd_conc where=(contd_conc ne.));
  by facilityid;
run;

/* This code to find last non missing */
data want;
 set want;
  array vals{4} conta_conc contb_conc contc_conc contd_conc;
  do i=1 to 4;
    if vals{i} ne . and contaminent="" then contaminent=vname(vals{i});
  end;
run;

Note not tested (post test data in form of datastep to get tested code).

Contributor
Posts: 34

Re: Help: Collapsing multiple observations with different variable values

Thanks. That didn't seem to work for me as each facility is still listed with multiple observations. I created a small sample dataset and then put the code below that where I separated out each contaminant into a new variable, giving it the value from the 'value' variable. I may not need that, but that was one of the first steps I took in trying to work towards collapsing the observations based on FacilityID.

 

data contam;
    input FacilityID FacilityName $ Contaminant $ Value;
    datalines;
            1            Harry         a           5
            1           Harry         d           10
            2           Samantha      b           2
            2           Samantha      c           8
            2           Samantha      e           30
            3           Abby          a           2
            4           Ned              a           5
            4           Ned           b           8
            4           Ned           c           20
            4           Ned           d           30
            4           Ned           e           55
;
run;

data Contam;
    set Contam;
    if Contaminant='a' and Value>0 then a_conc=Value;
    if Contaminant='b' and Value>0 then b_conc=Value;
    if Contaminant='c' and Value>0 then c_conc=Value;
    if Contaminant='d' and Value>0 then d_conc=Value;
    if Contaminant='e' and Value>0 then e_conc=Value;
run;

Super User
Super User
Posts: 7,407

Re: Help: Collapsing multiple observations with different variable values

Sorry, updated.  Note, not tested, post test data as a datastep if you want tested code!

Solution
‎01-17-2017 10:49 AM
Respected Advisor
Posts: 3,124

Re: Help: Collapsing multiple observations with different variable values

The following solution, first seen by @data_null__,  may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.  

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

 

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

Respected Advisor
Posts: 3,777

Re: Help: Collapsing multiple observations with different variable values

@Haikuo agree that contaminate is not needed in the flat data as it is coded into the the variable names already and the contaminate would probably be dropped. 

 

However if it was needed it could be got using this modification.

 

data want;
   if 0 then set have;
   update have(obs=0 drop=Contaminant) have(drop=Contaminant);
   by FacilityID;
   if first.FacilityID then set have(keep=Contaminant) point=_n_;
   run;
Regular Contributor
Posts: 194

Re: Help: Collapsing multiple observations with different variable values

[ Edited ]

With a proc sql. I removed the Contaminant column. If you want to keep it, further processing is needed in order to keep 1 row per Id.

 

proc sql;
CREATE TABLE WANT AS
SELECT FacilityID, 
       sum(ContA_conc) AS ContA_conc, 
       sum(ContB_conc) AS ContB_conc,
       sum(ContC_conc) AS ContC_conc,
       sum(ContD_conc) AS ContD_conc,
       sum(ContE_conc) AS ContE_conc
FROM have
GROUP BY FacilityID
HAVING 1
;
quit;
Contributor
Posts: 34

Re: Help: Collapsing multiple observations with different variable values

Thanks @gamotte and @Haikuo - both sets of code seem to work for structuring the data in the way that I wanted it! Smiley Happy

Super User
Posts: 10,516

Re: Help: Collapsing multiple observations with different variable values

I will submit that including contaminant in the "collapsed" may not be correct as it would imply that was the only continimant.

 

Is there some reason that a struture of

FacilityId   Contaminant Concentration

1               a                     10

1               c                      8

 

would not work?

 

Think about what happens if more contaminants get added to the data? Maintaining additional variables may be a lot of work down stream.

Contributor
Posts: 34

Re: Help: Collapsing multiple observations with different variable values

[ Edited ]

I was hoping to just generate a spreadsheet with one row per facility to quickly see which contaminants were tested for that facility and if there is a given value. I guess that could probably be summarised using proc report or something perhaps but wanted it in an easier to read format I could just export to Excel for those who don't use SAS. In this case, I don't think we'll be adding additional contaminants. Oh, and I remembered that I need to create a new variable based on the sum of some of those contaminants, so I also thought it would be easier if everything was 1 row per facility (although I could be wrong!).

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 258 views
  • 6 likes
  • 6 in conversation