Solved: Re: Help: Collapsing multiple observations with different variable val...

wernie · Posted 01-17-2017 09:54 AM

Hi all,

I am having a really tough time figuring this out for some reason! I have a bunch of observations, some with the same ID, that I need to collapse so there is just one observation per ID. I put an example below of what I currently have and what I want (I do have some other variables that are associated with each ID that I want to keep, like facility name, but didn't include those here). So, some facilities may only have data for one contaminant (ID=2) whilst others have data for all (ID=4). Thank you!

Current:

FacilityID Contaminant ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc

1 a 10 . . . .

1 c . . 8 . .

2 e . . . . 50

3 b . 2 . . .

3 d . . . 12 .

4 a 75 . . . .

4 b . 5 . . .

4 c . . 1 . .

4 d . . . 25 .

4 e . . . . 40

Want this way:

FacilityID Contaminant ContA_conc ContB_conc ContC_conc ContD_conc ContE_conc

1 a 10 . 8 . .

2 e . . . . 50

3 b . 2 . 12 .

4 a 75 5 1 25 40

Haikuo · Posted 01-17-2017 10:22 AM

The following solution, first seen by @data_null__, may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

View solution in original post

RW9 · Posted 01-17-2017 10:08 AM

Assuming that contaminent always matches result:

data want;
  merge have (keep=facilityid conta_conc where=(conta_conc ne.))
            have (keep=facilityid contb_conc where=(contb_conc ne.))
            have (keep=facilityid contc_conc where=(contc_conc ne.))
            have (keep=facilityid contd_conc where=(contd_conc ne.));
  by facilityid;
run;

/* This code to find last non missing */
data want;
 set want;
  array vals{4} conta_conc contb_conc contc_conc contd_conc;
  do i=1 to 4;
    if vals{i} ne . and contaminent="" then contaminent=vname(vals{i});
  end;
run;

Note not tested (post test data in form of datastep to get tested code).

wernie · Posted 01-17-2017 10:32 AM

Thanks. That didn't seem to work for me as each facility is still listed with multiple observations. I created a small sample dataset and then put the code below that where I separated out each contaminant into a new variable, giving it the value from the 'value' variable. I may not need that, but that was one of the first steps I took in trying to work towards collapsing the observations based on FacilityID.

data contam;
   input FacilityID FacilityName $ Contaminant $ Value;
   datalines;
           1           Harry         a           5
           1           Harry         d           10
           2           Samantha      b           2
           2           Samantha      c           8
           2           Samantha      e           30
           3           Abby          a           2
           4           Ned              a           5
           4           Ned           b           8
           4           Ned           c           20
           4           Ned           d           30
           4           Ned           e           55
;
run;

data Contam;
   set Contam;
   if Contaminant='a' and Value>0 then a_conc=Value;
   if Contaminant='b' and Value>0 then b_conc=Value;
   if Contaminant='c' and Value>0 then c_conc=Value;
   if Contaminant='d' and Value>0 then d_conc=Value;
   if Contaminant='e' and Value>0 then e_conc=Value;
run;

RW9 · Posted 01-17-2017 10:39 AM

Sorry, updated. Note, not tested, post test data as a datastep if you want tested code!

Haikuo · Posted 01-17-2017 10:22 AM

The following solution, first seen by @data_null__, may seem surreal if you are not familiar with 'Update statement', but it does its job beautifully.

data have;
input FacilityID$      Contaminant$     ContA_conc     ContB_conc     ContC_conc     ContD_conc    ContE_conc ;
cards; 
1                      a                          10                    .                         .                          .                          .      
1                      c                           .                      .                          8                        .                           . 
2                      e                           .                       .                         .                         .                         50
3                      b                           .                      2                        .                         .                            .  
3                      d                           .                       .                          .                        12                          .  
4                      a                           75                   .                          .                           .                          . 
4                      b                            .                      5                        .                           .                           . 
4                      c                            .                      .                          1                         .                           .   
4                      d                            .                     .                            .                         25                        . 
4                      e                          .                       .                            .                           .                       40 
;

data want;
update have(obs=0) have;
by FacilityID;
run;

It does keep the last contaminant instead of the first one. From your description, it seems to me that it doesn't matter to your purpose.

data_null__ · Posted 01-17-2017 12:31 PM

@Haikuo agree that contaminate is not needed in the flat data as it is coded into the the variable names already and the contaminate would probably be dropped.

However if it was needed it could be got using this modification.

data want;
   if 0 then set have;
   update have(obs=0 drop=Contaminant) have(drop=Contaminant);
   by FacilityID;
   if first.FacilityID then set have(keep=Contaminant) point=_n_;
   run;

gamotte · Posted 01-17-2017 10:30 AM

With a proc sql. I removed the Contaminant column. If you want to keep it, further processing is needed in order to keep 1 row per Id.

proc sql;
CREATE TABLE WANT AS
SELECT FacilityID, 
       sum(ContA_conc) AS ContA_conc, 
       sum(ContB_conc) AS ContB_conc,
       sum(ContC_conc) AS ContC_conc,
       sum(ContD_conc) AS ContD_conc,
       sum(ContE_conc) AS ContE_conc
FROM have
GROUP BY FacilityID
HAVING 1
;
quit;

wernie · Posted 01-17-2017 10:44 AM

Thanks @gamotte and @Haikuo - both sets of code seem to work for structuring the data in the way that I wanted it! 🙂

ballardw · Posted 01-17-2017 10:36 AM

I will submit that including contaminant in the "collapsed" may not be correct as it would imply that was the only continimant.

Is there some reason that a struture of

FacilityId Contaminant Concentration

1 a 10

1 c 8

would not work?

Think about what happens if more contaminants get added to the data? Maintaining additional variables may be a lot of work down stream.

wernie · Posted 01-17-2017 10:43 AM

I was hoping to just generate a spreadsheet with one row per facility to quickly see which contaminants were tested for that facility and if there is a given value. I guess that could probably be summarised using proc report or something perhaps but wanted it in an easier to read format I could just export to Excel for those who don't use SAS. In this case, I don't think we'll be adding additional contaminants. Oh, and I remembered that I need to create a new variable based on the sum of some of those contaminants, so I also thought it would be easier if everything was 1 row per facility (although I could be wrong!).

Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Re: Help: Collapsing multiple observations with different variable values

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away