DATA Step, Macro, Functions and more

Collapse multiple rows of data into a single row within a group

Reply
Occasional Contributor
Posts: 12

Collapse multiple rows of data into a single row within a group

Hi all! I'm working in SAS 9.4 with a very large dataset (5 million plus records, 172 columns) that has bits and pieces of information scattered throughout. The unique ID (I'll call UID) is repeated on multiple rows. I want to collapse all instances of a UID into one row, keeping any nonmissing, unique values, without overwriting any other nonmissing, unique values.

 

After hours of searching for solutions, I found this: http://support.sas.com/kb/32/288.html - but alas, it's only for Enterprise Guide. This note describes exactly what I'm trying to do though.

 

Here's a very simplified example of my messy dataset:

 

data have;
   input uid $ otherid1 $ otherid2 $ anotherid1 $ anotherid2 $;
   datalines;
xyz$3tyu 00012345 00123456 03456 78901
xyz$3tyu 00012345 01204789 34512 78901 ; 

 

Super User
Super User
Posts: 8,264

Re: Collapse multiple rows of data into a single row within a group

Posted in reply to fwashburn

Use the UPDATE statement.  When the new record has a missing value it leaves the old value in place.

You need a master and transaction dataset to it to work, but you can use your same dataset as both.  Just start with an empty master dataset.

data want ;
  update have(obs=0) have ;
  by uid;
run;
Occasional Contributor
Posts: 12

Re: Collapse multiple rows of data into a single row within a group

Tom, thank you so much for this information. I've seen documentation on the UPDATE statement; however, it seems that I would need to write out many lines of code just to make sure that none of my 171 non-UID variables get overwritten in the process (based on what I read here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm).

 

While I'm not opposed to writing out a lot of code, it does introduce plenty of room for user error. Do you know if there is a more user-friendly approach?

Super User
Super User
Posts: 8,264

Re: Collapse multiple rows of data into a single row within a group

Posted in reply to fwashburn

Doesn't sound like want to collapse out missing values at all.

Rather it looks like you just want to transpose your data.  That is what the first link was doing.

Instead of going for a wide format couldn't you just go for a tall skinny format?

* Add unique row id ;
data step1 ;
  set have ;
  row+1;
run;
* transpose ;
proc transpose data=step1 out=step2  ;
  by row uid ;
  var _all_;
run;
* find unique combinations;
proc sort nodupkey data=step2 out=step3 ;
  by uid _name_ col1 ;
  where upcase(_name_) not in ('ROW','UID');
  where also not missing(col1);
run;

If you really wanted you could turn that back into one row per UID, but you probably need to add your own duplicate counter so that you could generate unique variable names.

data step4;
  set step3 ;
  by uid _name_;
  colnum +1;
  if first._name_ then colnum=1;
run;
proc transpose data=step4 out=want ;
  by uid;
  id _name_ colnum ;
  var col1;
run;
Super User
Posts: 13,889

Re: Collapse multiple rows of data into a single row within a group

Posted in reply to fwashburn

The example data for the EG in your link is a VERY hard to work with format to work with for a large number of processing and analysis tasks. Note that in the linked example the input data set starts with 3 columns and ends with 9: 1 + (3-1)*4 columns because there are 4 rows with the same id. How many times is your ID variable duplicated for the one that is duplicated the most? Plug that value into: 1+(172-1)* x, where is that number of replications of Id. That is how many variables (columns) the resulting data set will have.

 

Assume that your ID is duplicated at most 6 times. Then you have 1+ 171*6 or 1027 variables.

I left the 1 in the formula in case it actually takes more variables to uniquely identify a record, replace 1 with that number.

 

If each identifier is not repeated the exact same number of times then you also start having lots of missing values.

 

Here is an example using the code similar to what would be generated behind the EG interface working with a data set you should have.

proc sort data=sashelp.class out=work.class;
   by sex;
run;
data work.test;
   set work.class;
   by sex;
   retain N (0);
   if first.sex then N=1;
   else n+1;
run;
proc transpose data=work.test out=work.trans;
   by sex n;
   var name age height weight;
run;

data work.test2;
   set work.trans;
   newname=cats(_name_,N);
run;

proc transpose data=work.test2 
               out=work.want (drop=_name_);
   by sex;
   var col1;
   id newname;
run;

Note that with 9 females and 10 males (sex = F and M) that some of the variables for the F row have missing values.

 

Occasional Contributor
Posts: 12

Re: Collapse multiple rows of data into a single row within a group

BallardW: Thank you for your framing of the problem - I know this is a crazy conundrum and I wish it weren't so complicated; it'd sure make my life a lot easier!

 

Each UID repeats up to 9 times in the 5 million+ record dataset, with the majority of UIDs occurring 1 to 5 times. The 171 non-UID variables comprise 2 sets of other unique identifiers (90 potential values for otherid; 81 potential values for anotherid) that are attached to a UID. A UID may appear twice and have 80 otherid's that are the same in both records, but 10 otherid's that are different. 

 

Basically I want my end dataset to show one record for one UID, with all unique otherid and anotherid values listed out to the right. But I'm not certain that this is possible without some very verbose code.

Super User
Posts: 6,905

Re: Collapse multiple rows of data into a single row within a group

[ Edited ]
Posted in reply to fwashburn

Just to confirm some of the overall guidelines ...

 

For a single UID, it sounds like the order doesn't matter.  If OTHERID1 takes on values ABC, DEF, and GHI, it doesn't matter which value matches with values of the other 171 columns.

 

Do you have any additional helpful information, such as:

 

  • maximum number of observations for a single UID
  • among the 172 other variables, is there always a mix of character and numeric variables?
Occasional Contributor
Posts: 12

Re: Collapse multiple rows of data into a single row within a group

Posted in reply to Astounding

Astounding - thanks for taking a look.

 

A UID can repeat up to 9 times, and each time it repeats, it can have up to 171 variables.

 

Among the 171 other variables, every one is in a character format (though some appear as numeric), because there is a potential for leading zeroes.

 

I may be misunderstanding your statement on the order not mattering, but it definitely matters which values match with the UID. My end goal is to create a comprehensive dataset with one record for one UID, and all unique nonmissing otherids and anotherids listed out. I don't think that there will be more than a few hundred total, and I expect that on average, there will be maybe 5 otherids and 10 anotherids per UID; it's just getting them into the proper structure that's stumping me.

Super User
Super User
Posts: 8,264

Re: Collapse multiple rows of data into a single row within a group

[ Edited ]
Posted in reply to fwashburn

Perhaps the problem is that your starting data is neither all TALL or all WIDE but some sort of mixed bag?

In that case it might be better to use arrays to make the the tall skinny file so that you map all of the variables in the same "array" to same target name.

data tall ;
  set have ;
  length _name_ $32 value $50 ;
  array A otherid: ;
  array B anotherid: ;
  _name_ ='OTHERID';
  do i=1 to dim(A);
     if not missing(a(i)) then do;
        value=a(i);
        output;
    end;
  end;
  _name_ ='ANOTHERID';
  do i=1 to dim(B);
     if not missing(b(i)) then do;
        value=b(i);
        output;
    end;
  end;
 keep uid _name_ value ;
run;

Now you can sort this with NODUPKEY option to find the unique values.

proc sort nodupkey;
  by uid _name_ value;
run;

You can add a counter and transpose it back up if you want to get back to a wide format.

data tall2 ;
  set tall ;
  by uid _name_;
  colnum +1;
  if first._name_ then colnum=1;
run;
proc transpose data=tall2 out=want ;
  by uid;
  id _name_ colnum ;
  var value;
run;
Ask a Question
Discussion stats
  • 8 replies
  • 85 views
  • 0 likes
  • 4 in conversation