fwashburn
Fluorite | Level 6

Hi all! I'm working in SAS 9.4 with a very large dataset (5 million plus records, 172 columns) that has bits and pieces of information scattered throughout. The unique ID (I'll call UID) is repeated on multiple rows. I want to collapse all instances of a UID into one row, keeping any nonmissing, unique values, without overwriting any other nonmissing, unique values.

 

After hours of searching for solutions, I found this: http://support.sas.com/kb/32/288.html - but alas, it's only for Enterprise Guide. This note describes exactly what I'm trying to do though.

 

Here's a very simplified example of my messy dataset:

 

data have;
   input uid $ otherid1 $ otherid2 $ anotherid1 $ anotherid2 $;
   datalines;
xyz$3tyu 00012345 00123456 03456 78901
xyz$3tyu 00012345 01204789 34512 78901
;
run;
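In case it helps, here is roughly what I'd want the collapsed result to look like for this example. The suffixed variable names (otherid2_1, otherid2_2, etc.) are just illustrative placeholders for however the unique values end up being spread out:

data want_example;
   input uid $ otherid1 $ otherid2_1 $ otherid2_2 $
         anotherid1_1 $ anotherid1_2 $ anotherid2 $;
   datalines;
xyz$3tyu 00012345 00123456 01204789 03456 34512 78901
;
run;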

 

8 REPLIES
Tom
Super User

Use the UPDATE statement.  When the new record has a missing value it leaves the old value in place.

You need a master and a transaction dataset for it to work, but you can use your same dataset as both.  Just start with an empty master dataset.

data want ;
  update have(obs=0) have ;
  by uid;
run;
fwashburn
Fluorite | Level 6

Tom, thank you so much for this information. I've seen documentation on the UPDATE statement; however, it seems that I would need to write many lines of code just to make sure that none of my 171 non-UID variables get overwritten in the process (based on what I read here: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202975.htm).

 

While I'm not opposed to writing out a lot of code, it does introduce plenty of room for user error. Do you know if there is a more user-friendly approach?

Tom
Super User

It doesn't sound like you want to collapse out missing values at all.

Rather it looks like you just want to transpose your data.  That is what the first link was doing.

Instead of going for a wide format couldn't you just go for a tall skinny format?

* Add unique row id ;
data step1 ;
  set have ;
  row+1;
run;
* transpose ;
proc transpose data=step1 out=step2  ;
  by row uid ;
  var _all_;
run;
* find unique combinations;
proc sort nodupkey data=step2 out=step3 ;
  by uid _name_ col1 ;
  where upcase(_name_) not in ('ROW','UID');
  where also not missing(col1);
run;

If you really wanted you could turn that back into one row per UID, but you probably need to add your own duplicate counter so that you could generate unique variable names.

data step4;
  set step3 ;
  by uid _name_;
  colnum +1;
  if first._name_ then colnum=1;
run;
proc transpose data=step4 out=want ;
  by uid;
  id _name_ colnum ;
  var col1;
run;
ballardw
Super User

The example data format in the EG link is VERY hard to work with for a large number of processing and analysis tasks. Note that in the linked example the input data set starts with 3 columns and ends with 9: 1 + (3-1)*4 columns, because there are 4 rows with the same id. How many times is your ID variable duplicated for the one that is duplicated the most? Plug that value into 1 + (172-1)*x, where x is that number of replications of the ID. That is how many variables (columns) the resulting data set will have.

 

Assume that your ID is duplicated at most 6 times. Then you have 1+ 171*6 or 1027 variables.

I left the 1 in the formula in case it actually takes more than one variable to uniquely identify a record; replace the 1 with that number.

 

If each identifier is not repeated the exact same number of times then you also start having lots of missing values.
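To plug a real number into that formula, something along these lines (assuming the data set is named HAVE and the id variable is UID, as in the original post) would report the largest replication count:

proc sql;
   /* count rows per UID, then take the largest count */
   select max(n) as max_dups
   from (select uid, count(*) as n
         from have
         group by uid);
quit;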

 

Here is an example, using code similar to what would be generated behind the EG interface, working with a data set you should have:

proc sort data=sashelp.class out=work.class;
   by sex;
run;
data work.test;
   set work.class;
   by sex;
   retain N (0);
   if first.sex then N=1;
   else n+1;
run;
proc transpose data=work.test out=work.trans;
   by sex n;
   var name age height weight;
run;

data work.test2;
   set work.trans;
   newname=cats(_name_,N);
run;

proc transpose data=work.test2 
               out=work.want (drop=_name_);
   by sex;
   var col1;
   id newname;
run;

Note that with 9 females and 10 males (sex = F and M), some of the variables for the F row have missing values.

 

fwashburn
Fluorite | Level 6

BallardW: Thank you for your framing of the problem - I know this is a crazy conundrum and I wish it weren't so complicated; it'd sure make my life a lot easier!

 

Each UID repeats up to 9 times in the 5 million+ record dataset, with the majority of UIDs occurring 1 to 5 times. The 171 non-UID variables comprise 2 sets of other unique identifiers (90 potential values for otherid; 81 potential values for anotherid) that are attached to a UID. A UID may appear twice and have 80 otherid's that are the same in both records, but 10 otherid's that are different. 

 

Basically I want my end dataset to show one record for one UID, with all unique otherid and anotherid values listed out to the right. But I'm not certain that this is possible without some very verbose code.

Astounding
PROC Star

Just to confirm some of the overall guidelines ...

 

For a single UID, it sounds like the order doesn't matter.  If OTHERID1 takes on values ABC, DEF, and GHI, it doesn't matter which value matches with values of the other 171 columns.

 

Do you have any additional helpful information, such as:

 

  • maximum number of observations for a single UID
  • among the 171 other variables, is there always a mix of character and numeric variables?
fwashburn
Fluorite | Level 6

Astounding - thanks for taking a look.

 

A UID can repeat up to 9 times, and each time it repeats, it can have up to 171 variables.

 

Among the 171 other variables, every one is in a character format (though some appear as numeric), because there is a potential for leading zeroes.

 

I may be misunderstanding your statement on the order not mattering, but it definitely matters which values match with the UID. My end goal is to create a comprehensive dataset with one record for one UID, and all unique nonmissing otherids and anotherids listed out. I don't think that there will be more than a few hundred total, and I expect that on average, there will be maybe 5 otherids and 10 anotherids per UID; it's just getting them into the proper structure that's stumping me.

Tom
Super User

Perhaps the problem is that your starting data is neither all TALL nor all WIDE, but some sort of mixed bag?

In that case it might be better to use arrays to make the tall skinny file, so that you map all of the variables in the same "array" to the same target name.

data tall ;
  set have ;
  length _name_ $32 value $50 ;
  array A otherid: ;
  array B anotherid: ;
  _name_ = 'OTHERID';
  do i=1 to dim(A);
    if not missing(a(i)) then do;
      value = a(i);
      output;
    end;
  end;
  _name_ = 'ANOTHERID';
  do i=1 to dim(B);
    if not missing(b(i)) then do;
      value = b(i);
      output;
    end;
  end;
  keep uid _name_ value ;
run;

Now you can sort this with NODUPKEY option to find the unique values.

proc sort data=tall nodupkey;
  by uid _name_ value;
run;

You can add a counter and transpose it back up if you want to get back to a wide format.

data tall2 ;
  set tall ;
  by uid _name_;
  colnum +1;
  if first._name_ then colnum=1;
run;
proc transpose data=tall2 out=want ;
  by uid;
  id _name_ colnum ;
  var value;
run;


Discussion stats
  • 8 replies
  • 18107 views
  • 3 likes
  • 4 in conversation