BookmarkSubscribeRSS Feed
foste261
Calcite | Level 5

Hello!

 

I'm working on cleaning up a dataset for analysis for my research team. However, transforming data is unfortunately my weakest skill in SAS. My graduate program prepared me really well for analysis and general data cleaning, but barely skimmed transforming and transposing data. The data I've worked with at my post-grad job so far has mostly been in the right format, with only mild transformations needed until now. I'm trying to teach myself data transformation and I've referenced The Little SAS Book multiple times but I'm unfortunately stumped with this type of transformation.

 

I have a dataset where the key identifier is ID. Most variables have one row of data. However, some variables have multiple rows of data. This leads to blank rows of data for the rest of the variables. I want to transform the data so a column with multiple rows is now in various columns, rather than rows. Here's an example with fake data of what my dataset currently looks like: 

 

ID              age            dx                        weight             test

1                 29           covid                    140                 covid

blank       blank       flu                       blank              blood pressure

blank       blank     diabetes              blank              blank

2                 33         high bp                165                  covid 

blank        blank    gest diab             blank              blood pressure 

blank        blank   preeclamp           blank              blank

 

So what I want to do is have 1 row of data for each unique ID. Each repeating instance for variables with multiple rows of data (like dx or test) would then be something like dx_1 dx_2 dx_3, etc and test_1 test_2 test_3.  

 

At first, I attempted to just delete the blank rows, which did work. But it consolidated all of the repeating instances into one row, which will cause issues for analysis. Therefore, I need to transpose them into new columns.

 

I would greatly appreciate some guidance. I am trying to master the skill of transforming data so that I can effectively do it on my own in the future. Thank you so much for your help!

                     

4 REPLIES 4
_el_doredo
Quartz | Level 8

I am not sure if this one is easier with large data. But, It worked with the data you provided. Next time please provide some code which you tried.

 

data temp;
infile cards delimiter=' ' dsd;
input id age dx :$15. weight test :$15.;
cards;
1 29 covid 140 covid
. . flu . "blood pressure"
. . diabetes . .
2 33 "high bp" 165 covid
. . "gest diab" . "blood pressure"
. . "preeclamp" . .
;
run;

data temp1;
set temp;
retain _id _age _weight;
if not missing(id) then _id=id;
else id=_id;
if not missing(age) then _age=age;
else age=_age;
if not missing(weight) then _weight=weight;
else weight=_weight;
drop _id _age _weight;
run;

proc transpose data=temp1 out=temp2 (drop=_name_) prefix=dx_;
by id age weight;
var dx;
run;

proc transpose data=temp1 out=temp3 (drop=_name_) prefix=test_;
by id age weight;
var test;
run;

data temp4;
set temp2;
set temp3;
by id age weight;
run;

ods select nlevels;
proc freq data=temp4 nlevels;
ods output nlevels=missing_var1;
run;

data missing_var(keep=tablevar);
set missing_var1;
if nnonmisslevels=0;
run;

proc sql;
select tablevar into : var separated by ' ' from missing_var;
quit;
%put &var.;

data final(drop=&var.);
set temp4;
run;

 

But, these are some open ends in your data like one is are we sure that the blanks in row 2 and 3 belongs to 1. I wrote the code with the assumption of row 2 and 3 belongs to row 1.

ballardw
Super User

You want to think very carefully about making a wide data set as you propose. What do you expect to do with it.

For almost any modeling, analysis or reporting SAS will work much better with a single observation/variable. Which would mean that bringing the previous values to the blank values would make more sense, especially for the identification variables.

 

Can you describe the sort of analysis you expect to do with the resulting data set?

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

 

Example: (I took some liberties with spelling to simplify code for the character variables).

data have;
  input ID  age  dx :$15. weight test :$15.;
datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

An example of a likely better data set for most activities:

data moreuseful;
   set have;
   retain last_id Last_age Last_weight;
   if id then last_id=id;
   if age then last_age=age;
   if weight then last_weight=weight;
   if (missing(id)) then id=last_id;
   if (missing(age)) then age=last_age;
   if (missing(weight)) then weight=last_weight;
   drop last_id Last_age Last_weight;
run;

Just one example of why the "wide" data is hard to work with: Pick a DX value. How do you select the records with just that one for analysis? If you have 15 (or more) DX variables you have to examine every single one of them. Then spend some mildly obnoxious coding to find the associated test values.

data_null__
Jade | Level 19

To me the real problem is you don't have ID on all obs for an ID.  I think the first thing you should do is get the ID and demographic information and all records.   Then you have a data set that you can work with more easily.

 

data have;
   retain dummy 0;
   input ID  age  dx :$15. weight test :$15.;
   datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

data want1(drop=dummy);
   update have(obs=0 keep=dummy) have(keep=dummy id);
   by dummy;
   set have(drop=dummy id);
   output;
   run;
proc print;
   run;
data want2;
   update want1(obs=0 keep=ID) want1(keep=id age weight);
   by id;
   if first.id then r=0;
   r+1;
   set want1(keep=dx test);
   output;
   run;
proc print;
   run;

proc transpose data=want2 out=want3;
   by id age weight r;
   var dx test;
   run;
proc sort data=want3;
   by id _name_ r;
   run;
proc transpose data=want3 out=want4(drop=_name_) delim=_;
   by id age weight;
   var col1;
   id _name_ r;
   run;
proc print;
   run;
   

This example uses the UPDATE statement to carry forward non-missing values ID,  and then again to update age and weight. 

Capture.PNG

Ksharp
Super User
data have;
  input ID  age  dx :$15. weight test :$15.;
datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

data temp;
 set have;
 retain _id _age;
 if not missing(id) then _id=id;
 if not missing(age) then _age=age;
 drop id age;
run;
proc sql noprint;
select max(n) into :n
 from (select count(*) as n from temp group by _id,_age);
quit;
proc summary data=temp nway;
class _id _age;
output out=want idgroup(out[&n.] (dx weight test)=);
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 181 views
  • 0 likes
  • 5 in conversation