Re: Transposing Data with Blank Rows and Repeating Instances

foste261 · Posted 10-31-2024 01:21 PM

Hello!

I'm working on cleaning up a dataset for analysis for my research team. However, transforming data is unfortunately my weakest skill in SAS. My graduate program prepared me really well for analysis and general data cleaning, but barely skimmed transforming and transposing data. The data I've worked with at my post-grad job so far has mostly been in the right format, with only mild transformations needed until now. I'm trying to teach myself data transformation and I've referenced The Little SAS Book multiple times but I'm unfortunately stumped with this type of transformation.

I have a dataset where the key identifier is ID. Most variables have one row of data. However, some variables have multiple rows of data. This leads to blank rows of data for the rest of the variables. I want to transform the data so a column with multiple rows is now in various columns, rather than rows. Here's an example with fake data of what my dataset currently looks like:

ID age dx weight test

1 29 covid 140 covid

blank blank flu blank blood pressure

blank blank diabetes blank blank

2 33 high bp 165 covid

blank blank gest diab blank blood pressure

blank blank preeclamp blank blank

So what I want to do is have 1 row of data for each unique ID. Each repeating instance for variables with multiple rows of data (like dx or test) would then be something like dx_1 dx_2 dx_3, etc and test_1 test_2 test_3.

At first, I attempted to just delete the blank rows, which did work. But it consolidated all of the repeating instances into one row, which will cause issues for analysis. Therefore, I need to transpose them into new columns.

I would greatly appreciate some guidance. I am trying to master the skill of transforming data so that I can effectively do it on my own in the future. Thank you so much for your help!

_el_doredo · Posted 10-31-2024 02:15 PM

I am not sure if this one is easier with large data. But, It worked with the data you provided. Next time please provide some code which you tried.

data temp;
infile cards delimiter=' ' dsd;
input id age dx :$15. weight test :$15.;
cards;
1 29 covid 140 covid
. . flu . "blood pressure"
. . diabetes . .
2 33 "high bp" 165 covid
. . "gest diab" . "blood pressure"
. . "preeclamp" . .
;
run;

data temp1;
set temp;
retain _id _age _weight;
if not missing(id) then _id=id;
else id=_id;
if not missing(age) then _age=age;
else age=_age;
if not missing(weight) then _weight=weight;
else weight=_weight;
drop _id _age _weight;
run;

proc transpose data=temp1 out=temp2 (drop=_name_) prefix=dx_;
by id age weight;
var dx;
run;

proc transpose data=temp1 out=temp3 (drop=_name_) prefix=test_;
by id age weight;
var test;
run;

data temp4;
set temp2;
set temp3;
by id age weight;
run;

ods select nlevels;
proc freq data=temp4 nlevels;
ods output nlevels=missing_var1;
run;

data missing_var(keep=tablevar);
set missing_var1;
if nnonmisslevels=0;
run;

proc sql;
select tablevar into : var separated by ' ' from missing_var;
quit;
%put &var.;

data final(drop=&var.);
set temp4;
run;

But, these are some open ends in your data like one is are we sure that the blanks in row 2 and 3 belongs to 1. I wrote the code with the assumption of row 2 and 3 belongs to row 1.

ballardw · Posted 10-31-2024 02:29 PM

You want to think very carefully about making a wide data set as you propose. What do you expect to do with it.

For almost any modeling, analysis or reporting SAS will work much better with a single observation/variable. Which would mean that bringing the previous values to the blank values would make more sense, especially for the identification variables.

Can you describe the sort of analysis you expect to do with the resulting data set?

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Example: (I took some liberties with spelling to simplify code for the character variables).

data have;
  input ID  age  dx :$15. weight test :$15.;
datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

An example of a likely better data set for most activities:

data moreuseful;
   set have;
   retain last_id Last_age Last_weight;
   if id then last_id=id;
   if age then last_age=age;
   if weight then last_weight=weight;
   if (missing(id)) then id=last_id;
   if (missing(age)) then age=last_age;
   if (missing(weight)) then weight=last_weight;
   drop last_id Last_age Last_weight;
run;

Just one example of why the "wide" data is hard to work with: Pick a DX value. How do you select the records with just that one for analysis? If you have 15 (or more) DX variables you have to examine every single one of them. Then spend some mildly obnoxious coding to find the associated test values.

data_null__ · Posted 10-31-2024 03:22 PM

To me the real problem is you don't have ID on all obs for an ID. I think the first thing you should do is get the ID and demographic information and all records. Then you have a data set that you can work with more easily.

data have;
   retain dummy 0;
   input ID  age  dx :$15. weight test :$15.;
   datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

data want1(drop=dummy);
   update have(obs=0 keep=dummy) have(keep=dummy id);
   by dummy;
   set have(drop=dummy id);
   output;
   run;
proc print;
   run;
data want2;
   update want1(obs=0 keep=ID) want1(keep=id age weight);
   by id;
   if first.id then r=0;
   r+1;
   set want1(keep=dx test);
   output;
   run;
proc print;
   run;

proc transpose data=want2 out=want3;
   by id age weight r;
   var dx test;
   run;
proc sort data=want3;
   by id _name_ r;
   run;
proc transpose data=want3 out=want4(drop=_name_) delim=_;
   by id age weight;
   var col1;
   id _name_ r;
   run;
proc print;
   run;

This example uses the UPDATE statement to carry forward non-missing values ID, and then again to update age and weight.

Ksharp · Posted 10-31-2024 10:05 PM

data have;
  input ID  age  dx :$15. weight test :$15.;
datalines;
1 29 covid    140 covid
. .  flu      . blood_pressure
. .  diabetes . .
2 33 high_bp 165 covid 
. .  gest_diab . blood_pressure 
. .  preeclamp . .
;

data temp;
 set have;
 retain _id _age;
 if not missing(id) then _id=id;
 if not missing(age) then _age=age;
 drop id age;
run;
proc sql noprint;
select max(n) into :n
 from (select count(*) as n from temp group by _id,_age);
quit;
proc summary data=temp nway;
class _id _age;
output out=want idgroup(out[&n.] (dx weight test)=);
run;

Transposing Data with Blank Rows and Repeating Instances