BookmarkSubscribeRSS Feed

Manipulating Data in Base SAS® Part 6 – Transpose

Started ‎11-06-2024 by
Modified ‎11-19-2024 by
Views 2,991

Jordan_Mark_01_02.jpg Transposing data is useful when your data structure conflicts with your analysis needs. For example, transposing time series data can make trend analysis and forecasting easier, and customer transactional data is often transposed to enhance customer segmentation and targeting processes. In this article, I’ll discuss the “tools of the trade” for SAS programmers. Specifically, we’ll look at PROC TRANSPOSE and DATA step with arrays and DO Loops.

 

Simple Transpositions

Introduction

 

In this section, I’ll be working with two simple tables, work.narrow and work.wide:

 

Wide
Obs ID Value1 Value2 Value3 Value4 Value5
1 1 1A 1B 1C 1D 1E
2 2 2A 2B 2C 2D 2E
3 3 3A 3B 3C 3D 3E

 

Narrow
Obs ID Serial Value
1 1 1 1A
2 2 1 2A
3 3 1 3A
4 1 2 1B
5 2 2 2B
6 3 2 3B
7 1 3 1C
8 2 3 2C
9 3 3 3C
10 1 4 1D
11 2 4 2D
12 3 4 3D
13 1 5 1E
14 2 5 2E
15 3 5 3E

 

We’ll first perform a narrow to wide transposition of work.narrow to produce a result like work.wide, then a wide to narrow transposition of work.wide to produce a result like work.narrow.

 

PROC TRANSPOSE – Narrow to Wide

 

I’ll start with the simplest PROC TRANSPOSE code possible and have a look at the result.

 

proc transpose data=narrow 
                out=n2w_default;
run;

 

                                                 Narrow to Wide - Default
Obs _NAME_ COL1 COL2 ... COL12 COL13
1 ID 1 2 ... 2 3
2 Serial 1 1 ... 5 5

 

By default, PROC TRANSPOSE transposed the numeric columns only; character columns were discarded. I want the data from the Value column transposed, using ID as the row identifier. A BY statement will do the trick, but the data must be sorted first.

 

proc sort data=narrow out=narrow_sort;
	by id Serial;
run;

proc transpose data=narrow_sort 
                out=n2w_default;
	BY ID;
run;

 

Narrow to Wide - by ID
Obs ID _NAME_ COL1 COL2 COL3 COL4 COL5
1 1 Serial 1 2 3 4 5
2 2 Serial 1 2 3 5  
3 3 Serial 1 2 3 5  

 

The BY statement gets the correct grouping, but the values are still based on the Serial column instead of the Value column. I’ll tell PROC TRANSPOSE to use the Value column by inserting a VAR statement:

 

proc transpose data=narrow_sort 
                out=n2w_default;
	BY ID;
	var Value;
run;

 

Narrow to Wide - BY ID, VAR Value
Obs ID _NAME_ COL1 COL2 COL3 COL4 COL5
1 1 Value 1A 1B 1C 1D 1E
2 2 Value 2A 2D 2D 2E  
3 3 Value 3A 3B 3C 3E  

 

This is better, but in row 2, the data from ID2 Serial 3 is in COL2, and subsequent values are not aligned properly. In row 3, Value4 contains the value from Serial 5. I’d like meaningful, data-based column names. Instead of a _NAME_ column filled with ‘Value’, I’d like the column names containing the data to start with the prefix ‘Value’ with a numeric suffix corresponding to the Serial column values in the original dataset, and the data populated into the appropriate cell.  I’ll add an ID statement to specify the variable containing the variable naming data, and the PREFIX= option to specify the names should start with Value. I can drop the _NAME_ column, as I don’t need it any longer:

 

proc transpose data=narrow_sort 
               out=n2w_final (drop=_Name_)
               prefix=Value;
   BY ID;
   var Value; 
   ID Serial;
run;

 

And violá - very nice output! And that code wasn’t too hard to write.

 

Narrow to Wide - Final
Obs ID Value1 Value2 Value3 Value4 Value5
1 1 1A 1B 1C 1D 1E
2 2 2A   2C 2D 2E
3 3 3A 3B 3C   3E

 

PROC TRANSPOSE – Wide to Narrow

 

Once again, I’ll start by using the simplest code possible:

 

proc transpose data=Wide
               out=Narrow_Default(rename=(COL1=Value _NAME_=Ser));
    by ID;
    var VALUE1-VALUE5;
run;

 

The result is almost what I want.

 

Wide to Narrow Default
Obs ID Ser Value
1 1 Value1 1A
2 1 Value2 1B
3 1 Value3 1C
4 1 Value4 1D
5 1 Value5 1E
6 2 Value1 2A
7 2 Value2  
8 2 Value3 2C
9 2 Value4 2D
10 2 Value5 2E
11 3 Value1 3A
12 3 Value2 3B
13 3 Value3 3C
14 3 Value4  
15 3 Value5 3E

 

But there are no options or statements in PROC TRANSPOSE that allow extracting the number from the Ser column text to serve as the numeric value for a Serial column. I’ll also want to get rid of any rows where Value is missing. The simplest solution is to use a DATA step to post-process the result:

 

data Narrow_Final;
	length ID Serial 8;
	set Narrow_Default;
	Serial=input(compress(ser,,'kd'),10.);
	If not missing(Value);
	drop Ser;
run;

 

And that did it:

 

Wide to Narrow – Final
Obs ID Serial Value
1 1 1 1A
2 1 2 1B
3 1 3 1C
4 1 4 1D
5 1 5 1E
6 2 1 2A
7 2 3 2C
8 2 4 2D
9 2 5 2E
10 3 1 3A
11 3 2 3B
12 3 3 3C
13 3 5 3E

 

DATA Step Transposition

 

Well, if I have to use a DATA step in some cases, why not just do the whole thing in a DATA step? That’s also a reasonable approach, and if you have other data manipulation you want to accomplish at the same time as the transposition, I’d even suggest it’s the best approach. Here’s the code for the narrow to wide transposition:

 

proc sort data=narrow out=narrow_sort;
	by id Serial;
run;

data n2w_datastep;
	set narrow_sort;
	by id serial;
	/* Array of new variables to accept values */
	array val[*] $2 Value1-Value5;
	/* Multiple input rows for one output row - need to retain values */
	retain Value1-Value5;
	if first.id then do;
		/* Clean up before starting work on a new set of rows */
		call missing(of val[*]);
	end;
	/* Assign value to correct array element */
	val[Serial]=Value;
	if last.id then do;
		/* This set finished - write output */
		output;
	end;
run;

 

Narrow to Wide – DATA Step
Obs ID Value1 Value2 Value3 Value4 Value5
1 1 1A 1B 1C 1D 1E
2 2 2A   2C 2D 2E
3 3 3A 3B 3C   3E

 

And here’s the code for the wide to narrow transposition:

 

data w2n_datastep;
	keep ID Serial Value;
	set wide;
	/* Array of incoming value variables */
	array val[*] $2 Value1-Value5;
	/* These two plus ID will make up a row in the output */
	length Serial 8 Value $2;
	/* For each array element value */
	do Serial=1 to dim(val);
		/* Grab the value and write output row */
		Value=Val[Serial];
		if not missing(Value) then output;
	end;
run;

 

Wide to Narrow – DATA Step
Obs ID Serial Value
1 1 1 1A
2 1 2 1B
3 1 3 1C
4 1 4 1D
5 1 5 1E
6 2 1 2A
7 2 3 2C
8 2 4 2D
9 2 5 2E
10 3 1 3A
11 3 2 3B
12 3 3 3C
13 3 5 3E

 

I’ll admit that, on the surface, this code seems a bit more complex than the equivalent PROC TRANSPOSE code, but this approach is infinitely more flexible. Because I’m accustomed to using DATA step arrays in my work, this code actually feels less complex to me, and as a bonus it doesn’t require subsequent re-processing like the PROC TRANSPOSE solution did – everything got done in a single pass through the data.

 

Complex Transposition

 

Actual data is rarely as simple as the previous example. For instance, let’s consider a clinical trial data set. The data contains health information recorded during monthly clinic visits for a five year period:

 

Trial.Visits
Obs VisitDate ID CHF Weight Diastolic Systolic SmokingAvg Cholesterol
1 2019-01-01 F0011 N 67.5 178 88 150 14 224
2 2019-01-01 F0012 N 66.0 122 75 132 . 149
3 2019-01-01 F0027 N 61.5 138 93 165 . 264
4 2019-01-01 F0064 N 65.5 146 65 134 11 185
5 2019-01-01 F0084 N 65.0 163 80 126 . 321

 

PROC TRANSPOSE

 

First, the default transposition by ID, using VisitDate for column names:

 

proc transpose data=work.VisitsSorted
                out=work.proc_first_try 
                prefix=_;
	BY ID;
	ID VisitDate;
	format VisitDate yymon7.;
run;

 

PROC TRANSPOSE – First Try
 
Row ID _NAME_ _2019JAN _2019FEB _2019MAR   _2024OCT _2024NOV _2024DEC
1 F0001 Height 62.25 62.25 62.25 ... 62.25 62.25 62.25
2 F0001 Weight 134 126 130 ... 142 142 131
3 F0001 Diastolic 92 91 83 ... 79 97 93
4 F0001 Systolic 159 180 174 ... 169 174 178
5 F0001 SmokingAvg 12 11 10 ... 11 10 13
6 F0001 Cholesterol 223 227 235 ... 278 288 282
7 F0002 Height 65.75 65.75 65.75 ... 65.75 65.75 65.75
8 F0002 Weight 154 168 165 ... 145 163 149

 

Each row has 74 variables. To get all of the data for a particular ID on a single row, you’d need unique names for the columns from each row. I’ll use the value in the _NAME_ column as a prefix with the VisitDate as the suffix to create the column names. This example transposes the Height variable:

 

proc transpose data=trial.VisitsSmall 
                out=work.DSN_Height 
                prefix=Height;
	BY ID;
	ID VisitDate;
	Var Height;
	format VisitDate yymon7.;
run;

 

PROC TRANSPOSE – Height
Row ID _NAME_ Height2019JAN Height2019FEB   Height2024NOV Height2024DEC
1 F0001 Height 62.25 62.25 ... 62.25 62.25
2 F0002 Height 65.75 65.75 ... 65.75 65.75
3 F0003 Height 61.75 61.75 ... 61.75 61.75
4 F0004 Height 64.75 64.75 ... 64.75 64.75
5 F0005 Height 64.5 64.5 ... 64.5 64.5
6 F0006 Height 67.75 67.75 ... 67.75 67.75
7 F0007 Height 66.25 66.25 ... 66.25 66.25
8 F0008 Height 64 64 ... 64 64

 

To make this work, I’ll run a separate PROC TRANSPOSE step for each of the variables. After producing a data set for each measure variable, I’ll merge them by ID to produce a single row containing all of the measurement values. Because it’s a repetitive task, I’ll use a macro - the code for is in the series download package in a program file named Complex Transposition – PROC Transpose.sas. The final dataset contains 3118 rows and 433 columns. Here is a sample of the data:

 

Row ID Cholesterol2019JAN Cholesterol2019FEB   Weight2024NOV Weight2024DEC
1 F0001 223 227 ... 142 131
2 F0002 262 268 ... 163 149
3 F0003 196 197 ... 136 135
4 F0004 198 199 ... 142 137
5 F0005 199 198 ... 134 134
6 F0006 202 202 ... 173 149
7 F0007 193 192 ... 145 137
8 F0008 209 206 ... 144 140

 

This entire process, including all of the PROC TANSPOSE steps and the DATA step to merge the results, ran in about 2.5 seconds.

 

DATA Step Transposition

 

The DATA step code for this gets a bit tricky because so many variables are involved. To avoid having to type over 400 new variable names by hand and debug the inevitable typos, I chose to use PROC SQL to do a lot of the typing for me and put the result into macro variables.

proc sql noprint;
/* Create a list of all the new numeric variable names */
select distinct cats(Name,Date) 
	into :Varlist separated by ' '
	from dictionary.Columns as c
   ,(select distinct PUT(VisitDate,yymon7.) as Date
        from trial.Visits) as b
	where libname='TRIAL' and MEMNAME='VISITS'
	and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE- VARLIST=%superq(VARLIST);*/

/* Create a series of ARRAY statements for each measurement array */
select distinct catx(' ','array', cats(substr(Name,1,4),'[*]'),cats(substr(Name,1,4),':;'))
	into :Arrays separated by ' '
	from dictionary.Columns as c
	where libname='TRIAL' and MEMNAME='VISITS'
	and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE: Arrays=%superq(arrays);*/

/* Create a list of distinct YYYYMM values we will use to index the arrays */
select distinct PUT(VisitDate,yymon7.) 
   into :DateList separated by ' '
   from trial.Visits
;
/*%put NOTE- DateList=%superq(datelist);*/
quit;

Next, I used a DATA step to transpose the data:

 

data n2w_datastep;
   /* Establish the PDV in the order we desire */
	length ID $8 &VARLIST 8;
	/* Retain values for all new variables */
	retain &VARLIST;
	/* Set up an array for each measurement */
	&arrays;
	set work.VisitsSorted;
	by id VisitDate;
	if first.id then do;
		/* Clean up before starting work on a new set of rows */
		call missing(%qsysfunc(translate(&varlist,',',' ')));
	end;
	/* The date of this record determines the array element to be loaded */
	element=findw("&DateList",PUT(VisitDate,yymon7.), ' ', 'E');
	/* Assign values to correct array elements */
	Heig[element]=Height;
	Weig[element]=Weight;
	Dias[element]=Diastolic;
	Syst[element]=Systolic;
	Smok[element]=SmokingAvg;
	Chol[element]=Cholesterol;
	if last.id then do;
		/* Finished with this ID - write output */
		output;
	end;
	drop element CHF--Cholesterol;
run;

 

This worked perfectly. The PROC SQL step ran in 0.25 seconds,  and the DATA step in 0.2, for an overall time of less than half a second - much faster than the PROC TRANSPOSE process, if just a bit trickier to code. The nice thing about the DATA step is that I could have done additional processing, like computing overall statistics once the values for an ID are completely loaded.

 

The Pivot Point: Choosing a Method

 

I’ve used the DATA step to transpose data for years, mostly when other data manipulation was required during transposition. And for sheer speed and efficiency, the DATA step is hard to beat. But for simple, one-column transpositions PROC TRANSPOSE may be the easier choice.

 

Which technique for transposing data is your current “go to”? Has the code show here given you any new ideas?

 

Jordan_Mark_01_02.jpg Until next time, may the SAS be with you!

Mark  


Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

 

Links to prior posts in this series:

 

 

Find more articles from SAS Global Enablement and Learning here.

Comments

MERGE skill is also a powerful and flexible tool to transpose data. I wrote a article about it before for somebody who need and interest it .

https://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

@Ksharp - man, that's a great paper, and a very nice technique! Thanks for sharing 😊

 

Very helpful article!  The DOW-Loop also comes in handy for narrow-to-wide and complex transpositions.  I presented a paper showing some examples at PharmaSUG a few years back- "Getting Your Data in Shape with PROC TRANSPOSE" (https://www.lexjansen.com/pharmasug/2012/TF/PharmaSUG-2012-TF10.pdf)

An example of the pitfalls, and solutions, that await attempting to transpose character and numeric values in one pass might be appropriate.

Version history
Last update:
‎11-19-2024 12:38 PM
Updated by:
Contributors

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags