Transposing data is useful when your data structure conflicts with your analysis needs. For example, transposing time series data can make trend analysis and forecasting easier, and customer transactional data is often transposed to enhance customer segmentation and targeting processes. In this article, I’ll discuss the “tools of the trade” for SAS programmers. Specifically, we’ll look at PROC TRANSPOSE and DATA step with arrays and DO Loops.
Simple Transpositions
Introduction
In this section, I’ll be working with two simple tables, work.narrow and work.wide:
Wide
Obs
ID
Value1
Value2
Value3
Value4
Value5
1
1
1A
1B
1C
1D
1E
2
2
2A
2B
2C
2D
2E
3
3
3A
3B
3C
3D
3E
Narrow
Obs
ID
Serial
Value
1
1
1
1A
2
2
1
2A
3
3
1
3A
4
1
2
1B
5
2
2
2B
6
3
2
3B
7
1
3
1C
8
2
3
2C
9
3
3
3C
10
1
4
1D
11
2
4
2D
12
3
4
3D
13
1
5
1E
14
2
5
2E
15
3
5
3E
We’ll first perform a narrow to wide transposition of work.narrow to produce a result like work.wide, then a wide to narrow transposition of work.wide to produce a result like work.narrow.
PROC TRANSPOSE – Narrow to Wide
I’ll start with the simplest PROC TRANSPOSE code possible and have a look at the result.
proc transpose data=narrow
out=n2w_default;
run;
Narrow to Wide - Default
Obs
_NAME_
COL1
COL2
...
COL12
COL13
1
ID
1
2
...
2
3
2
Serial
1
1
...
5
5
By default, PROC TRANSPOSE transposed the numeric columns only; character columns were discarded. I want the data from the Value column transposed, using ID as the row identifier. A BY statement will do the trick, but the data must be sorted first.
proc sort data=narrow out=narrow_sort;
by id Serial;
run;
proc transpose data=narrow_sort
out=n2w_default;
BY ID;
run;
Narrow to Wide - by ID
Obs
ID
_NAME_
COL1
COL2
COL3
COL4
COL5
1
1
Serial
1
2
3
4
5
2
2
Serial
1
2
3
5
3
3
Serial
1
2
3
5
The BY statement gets the correct grouping, but the values are still based on the Serial column instead of the Value column. I’ll tell PROC TRANSPOSE to use the Value column by inserting a VAR statement:
proc transpose data=narrow_sort
out=n2w_default;
BY ID;
var Value;
run;
Narrow to Wide - BY ID, VAR Value
Obs
ID
_NAME_
COL1
COL2
COL3
COL4
COL5
1
1
Value
1A
1B
1C
1D
1E
2
2
Value
2A
2D
2D
2E
3
3
Value
3A
3B
3C
3E
This is better, but in row 2, the data from ID2 Serial 3 is in COL2, and subsequent values are not aligned properly. In row 3, Value4 contains the value from Serial 5. I’d like meaningful, data-based column names. Instead of a _NAME_ column filled with ‘Value’, I’d like the column names containing the data to start with the prefix ‘Value’ with a numeric suffix corresponding to the Serial column values in the original dataset, and the data populated into the appropriate cell. I’ll add an ID statement to specify the variable containing the variable naming data, and the PREFIX= option to specify the names should start with Value. I can drop the _NAME_ column, as I don’t need it any longer:
proc transpose data=narrow_sort
out=n2w_final (drop=_Name_)
prefix=Value;
BY ID;
var Value;
ID Serial;
run;
And violá - very nice output! And that code wasn’t too hard to write.
Narrow to Wide - Final
Obs
ID
Value1
Value2
Value3
Value4
Value5
1
1
1A
1B
1C
1D
1E
2
2
2A
2C
2D
2E
3
3
3A
3B
3C
3E
PROC TRANSPOSE – Wide to Narrow
Once again, I’ll start by using the simplest code possible:
proc transpose data=Wide
out=Narrow_Default(rename=(COL1=Value _NAME_=Ser));
by ID;
var VALUE1-VALUE5;
run;
The result is almost what I want.
Wide to Narrow Default
Obs
ID
Ser
Value
1
1
Value1
1A
2
1
Value2
1B
3
1
Value3
1C
4
1
Value4
1D
5
1
Value5
1E
6
2
Value1
2A
7
2
Value2
8
2
Value3
2C
9
2
Value4
2D
10
2
Value5
2E
11
3
Value1
3A
12
3
Value2
3B
13
3
Value3
3C
14
3
Value4
15
3
Value5
3E
But there are no options or statements in PROC TRANSPOSE that allow extracting the number from the Ser column text to serve as the numeric value for a Serial column. I’ll also want to get rid of any rows where Value is missing. The simplest solution is to use a DATA step to post-process the result:
data Narrow_Final;
length ID Serial 8;
set Narrow_Default;
Serial=input(compress(ser,,'kd'),10.);
If not missing(Value);
drop Ser;
run;
And that did it:
Wide to Narrow – Final
Obs
ID
Serial
Value
1
1
1
1A
2
1
2
1B
3
1
3
1C
4
1
4
1D
5
1
5
1E
6
2
1
2A
7
2
3
2C
8
2
4
2D
9
2
5
2E
10
3
1
3A
11
3
2
3B
12
3
3
3C
13
3
5
3E
DATA Step Transposition
Well, if I have to use a DATA step in some cases, why not just do the whole thing in a DATA step? That’s also a reasonable approach, and if you have other data manipulation you want to accomplish at the same time as the transposition, I’d even suggest it’s the best approach. Here’s the code for the narrow to wide transposition:
proc sort data=narrow out=narrow_sort;
by id Serial;
run;
data n2w_datastep;
set narrow_sort;
by id serial;
/* Array of new variables to accept values */
array val[*] $2 Value1-Value5;
/* Multiple input rows for one output row - need to retain values */
retain Value1-Value5;
if first.id then do;
/* Clean up before starting work on a new set of rows */
call missing(of val[*]);
end;
/* Assign value to correct array element */
val[Serial]=Value;
if last.id then do;
/* This set finished - write output */
output;
end;
run;
Narrow to Wide – DATA Step
Obs
ID
Value1
Value2
Value3
Value4
Value5
1
1
1A
1B
1C
1D
1E
2
2
2A
2C
2D
2E
3
3
3A
3B
3C
3E
And here’s the code for the wide to narrow transposition:
data w2n_datastep;
keep ID Serial Value;
set wide;
/* Array of incoming value variables */
array val[*] $2 Value1-Value5;
/* These two plus ID will make up a row in the output */
length Serial 8 Value $2;
/* For each array element value */
do Serial=1 to dim(val);
/* Grab the value and write output row */
Value=Val[Serial];
if not missing(Value) then output;
end;
run;
Wide to Narrow – DATA Step
Obs
ID
Serial
Value
1
1
1
1A
2
1
2
1B
3
1
3
1C
4
1
4
1D
5
1
5
1E
6
2
1
2A
7
2
3
2C
8
2
4
2D
9
2
5
2E
10
3
1
3A
11
3
2
3B
12
3
3
3C
13
3
5
3E
I’ll admit that, on the surface, this code seems a bit more complex than the equivalent PROC TRANSPOSE code, but this approach is infinitely more flexible. Because I’m accustomed to using DATA step arrays in my work, this code actually feels less complex to me, and as a bonus it doesn’t require subsequent re-processing like the PROC TRANSPOSE solution did – everything got done in a single pass through the data.
Complex Transposition
Actual data is rarely as simple as the previous example. For instance, let’s consider a clinical trial data set. The data contains health information recorded during monthly clinic visits for a five year period:
Trial.Visits
Obs
VisitDate
ID
CHF
Weight
Diastolic
Systolic
SmokingAvg
Cholesterol
1
2019-01-01
F0011
N
67.5
178
88
150
14
224
2
2019-01-01
F0012
N
66.0
122
75
132
.
149
3
2019-01-01
F0027
N
61.5
138
93
165
.
264
4
2019-01-01
F0064
N
65.5
146
65
134
11
185
5
2019-01-01
F0084
N
65.0
163
80
126
.
321
PROC TRANSPOSE
First, the default transposition by ID, using VisitDate for column names:
proc transpose data=work.VisitsSorted
out=work.proc_first_try
prefix=_;
BY ID;
ID VisitDate;
format VisitDate yymon7.;
run;
PROC TRANSPOSE – First Try
Row
ID
_NAME_
_2019JAN
_2019FEB
_2019MAR
_2024OCT
_2024NOV
_2024DEC
1
F0001
Height
62.25
62.25
62.25
...
62.25
62.25
62.25
2
F0001
Weight
134
126
130
...
142
142
131
3
F0001
Diastolic
92
91
83
...
79
97
93
4
F0001
Systolic
159
180
174
...
169
174
178
5
F0001
SmokingAvg
12
11
10
...
11
10
13
6
F0001
Cholesterol
223
227
235
...
278
288
282
7
F0002
Height
65.75
65.75
65.75
...
65.75
65.75
65.75
8
F0002
Weight
154
168
165
...
145
163
149
Each row has 74 variables. To get all of the data for a particular ID on a single row, you’d need unique names for the columns from each row. I’ll use the value in the _NAME_ column as a prefix with the VisitDate as the suffix to create the column names. This example transposes the Height variable:
proc transpose data=trial.VisitsSmall
out=work.DSN_Height
prefix=Height;
BY ID;
ID VisitDate;
Var Height;
format VisitDate yymon7.;
run;
PROC TRANSPOSE – Height
Row
ID
_NAME_
Height2019JAN
Height2019FEB
Height2024NOV
Height2024DEC
1
F0001
Height
62.25
62.25
...
62.25
62.25
2
F0002
Height
65.75
65.75
...
65.75
65.75
3
F0003
Height
61.75
61.75
...
61.75
61.75
4
F0004
Height
64.75
64.75
...
64.75
64.75
5
F0005
Height
64.5
64.5
...
64.5
64.5
6
F0006
Height
67.75
67.75
...
67.75
67.75
7
F0007
Height
66.25
66.25
...
66.25
66.25
8
F0008
Height
64
64
...
64
64
To make this work, I’ll run a separate PROC TRANSPOSE step for each of the variables. After producing a data set for each measure variable, I’ll merge them by ID to produce a single row containing all of the measurement values. Because it’s a repetitive task, I’ll use a macro - the code for is in the series download package in a program file named Complex Transposition – PROC Transpose.sas. The final dataset contains 3118 rows and 433 columns. Here is a sample of the data:
Row
ID
Cholesterol2019JAN
Cholesterol2019FEB
Weight2024NOV
Weight2024DEC
1
F0001
223
227
...
142
131
2
F0002
262
268
...
163
149
3
F0003
196
197
...
136
135
4
F0004
198
199
...
142
137
5
F0005
199
198
...
134
134
6
F0006
202
202
...
173
149
7
F0007
193
192
...
145
137
8
F0008
209
206
...
144
140
This entire process, including all of the PROC TANSPOSE steps and the DATA step to merge the results, ran in about 2.5 seconds.
DATA Step Transposition
The DATA step code for this gets a bit tricky because so many variables are involved. To avoid having to type over 400 new variable names by hand and debug the inevitable typos, I chose to use PROC SQL to do a lot of the typing for me and put the result into macro variables.
proc sql noprint;
/* Create a list of all the new numeric variable names */
select distinct cats(Name,Date)
into :Varlist separated by ' '
from dictionary.Columns as c
,(select distinct PUT(VisitDate,yymon7.) as Date
from trial.Visits) as b
where libname='TRIAL' and MEMNAME='VISITS'
and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE- VARLIST=%superq(VARLIST);*/
/* Create a series of ARRAY statements for each measurement array */
select distinct catx(' ','array', cats(substr(Name,1,4),'[*]'),cats(substr(Name,1,4),':;'))
into :Arrays separated by ' '
from dictionary.Columns as c
where libname='TRIAL' and MEMNAME='VISITS'
and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE: Arrays=%superq(arrays);*/
/* Create a list of distinct YYYYMM values we will use to index the arrays */
select distinct PUT(VisitDate,yymon7.)
into :DateList separated by ' '
from trial.Visits
;
/*%put NOTE- DateList=%superq(datelist);*/
quit;
Next, I used a DATA step to transpose the data:
data n2w_datastep;
/* Establish the PDV in the order we desire */
length ID $8 &VARLIST 8;
/* Retain values for all new variables */
retain &VARLIST;
/* Set up an array for each measurement */
&arrays;
set work.VisitsSorted;
by id VisitDate;
if first.id then do;
/* Clean up before starting work on a new set of rows */
call missing(%qsysfunc(translate(&varlist,',',' ')));
end;
/* The date of this record determines the array element to be loaded */
element=findw("&DateList",PUT(VisitDate,yymon7.), ' ', 'E');
/* Assign values to correct array elements */
Heig[element]=Height;
Weig[element]=Weight;
Dias[element]=Diastolic;
Syst[element]=Systolic;
Smok[element]=SmokingAvg;
Chol[element]=Cholesterol;
if last.id then do;
/* Finished with this ID - write output */
output;
end;
drop element CHF--Cholesterol;
run;
This worked perfectly. The PROC SQL step ran in 0.25 seconds, and the DATA step in 0.2, for an overall time of less than half a second - much faster than the PROC TRANSPOSE process, if just a bit trickier to code. The nice thing about the DATA step is that I could have done additional processing, like computing overall statistics once the values for an ID are completely loaded.
The Pivot Point: Choosing a Method
I’ve used the DATA step to transpose data for years, mostly when other data manipulation was required during transposition. And for sheer speed and efficiency, the DATA step is hard to beat. But for simple, one-column transpositions PROC TRANSPOSE may be the easier choice.
Which technique for transposing data is your current “go to”? Has the code show here given you any new ideas?
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Part 1 – Append
Part 2 – Sort
Part 3 – Deduplicate
Part 4 – Lookup
Part 5 – Aggregate
Find more articles from SAS Global Enablement and Learning here.
... View more