Transposing data is useful when your data structure conflicts with your analysis needs. For example, transposing time series data can make trend analysis and forecasting easier, and customer transactional data is often transposed to enhance customer segmentation and targeting processes. In this article, I’ll discuss the “tools of the trade” for SAS programmers. Specifically, we’ll look at PROC TRANSPOSE and DATA step with arrays and DO Loops.
In this section, I’ll be working with two simple tables, work.narrow and work.wide:
Wide | ||||||
Obs | ID | Value1 | Value2 | Value3 | Value4 | Value5 |
1 | 1 | 1A | 1B | 1C | 1D | 1E |
2 | 2 | 2A | 2B | 2C | 2D | 2E |
3 | 3 | 3A | 3B | 3C | 3D | 3E |
Narrow | |||
Obs | ID | Serial | Value |
1 | 1 | 1 | 1A |
2 | 2 | 1 | 2A |
3 | 3 | 1 | 3A |
4 | 1 | 2 | 1B |
5 | 2 | 2 | 2B |
6 | 3 | 2 | 3B |
7 | 1 | 3 | 1C |
8 | 2 | 3 | 2C |
9 | 3 | 3 | 3C |
10 | 1 | 4 | 1D |
11 | 2 | 4 | 2D |
12 | 3 | 4 | 3D |
13 | 1 | 5 | 1E |
14 | 2 | 5 | 2E |
15 | 3 | 5 | 3E |
We’ll first perform a narrow to wide transposition of work.narrow to produce a result like work.wide, then a wide to narrow transposition of work.wide to produce a result like work.narrow.
I’ll start with the simplest PROC TRANSPOSE code possible and have a look at the result.
proc transpose data=narrow
out=n2w_default;
run;
Narrow to Wide - Default | ||||||||||||||
Obs | _NAME_ | COL1 | COL2 | ... | COL12 | COL13 | ||||||||
1 | ID | 1 | 2 | ... | 2 | 3 | ||||||||
2 | Serial | 1 | 1 | ... | 5 | 5 |
By default, PROC TRANSPOSE transposed the numeric columns only; character columns were discarded. I want the data from the Value column transposed, using ID as the row identifier. A BY statement will do the trick, but the data must be sorted first.
proc sort data=narrow out=narrow_sort;
by id Serial;
run;
proc transpose data=narrow_sort
out=n2w_default;
BY ID;
run;
Narrow to Wide - by ID | |||||||
Obs | ID | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 |
1 | 1 | Serial | 1 | 2 | 3 | 4 | 5 |
2 | 2 | Serial | 1 | 2 | 3 | 5 | |
3 | 3 | Serial | 1 | 2 | 3 | 5 |
The BY statement gets the correct grouping, but the values are still based on the Serial column instead of the Value column. I’ll tell PROC TRANSPOSE to use the Value column by inserting a VAR statement:
proc transpose data=narrow_sort
out=n2w_default;
BY ID;
var Value;
run;
Narrow to Wide - BY ID, VAR Value | |||||||
Obs | ID | _NAME_ | COL1 | COL2 | COL3 | COL4 | COL5 |
1 | 1 | Value | 1A | 1B | 1C | 1D | 1E |
2 | 2 | Value | 2A | 2D | 2D | 2E | |
3 | 3 | Value | 3A | 3B | 3C | 3E |
This is better, but in row 2, the data from ID2 Serial 3 is in COL2, and subsequent values are not aligned properly. In row 3, Value4 contains the value from Serial 5. I’d like meaningful, data-based column names. Instead of a _NAME_ column filled with ‘Value’, I’d like the column names containing the data to start with the prefix ‘Value’ with a numeric suffix corresponding to the Serial column values in the original dataset, and the data populated into the appropriate cell. I’ll add an ID statement to specify the variable containing the variable naming data, and the PREFIX= option to specify the names should start with Value. I can drop the _NAME_ column, as I don’t need it any longer:
proc transpose data=narrow_sort
out=n2w_final (drop=_Name_)
prefix=Value;
BY ID;
var Value;
ID Serial;
run;
And violá - very nice output! And that code wasn’t too hard to write.
Narrow to Wide - Final | ||||||
Obs | ID | Value1 | Value2 | Value3 | Value4 | Value5 |
1 | 1 | 1A | 1B | 1C | 1D | 1E |
2 | 2 | 2A | 2C | 2D | 2E | |
3 | 3 | 3A | 3B | 3C | 3E |
Once again, I’ll start by using the simplest code possible:
proc transpose data=Wide
out=Narrow_Default(rename=(COL1=Value _NAME_=Ser));
by ID;
var VALUE1-VALUE5;
run;
The result is almost what I want.
Wide to Narrow Default | |||
Obs | ID | Ser | Value |
1 | 1 | Value1 | 1A |
2 | 1 | Value2 | 1B |
3 | 1 | Value3 | 1C |
4 | 1 | Value4 | 1D |
5 | 1 | Value5 | 1E |
6 | 2 | Value1 | 2A |
7 | 2 | Value2 | |
8 | 2 | Value3 | 2C |
9 | 2 | Value4 | 2D |
10 | 2 | Value5 | 2E |
11 | 3 | Value1 | 3A |
12 | 3 | Value2 | 3B |
13 | 3 | Value3 | 3C |
14 | 3 | Value4 | |
15 | 3 | Value5 | 3E |
But there are no options or statements in PROC TRANSPOSE that allow extracting the number from the Ser column text to serve as the numeric value for a Serial column. I’ll also want to get rid of any rows where Value is missing. The simplest solution is to use a DATA step to post-process the result:
data Narrow_Final;
length ID Serial 8;
set Narrow_Default;
Serial=input(compress(ser,,'kd'),10.);
If not missing(Value);
drop Ser;
run;
And that did it:
Wide to Narrow – Final | |||
Obs | ID | Serial | Value |
1 | 1 | 1 | 1A |
2 | 1 | 2 | 1B |
3 | 1 | 3 | 1C |
4 | 1 | 4 | 1D |
5 | 1 | 5 | 1E |
6 | 2 | 1 | 2A |
7 | 2 | 3 | 2C |
8 | 2 | 4 | 2D |
9 | 2 | 5 | 2E |
10 | 3 | 1 | 3A |
11 | 3 | 2 | 3B |
12 | 3 | 3 | 3C |
13 | 3 | 5 | 3E |
Well, if I have to use a DATA step in some cases, why not just do the whole thing in a DATA step? That’s also a reasonable approach, and if you have other data manipulation you want to accomplish at the same time as the transposition, I’d even suggest it’s the best approach. Here’s the code for the narrow to wide transposition:
proc sort data=narrow out=narrow_sort;
by id Serial;
run;
data n2w_datastep;
set narrow_sort;
by id serial;
/* Array of new variables to accept values */
array val[*] $2 Value1-Value5;
/* Multiple input rows for one output row - need to retain values */
retain Value1-Value5;
if first.id then do;
/* Clean up before starting work on a new set of rows */
call missing(of val[*]);
end;
/* Assign value to correct array element */
val[Serial]=Value;
if last.id then do;
/* This set finished - write output */
output;
end;
run;
Narrow to Wide – DATA Step | ||||||
Obs | ID | Value1 | Value2 | Value3 | Value4 | Value5 |
1 | 1 | 1A | 1B | 1C | 1D | 1E |
2 | 2 | 2A | 2C | 2D | 2E | |
3 | 3 | 3A | 3B | 3C | 3E |
And here’s the code for the wide to narrow transposition:
data w2n_datastep;
keep ID Serial Value;
set wide;
/* Array of incoming value variables */
array val[*] $2 Value1-Value5;
/* These two plus ID will make up a row in the output */
length Serial 8 Value $2;
/* For each array element value */
do Serial=1 to dim(val);
/* Grab the value and write output row */
Value=Val[Serial];
if not missing(Value) then output;
end;
run;
Wide to Narrow – DATA Step | |||
Obs | ID | Serial | Value |
1 | 1 | 1 | 1A |
2 | 1 | 2 | 1B |
3 | 1 | 3 | 1C |
4 | 1 | 4 | 1D |
5 | 1 | 5 | 1E |
6 | 2 | 1 | 2A |
7 | 2 | 3 | 2C |
8 | 2 | 4 | 2D |
9 | 2 | 5 | 2E |
10 | 3 | 1 | 3A |
11 | 3 | 2 | 3B |
12 | 3 | 3 | 3C |
13 | 3 | 5 | 3E |
I’ll admit that, on the surface, this code seems a bit more complex than the equivalent PROC TRANSPOSE code, but this approach is infinitely more flexible. Because I’m accustomed to using DATA step arrays in my work, this code actually feels less complex to me, and as a bonus it doesn’t require subsequent re-processing like the PROC TRANSPOSE solution did – everything got done in a single pass through the data.
Actual data is rarely as simple as the previous example. For instance, let’s consider a clinical trial data set. The data contains health information recorded during monthly clinic visits for a five year period:
Trial.Visits | |||||||||
Obs | VisitDate | ID | CHF | Weight | Diastolic | Systolic | SmokingAvg | Cholesterol | |
1 | 2019-01-01 | F0011 | N | 67.5 | 178 | 88 | 150 | 14 | 224 |
2 | 2019-01-01 | F0012 | N | 66.0 | 122 | 75 | 132 | . | 149 |
3 | 2019-01-01 | F0027 | N | 61.5 | 138 | 93 | 165 | . | 264 |
4 | 2019-01-01 | F0064 | N | 65.5 | 146 | 65 | 134 | 11 | 185 |
5 | 2019-01-01 | F0084 | N | 65.0 | 163 | 80 | 126 | . | 321 |
First, the default transposition by ID, using VisitDate for column names:
proc transpose data=work.VisitsSorted
out=work.proc_first_try
prefix=_;
BY ID;
ID VisitDate;
format VisitDate yymon7.;
run;
PROC TRANSPOSE – First Try | |||||||||
Row | ID | _NAME_ | _2019JAN | _2019FEB | _2019MAR | _2024OCT | _2024NOV | _2024DEC | |
1 | F0001 | Height | 62.25 | 62.25 | 62.25 | ... | 62.25 | 62.25 | 62.25 |
2 | F0001 | Weight | 134 | 126 | 130 | ... | 142 | 142 | 131 |
3 | F0001 | Diastolic | 92 | 91 | 83 | ... | 79 | 97 | 93 |
4 | F0001 | Systolic | 159 | 180 | 174 | ... | 169 | 174 | 178 |
5 | F0001 | SmokingAvg | 12 | 11 | 10 | ... | 11 | 10 | 13 |
6 | F0001 | Cholesterol | 223 | 227 | 235 | ... | 278 | 288 | 282 |
7 | F0002 | Height | 65.75 | 65.75 | 65.75 | ... | 65.75 | 65.75 | 65.75 |
8 | F0002 | Weight | 154 | 168 | 165 | ... | 145 | 163 | 149 |
Each row has 74 variables. To get all of the data for a particular ID on a single row, you’d need unique names for the columns from each row. I’ll use the value in the _NAME_ column as a prefix with the VisitDate as the suffix to create the column names. This example transposes the Height variable:
proc transpose data=trial.VisitsSmall
out=work.DSN_Height
prefix=Height;
BY ID;
ID VisitDate;
Var Height;
format VisitDate yymon7.;
run;
PROC TRANSPOSE – Height | |||||||||
Row | ID | _NAME_ | Height2019JAN | Height2019FEB | Height2024NOV | Height2024DEC | |||
1 | F0001 | Height | 62.25 | 62.25 | ... | 62.25 | 62.25 | ||
2 | F0002 | Height | 65.75 | 65.75 | ... | 65.75 | 65.75 | ||
3 | F0003 | Height | 61.75 | 61.75 | ... | 61.75 | 61.75 | ||
4 | F0004 | Height | 64.75 | 64.75 | ... | 64.75 | 64.75 | ||
5 | F0005 | Height | 64.5 | 64.5 | ... | 64.5 | 64.5 | ||
6 | F0006 | Height | 67.75 | 67.75 | ... | 67.75 | 67.75 | ||
7 | F0007 | Height | 66.25 | 66.25 | ... | 66.25 | 66.25 | ||
8 | F0008 | Height | 64 | 64 | ... | 64 | 64 |
To make this work, I’ll run a separate PROC TRANSPOSE step for each of the variables. After producing a data set for each measure variable, I’ll merge them by ID to produce a single row containing all of the measurement values. Because it’s a repetitive task, I’ll use a macro - the code for is in the series download package in a program file named Complex Transposition – PROC Transpose.sas. The final dataset contains 3118 rows and 433 columns. Here is a sample of the data:
Row | ID | Cholesterol2019JAN | Cholesterol2019FEB | Weight2024NOV | Weight2024DEC | |
1 | F0001 | 223 | 227 | ... | 142 | 131 |
2 | F0002 | 262 | 268 | ... | 163 | 149 |
3 | F0003 | 196 | 197 | ... | 136 | 135 |
4 | F0004 | 198 | 199 | ... | 142 | 137 |
5 | F0005 | 199 | 198 | ... | 134 | 134 |
6 | F0006 | 202 | 202 | ... | 173 | 149 |
7 | F0007 | 193 | 192 | ... | 145 | 137 |
8 | F0008 | 209 | 206 | ... | 144 | 140 |
This entire process, including all of the PROC TANSPOSE steps and the DATA step to merge the results, ran in about 2.5 seconds.
The DATA step code for this gets a bit tricky because so many variables are involved. To avoid having to type over 400 new variable names by hand and debug the inevitable typos, I chose to use PROC SQL to do a lot of the typing for me and put the result into macro variables.
proc sql noprint;
/* Create a list of all the new numeric variable names */
select distinct cats(Name,Date)
into :Varlist separated by ' '
from dictionary.Columns as c
,(select distinct PUT(VisitDate,yymon7.) as Date
from trial.Visits) as b
where libname='TRIAL' and MEMNAME='VISITS'
and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE- VARLIST=%superq(VARLIST);*/
/* Create a series of ARRAY statements for each measurement array */
select distinct catx(' ','array', cats(substr(Name,1,4),'[*]'),cats(substr(Name,1,4),':;'))
into :Arrays separated by ' '
from dictionary.Columns as c
where libname='TRIAL' and MEMNAME='VISITS'
and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE: Arrays=%superq(arrays);*/
/* Create a list of distinct YYYYMM values we will use to index the arrays */
select distinct PUT(VisitDate,yymon7.)
into :DateList separated by ' '
from trial.Visits
;
/*%put NOTE- DateList=%superq(datelist);*/
quit;
Next, I used a DATA step to transpose the data:
data n2w_datastep;
/* Establish the PDV in the order we desire */
length ID $8 &VARLIST 8;
/* Retain values for all new variables */
retain &VARLIST;
/* Set up an array for each measurement */
&arrays;
set work.VisitsSorted;
by id VisitDate;
if first.id then do;
/* Clean up before starting work on a new set of rows */
call missing(%qsysfunc(translate(&varlist,',',' ')));
end;
/* The date of this record determines the array element to be loaded */
element=findw("&DateList",PUT(VisitDate,yymon7.), ' ', 'E');
/* Assign values to correct array elements */
Heig[element]=Height;
Weig[element]=Weight;
Dias[element]=Diastolic;
Syst[element]=Systolic;
Smok[element]=SmokingAvg;
Chol[element]=Cholesterol;
if last.id then do;
/* Finished with this ID - write output */
output;
end;
drop element CHF--Cholesterol;
run;
This worked perfectly. The PROC SQL step ran in 0.25 seconds, and the DATA step in 0.2, for an overall time of less than half a second - much faster than the PROC TRANSPOSE process, if just a bit trickier to code. The nice thing about the DATA step is that I could have done additional processing, like computing overall statistics once the values for an ID are completely loaded.
I’ve used the DATA step to transpose data for years, mostly when other data manipulation was required during transposition. And for sheer speed and efficiency, the DATA step is hard to beat. But for simple, one-column transpositions PROC TRANSPOSE may be the easier choice.
Which technique for transposing data is your current “go to”? Has the code show here given you any new ideas?
Until next time, may the SAS be with you!
Mark
Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip
Links to prior posts in this series:
Find more articles from SAS Global Enablement and Learning here.
MERGE skill is also a powerful and flexible tool to transpose data. I wrote a article about it before for somebody who need and interest it .
https://support.sas.com/resources/papers/proceedings15/2785-2015.pdf
@Ksharp - man, that's a great paper, and a very nice technique! Thanks for sharing 😊
Very helpful article! The DOW-Loop also comes in handy for narrow-to-wide and complex transpositions. I presented a paper showing some examples at PharmaSUG a few years back- "Getting Your Data in Shape with PROC TRANSPOSE" (https://www.lexjansen.com/pharmasug/2012/TF/PharmaSUG-2012-TF10.pdf)
An example of the pitfalls, and solutions, that await attempting to transpose character and numeric values in one pass might be appropriate.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.