Manipulating Data in Base SAS® Part 6 – Transpose

2 Likes

Transposing data is useful when your data structure conflicts with your analysis needs. For example, transposing time series data can make trend analysis and forecasting easier, and customer transactional data is often transposed to enhance customer segmentation and targeting processes. In this article, I’ll discuss the “tools of the trade” for SAS programmers. Specifically, we’ll look at PROC TRANSPOSE and DATA step with arrays and DO Loops.

Simple Transpositions

Introduction

In this section, I’ll be working with two simple tables, work.narrow and work.wide:

Wide
Obs	ID	Value1	Value2	Value3	Value4	Value5
1	1	1A	1B	1C	1D	1E
2	2	2A	2B	2C	2D	2E
3	3	3A	3B	3C	3D	3E

Narrow
Obs	ID	Serial	Value
1	1	1	1A
2	2	1	2A
3	3	1	3A
4	1	2	1B
5	2	2	2B
6	3	2	3B
7	1	3	1C
8	2	3	2C
9	3	3	3C
10	1	4	1D
11	2	4	2D
12	3	4	3D
13	1	5	1E
14	2	5	2E
15	3	5	3E

We’ll first perform a narrow to wide transposition of work.narrow to produce a result like work.wide, then a wide to narrow transposition of work.wide to produce a result like work.narrow.

PROC TRANSPOSE – Narrow to Wide

I’ll start with the simplest PROC TRANSPOSE code possible and have a look at the result.

proc transpose data=narrow 
                out=n2w_default;
run;

Narrow to Wide - Default
Obs	_NAME_	COL1	COL2	...	COL12	COL13
1	ID	1	2	...	2	3
2	Serial	1	1	...	5	5

By default, PROC TRANSPOSE transposed the numeric columns only; character columns were discarded. I want the data from the Value column transposed, using ID as the row identifier. A BY statement will do the trick, but the data must be sorted first.

proc sort data=narrow out=narrow_sort;
	by id Serial;
run;

proc transpose data=narrow_sort 
                out=n2w_default;
	BY ID;
run;

Narrow to Wide - by ID
Obs	ID	_NAME_	COL1	COL2	COL3	COL4	COL5
1	1	Serial	1	2	3	4	5
2	2	Serial	1	2	3	5
3	3	Serial	1	2	3	5

The BY statement gets the correct grouping, but the values are still based on the Serial column instead of the Value column. I’ll tell PROC TRANSPOSE to use the Value column by inserting a VAR statement:

proc transpose data=narrow_sort 
                out=n2w_default;
	BY ID;
	var Value;
run;

Narrow to Wide - BY ID, VAR Value
Obs	ID	_NAME_	COL1	COL2	COL3	COL4	COL5
1	1	Value	1A	1B	1C	1D	1E
2	2	Value	2A	2D	2D	2E
3	3	Value	3A	3B	3C	3E

This is better, but in row 2, the data from ID2 Serial 3 is in COL2, and subsequent values are not aligned properly. In row 3, Value4 contains the value from Serial 5. I’d like meaningful, data-based column names. Instead of a _NAME_ column filled with ‘Value’, I’d like the column names containing the data to start with the prefix ‘Value’ with a numeric suffix corresponding to the Serial column values in the original dataset, and the data populated into the appropriate cell. I’ll add an ID statement to specify the variable containing the variable naming data, and the PREFIX= option to specify the names should start with Value. I can drop the _NAME_ column, as I don’t need it any longer:

proc transpose data=narrow_sort 
               out=n2w_final (drop=_Name_)
               prefix=Value;
   BY ID;
   var Value; 
   ID Serial;
run;

And violá - very nice output! And that code wasn’t too hard to write.

Narrow to Wide - Final
Obs	ID	Value1	Value2	Value3	Value4	Value5
1	1	1A	1B	1C	1D	1E
2	2	2A		2C	2D	2E
3	3	3A	3B	3C		3E

PROC TRANSPOSE – Wide to Narrow

Once again, I’ll start by using the simplest code possible:

proc transpose data=Wide
               out=Narrow_Default(rename=(COL1=Value _NAME_=Ser));
    by ID;
    var VALUE1-VALUE5;
run;

The result is almost what I want.

Wide to Narrow Default
Obs	ID	Ser	Value
1	1	Value1	1A
2	1	Value2	1B
3	1	Value3	1C
4	1	Value4	1D
5	1	Value5	1E
6	2	Value1	2A
7	2	Value2
8	2	Value3	2C
9	2	Value4	2D
10	2	Value5	2E
11	3	Value1	3A
12	3	Value2	3B
13	3	Value3	3C
14	3	Value4
15	3	Value5	3E

But there are no options or statements in PROC TRANSPOSE that allow extracting the number from the Ser column text to serve as the numeric value for a Serial column. I’ll also want to get rid of any rows where Value is missing. The simplest solution is to use a DATA step to post-process the result:

data Narrow_Final;
	length ID Serial 8;
	set Narrow_Default;
	Serial=input(compress(ser,,'kd'),10.);
	If not missing(Value);
	drop Ser;
run;

And that did it:

Wide to Narrow – Final
Obs	ID	Serial	Value
1	1	1	1A
2	1	2	1B
3	1	3	1C
4	1	4	1D
5	1	5	1E
6	2	1	2A
7	2	3	2C
8	2	4	2D
9	2	5	2E
10	3	1	3A
11	3	2	3B
12	3	3	3C
13	3	5	3E

DATA Step Transposition

Well, if I have to use a DATA step in some cases, why not just do the whole thing in a DATA step? That’s also a reasonable approach, and if you have other data manipulation you want to accomplish at the same time as the transposition, I’d even suggest it’s the best approach. Here’s the code for the narrow to wide transposition:

proc sort data=narrow out=narrow_sort;
	by id Serial;
run;

data n2w_datastep;
	set narrow_sort;
	by id serial;
	/* Array of new variables to accept values */
	array val[*] $2 Value1-Value5;
	/* Multiple input rows for one output row - need to retain values */
	retain Value1-Value5;
	if first.id then do;
		/* Clean up before starting work on a new set of rows */
		call missing(of val[*]);
	end;
	/* Assign value to correct array element */
	val[Serial]=Value;
	if last.id then do;
		/* This set finished - write output */
		output;
	end;
run;

Narrow to Wide – DATA Step
Obs	ID	Value1	Value2	Value3	Value4	Value5
1	1	1A	1B	1C	1D	1E
2	2	2A		2C	2D	2E
3	3	3A	3B	3C		3E

And here’s the code for the wide to narrow transposition:

data w2n_datastep;
	keep ID Serial Value;
	set wide;
	/* Array of incoming value variables */
	array val[*] $2 Value1-Value5;
	/* These two plus ID will make up a row in the output */
	length Serial 8 Value $2;
	/* For each array element value */
	do Serial=1 to dim(val);
		/* Grab the value and write output row */
		Value=Val[Serial];
		if not missing(Value) then output;
	end;
run;

Wide to Narrow – DATA Step
Obs	ID	Serial	Value
1	1	1	1A
2	1	2	1B
3	1	3	1C
4	1	4	1D
5	1	5	1E
6	2	1	2A
7	2	3	2C
8	2	4	2D
9	2	5	2E
10	3	1	3A
11	3	2	3B
12	3	3	3C
13	3	5	3E

I’ll admit that, on the surface, this code seems a bit more complex than the equivalent PROC TRANSPOSE code, but this approach is infinitely more flexible. Because I’m accustomed to using DATA step arrays in my work, this code actually feels less complex to me, and as a bonus it doesn’t require subsequent re-processing like the PROC TRANSPOSE solution did – everything got done in a single pass through the data.

Complex Transposition

Actual data is rarely as simple as the previous example. For instance, let’s consider a clinical trial data set. The data contains health information recorded during monthly clinic visits for a five year period:

Trial.Visits
Obs	VisitDate	ID	CHF	Weight	Diastolic	Systolic	SmokingAvg	Cholesterol
1	2019-01-01	F0011	N	67.5	178	88	150	14	224
2	2019-01-01	F0012	N	66.0	122	75	132	.	149
3	2019-01-01	F0027	N	61.5	138	93	165	.	264
4	2019-01-01	F0064	N	65.5	146	65	134	11	185
5	2019-01-01	F0084	N	65.0	163	80	126	.	321

PROC TRANSPOSE

First, the default transposition by ID, using VisitDate for column names:

proc transpose data=work.VisitsSorted
                out=work.proc_first_try 
                prefix=_;
	BY ID;
	ID VisitDate;
	format VisitDate yymon7.;
run;

PROC TRANSPOSE – First Try

Row	ID	_NAME_	_2019JAN	_2019FEB	_2019MAR		_2024OCT	_2024NOV	_2024DEC
1	F0001	Height	62.25	62.25	62.25	...	62.25	62.25	62.25
2	F0001	Weight	134	126	130	...	142	142	131
3	F0001	Diastolic	92	91	83	...	79	97	93
4	F0001	Systolic	159	180	174	...	169	174	178
5	F0001	SmokingAvg	12	11	10	...	11	10	13
6	F0001	Cholesterol	223	227	235	...	278	288	282
7	F0002	Height	65.75	65.75	65.75	...	65.75	65.75	65.75
8	F0002	Weight	154	168	165	...	145	163	149

Each row has 74 variables. To get all of the data for a particular ID on a single row, you’d need unique names for the columns from each row. I’ll use the value in the _NAME_ column as a prefix with the VisitDate as the suffix to create the column names. This example transposes the Height variable:

proc transpose data=trial.VisitsSmall 
                out=work.DSN_Height 
                prefix=Height;
	BY ID;
	ID VisitDate;
	Var Height;
	format VisitDate yymon7.;
run;

PROC TRANSPOSE – Height
Row	ID	_NAME_	Height2019JAN	Height2019FEB		Height2024NOV	Height2024DEC
1	F0001	Height	62.25	62.25	...	62.25	62.25
2	F0002	Height	65.75	65.75	...	65.75	65.75
3	F0003	Height	61.75	61.75	...	61.75	61.75
4	F0004	Height	64.75	64.75	...	64.75	64.75
5	F0005	Height	64.5	64.5	...	64.5	64.5
6	F0006	Height	67.75	67.75	...	67.75	67.75
7	F0007	Height	66.25	66.25	...	66.25	66.25
8	F0008	Height	64	64	...	64	64

To make this work, I’ll run a separate PROC TRANSPOSE step for each of the variables. After producing a data set for each measure variable, I’ll merge them by ID to produce a single row containing all of the measurement values. Because it’s a repetitive task, I’ll use a macro - the code for is in the series download package in a program file named Complex Transposition – PROC Transpose.sas. The final dataset contains 3118 rows and 433 columns. Here is a sample of the data:

Row	ID	Cholesterol2019JAN	Cholesterol2019FEB		Weight2024NOV	Weight2024DEC
1	F0001	223	227	...	142	131
2	F0002	262	268	...	163	149
3	F0003	196	197	...	136	135
4	F0004	198	199	...	142	137
5	F0005	199	198	...	134	134
6	F0006	202	202	...	173	149
7	F0007	193	192	...	145	137
8	F0008	209	206	...	144	140

This entire process, including all of the PROC TANSPOSE steps and the DATA step to merge the results, ran in about 2.5 seconds.

DATA Step Transposition

The DATA step code for this gets a bit tricky because so many variables are involved. To avoid having to type over 400 new variable names by hand and debug the inevitable typos, I chose to use PROC SQL to do a lot of the typing for me and put the result into macro variables.

proc sql noprint;
/* Create a list of all the new numeric variable names */
select distinct cats(Name,Date) 
	into :Varlist separated by ' '
	from dictionary.Columns as c
   ,(select distinct PUT(VisitDate,yymon7.) as Date
        from trial.Visits) as b
	where libname='TRIAL' and MEMNAME='VISITS'
	and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE- VARLIST=%superq(VARLIST);*/

/* Create a series of ARRAY statements for each measurement array */
select distinct catx(' ','array', cats(substr(Name,1,4),'[*]'),cats(substr(Name,1,4),':;'))
	into :Arrays separated by ' '
	from dictionary.Columns as c
	where libname='TRIAL' and MEMNAME='VISITS'
	and Name not in('ID','VisitDate') and type='num'
;
/*%put NOTE: Arrays=%superq(arrays);*/

/* Create a list of distinct YYYYMM values we will use to index the arrays */
select distinct PUT(VisitDate,yymon7.) 
   into :DateList separated by ' '
   from trial.Visits
;
/*%put NOTE- DateList=%superq(datelist);*/
quit;

Next, I used a DATA step to transpose the data:

data n2w_datastep;
   /* Establish the PDV in the order we desire */
	length ID $8 &VARLIST 8;
	/* Retain values for all new variables */
	retain &VARLIST;
	/* Set up an array for each measurement */
	&arrays;
	set work.VisitsSorted;
	by id VisitDate;
	if first.id then do;
		/* Clean up before starting work on a new set of rows */
		call missing(%qsysfunc(translate(&varlist,',',' ')));
	end;
	/* The date of this record determines the array element to be loaded */
	element=findw("&DateList",PUT(VisitDate,yymon7.), ' ', 'E');
	/* Assign values to correct array elements */
	Heig[element]=Height;
	Weig[element]=Weight;
	Dias[element]=Diastolic;
	Syst[element]=Systolic;
	Smok[element]=SmokingAvg;
	Chol[element]=Cholesterol;
	if last.id then do;
		/* Finished with this ID - write output */
		output;
	end;
	drop element CHF--Cholesterol;
run;

This worked perfectly. The PROC SQL step ran in 0.25 seconds, and the DATA step in 0.2, for an overall time of less than half a second - much faster than the PROC TRANSPOSE process, if just a bit trickier to code. The nice thing about the DATA step is that I could have done additional processing, like computing overall statistics once the values for an ID are completely loaded.

The Pivot Point: Choosing a Method

I’ve used the DATA step to transpose data for years, mostly when other data manipulation was required during transposition. And for sheer speed and efficiency, the DATA step is hard to beat. But for simple, one-column transpositions PROC TRANSPOSE may be the easier choice.

Which technique for transposing data is your current “go to”? Has the code show here given you any new ideas?

Until next time, may the SAS be with you!

Mark

Grab the ZIP file containing the code and data for this series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:

Find more articles from SAS Global Enablement and Learning here.

Ksharp · ‎11-06-2024

MERGE skill is also a powerful and flexible tool to transpose data. I wrote a article about it before for somebody who need and interest it .

https://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

SASJedi · ‎11-07-2024

@Ksharp - man, that's a great paper, and a very nice technique! Thanks for sharing 😊

nbrucken · ‎11-21-2024

Very helpful article! The DOW-Loop also comes in handy for narrow-to-wide and complex transpositions. I presented a paper showing some examples at PharmaSUG a few years back- "Getting Your Data in Shape with PROC TRANSPOSE" (https://www.lexjansen.com/pharmasug/2012/TF/PharmaSUG-2012-TF10.pdf)

ballardw · ‎11-21-2024

An example of the pitfalls, and solutions, that await attempting to transpose character and numeric values in one pass might be appropriate.