Re: Reshaping a large dataset from long to wide with many variables

jss539 · Posted 08-20-2020 02:28 PM

I have a large dataset in SAS that I need to change to wide from long.

Basically, the data is sorted with multiple observations for each property for about 10 different years with around 150 variables. I would like to make it so each variable is a different year and have one observation per property. Each variable is uniquely named and I want to keep the name in it.

What I have:

Property Year Var1 ... Var150

Prop1 2010 100 ... ABC

Prop1 2011 101 ... DEF

.

Prop1 2017 138 ... XYZ

Prop2 ...

What I want:

Property Var1_2010 ... Var1_2017 ... Var150_2010 ... Var150_2017

Prop1 100 ... 138 ... ABC ... XYZ

Prop2 ...

I tried using Proc Transpose, but I cannot find a way to concatenate the year at the end of each original variable name.

I also tried:

proc sort data=hotels;
by propertyID year;
run;

proc transpose data=hotels out=hotels_wide;
by propertyID year;
var _all_;
run;

proc transpose data=hotels_wide delimiter=_ out=hotels_wide_full(drop=_name_);
by propertyID;
var col1;
id _name_ year;
run;

But the last block gave me an error telling me there are multiple occurrences of each variable for each propertyID: ERROR: The ID value "RmsServiceChargeDistribution_2E3" occurs twice in the same BY group.

A shortcoming of this would also be losing the formats of each variable.

Cynthia_sas · Posted 08-21-2020 06:49 PM

Hi:
I don't see how this is going from WIDE to LONG. Right now, you have multiple rows for PROP1 and multiple rows for PROP2, one for each year. I would consider that already LONG. If you change it so that you have 1 value for Property and up to 150 separate variables for each year, that seems to me you are going from sort of wide, but mostly long to really, really wide and not long at all.

Or, to do the math based on your explanation, if you had a dataset with only 3 rows for Prop1 and 150 variables for each row then that's this:
PROP1 2010 with 150 variables
PROP1 2011 with 150 variables
PROP1 2012 with 150 variables

But, if you now collapse those 3 rows into just 1 row for PROP1, then you'll now have 1 row with 450 variables. That seems like a much wider dataset to me. What do you need to do with this data. Most SAS procedures would be much happier dealing with the data as you have it now, than in the structure you're proposing. What is the downstream process for this very, very wide table?

Cynthia

PaigeMiller · Posted 08-21-2020 08:17 PM

Converting this long data set to wide will give you a programming nightmare. There's no easy way to handle such a wide data set, other than hard coding all the variable names when you want to use them. You'd be much better off leaving the data as long, and then built in SAS methods, such as BY variables and PROC REPORT and many others, make handling such a long data set very easy. Why do you need this data set as a wide data set?

--
Paige Miller

Cynthia_sas · Posted 08-21-2020 08:18 PM

Hi:

The formats are the least of your worries. When you do a transpose with VAR _ALL_; if you have a mix of numeric and character variables, all your numeric variables will be converted to character.

As an example, consider this:

I think the result after the second transpose is what you want, but the challenge is that I only had 2 properties and 3 years for each property and then 4 variables (VAR1-VAR4) for each row. With 150 variables on each row, this would quickly become unmanageable and all your numeric variables would now be character too. The program actually does 2 transposes. The first transpose gets you a variable for year and then _NAME_ holds the value of the original variable name. Next, with a DATA step, you make a variable called NEWNAME that can be used in a second transpose,

But still I think your original structure is more useful for most SAS procedures.

Cynthia

here's the program I used to make the fake data:

 data fakedata;
  infile datalines dlm=',';
  input propertyid $ year var1 var2 $ var3 var4 $;
return;
datalines;
Prop1,2010,101,ABC,102,DEF
Prop1,2011,201,GHI,202,JKL
Prop1,2012,301,MNO,302,PQR
Prop2,2010,401,KJI,402,NML
Prop2,2011,501,OPQ,502,TSR
Prop2,2012,601,WUV,602,ZYX
;
run;

Tom · Posted 08-21-2020 10:28 PM

The code is fine. But the example does not reflect the actual data. The error messages makes if very clear what the issue is.

The original dataset has a variable with a name that is already waaaaaaay too long.

1234567890123456789012345678012
RmsServiceChargeDistribution

Since there is not room to add 5 more characters to the end of the existing name the year was reduced to 3 characters. But that means that the new name is no longer unique.

Either use more reasonable lengths for the variable names to start with, or you will need to add a step to create unique names for the new variables.

Reshaping a large dataset from long to wide with many variables