Solved: Re: Reshaping a large data set from long to wide

mcdj · Posted 10-19-2015 04:05 PM

Hi everyone,

I am trying to reshape my large datset with lots of missing data and variable names that are mixed character and numeric (none of that I want to drop). I have read about both the PROC TRANSPOSE and the ARRAY options to reshape from long to wide and both seem to include the need to write out all of the names of my hundreds of variables. I will need to reshape the data by two different variables ID and Sex in order to make the VISITDATE variable that is currently long, wide. The issue is that many of the VISITDATEs are a different date over a 10 year period in the format DDmonthYY (12march04). Because of this I am thinking I can somehow reshape and make new vars VISITDATE1-VISITDATE20 (the max in the dataset) but having a hard time wrapping my head around the actual array coding.

Can some brilliant programmer in dates and arrays help out?

Thank you so much in advance,

Megan

slangan · Posted 10-21-2015 08:41 AM

There is a macro that transposes several variables at once. I've never used it.

%MultiTranspose

http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/multitranspose.html

View solution in original post

ballardw · Posted 10-19-2015 06:36 PM

One: some example data and what the final output should be is helpful.

Two: What will you do next with the data? If you don't want to write variable names when creating them the issue will magnify when trying to manipulate them.

If this is a report then proc tabulate and proc report both will generate a column for each level of a variable defined to create groups (report) or class (tabulate).

If the variables are named "nicely" then variable lists do help. If you have a bunch of variables of the same typle that all start with DATE, then DATE: <= note the colon, will allow you to assign them all to an array easily, or if they have sequential names such as Var1 to Var23 then a range list like Var1 - Var23 does similar.

A little concerned about ID and SEX involved with setting a Date variable though.

mcdj · Posted 10-20-2015 11:42 AM

HI there, thanks for replying. To answer your Qs:

1) Please note that the HTID is the SAME for 2 people (based on a couples study) so I am trying to make individual IDs and you have to do this by sorting by htid and sex. Also note there are multiple observtaion (rows) for each member of the couple (HTID) at different times. Also the variables are unfortuantely not named in a way where I can use an array var1-var25 so will likely need to rename them temporarily to get the data in shape.

Example data have:

HTID SEX AGE VISITDATE var4 var5 var6....var200

5014 M 35 10march2002

5014 M 35 02june2002

5014 M 35 18sept2002

5014 M 35 02jan2003

5014 F 30 10march2002

5014 F 30 26june2002

5023 M 28 20june2003

5023 M 28 21sept2003

5023 M 28 5jan2004

Example data want: each individual with their own row with each visitdate as an unique event

ID HTID SEX AGE VISITDATE1 VISITDATE2 VISITDATE3.....

1 5014 M 35 10march2002 02june2002 18sept2002

2 5014 F 30 10march2002 26june2002

...

2) I am happy to write variable names if that is the only way, I just wasn't sure. I thought maybe it would be easier to rename them to some convention like var1-var200 in order to get the data in shape first with arrays, etc.

Eventually, after descriptive stats, etc. I will be be perfoming some multivariate logistic regressions and build marginal structural models etc.

Patrick · Posted 10-20-2015 03:39 AM

Please provide some sample data "want" and explain us what you need and why.

It's most of the time much better to have data in a long instead of a wide structure so I would like to understand why you want it the other way round.

mcdj · Posted 10-20-2015 11:44 AM

Hi Patrick,

Please note that the HTID is the SAME for 2 people (based on a couples study) so I am trying to make individual IDs and you have to do this by sorting by htid and sex. Also note there are multiple observtaion (rows) for each member of the couple (HTID) at different times. Also the variables are unfortuantely not named in a way where I can use an array var1-var25 so will likely need to rename them temporarily to get the data in shape.

Example data have:

HTID SEX AGE VISITDATE var4 var5 var6....var200

5014 M 35 10march2002

5014 M 35 02june2002

5014 M 35 18sept2002

5014 M 35 02jan2003

5014 F 30 10march2002

5014 F 30 26june2002

5023 M 28 20june2003

5023 M 28 21sept2003

5023 M 28 5jan2004

Example data want: each individual with their own row with each visitdate as an unique event

ID HTID SEX AGE VISITDATE1 VISITDATE2 VISITDATE3.....

1 5014 M 35 10march2002 02june2002 18sept2002

2 5014 F 30 10march2002 26june2002

...

Eventually, after descriptive stats, etc. I will be be perfoming some multivariate logistic regressions and build marginal structural models etc.

Tom · Posted 10-20-2015 12:53 PM

You can use two variables to uniquely define your individuals. PROC MEANS; BY HTID SEX; for example.

Or you can make a new unique id variable if you want.

data want ;

set have;

by htid sex;

uid + first.sex ;

run;

mcdj · Posted 10-20-2015 04:00 PM

Thank you. So that will just give me a unique ID. But what about then reshaping VISIDATE? Especially since I want to transpose multiple variables (var4_visit1, var4_visit2, etc. ) by visitdate? Any idea if that is possible? THanks again.

slangan · Posted 10-20-2015 04:03 PM

I wouldn't want to try to transpose more than 2 variables especially in a large dataset. Since you mentioned that you have missing data and mixed formated data, you might end up with a lot of problems if you try to reshape multiple variables.

slangan · Posted 10-21-2015 08:41 AM

There is a macro that transposes several variables at once. I've never used it.

%MultiTranspose

http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/multitranspose.html

MikeZdeb · Posted 10-22-2015 11:20 AM

Hi, using the data youi posted ...

data x;
input htid sex :$1. age visitdate :anydtdte.;
datalines;
5014 M 35 10MARCH2002
5014 M 35 02JUNE2002
5014 M 35 18SEPT2002
5014 M 35 02JAN2003
5014 F 30 10MARCH2002
5014 F 30 26JUNE2002
5023 M 28 20JUNE2003
5023 M 28 21SEPT2003
5023 M 28 5JAN2004
;

proc sql noprint;
select max(obs) into :nobs from (select count(*) as obs from x group by htid, sex, age);
quit;

proc summary nway data=x missing;
class htid sex age;
output out=x_wide(drop=_type_ _freq_) idgroup(out[&nobs] (visitdate)=);
format v: date9.;
run;

data set X_WIDE ...

visitdate_ visitdate_ visitdate_ visitdate_
Obs htid sex age 1 2 3 4

1     5014     F      30    10MAR2002     26JUN2002             .             .
2     5014     M      35    10MAR2002     02JUN2002     18SEP2002     02JAN2003
3     5023     M      28    20JUN2003     21SEP2003     05JAN2004             .

Take a look at ...

Transposing Data Using PROC SUMMARY'S IDGROUP Option

http://support.sas.com/resources/papers/proceedings10/102-2010.pdf

mcdj · Posted 10-23-2015 05:54 PM

thanks, this worked too! I ended up using the Macro in the end but just checked it against your code which was great. Thanks!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away