Solved: Re: long to wide

Lorraine22 · Posted 07-28-2017 08:18 AM

Hello

i have a dataset with 100k rows and 20 variables

Each row represents a hospital visit by a persons - but i would like to combine the visits onto 1 row. i have been looking at proc transpose and proc array but i am not getting what i would like.

has anyone come across this kind of data manipulation before - any advice would be great

thanks

Lorraine

RW9 · Posted 07-28-2017 10:35 AM

So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.

Will there only be a maximum of 2 rows per identifier? If so then:

Data have ;
input identifier house area $ visit $ Treatment $;
cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no
;
run;

proc sort data=have;
  by identifier;
run;

data want (drop=element h a v t);
  set have (rename=(house=h area=a visit=v treatment=t));
  by identifier;
  array house{2} 8.;
  array area{2} $20;
  array visit{2} $20;
  array treatment{2} $20;
  retain element house: area: visit: treatment:;
  if first.identifier then do;
    call missing(of house:,of area:,of visit:,of treatment:);
    element=1;
  end;
  else element=element+1;
  house{element}=h;
  area{element}=a;
  visit{element}=v;
  treatment{element}=t;
  if last.identifier then output;
run;

If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.

View solution in original post

ChrisBrooks · Posted 07-28-2017 08:24 AM

Generally speaking it's much easier to program with a long data set than a wide one but if you must do it then Proc Transpose is the best way. Can you post a data step with some sample data (not an excel file) and let us know how you want it transposed.

Lorraine22 · Posted 07-28-2017 09:41 AM

Data have ;

input identifier house area $ visit $ Treatment $;

cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no

;
run;

data want
identifier	house	area	visit	Treatment	identifier2	house2	area2	visit	Treatment2
66	33	dublin	January	yes
1300	22	cork	March	yes	1300	22	cork	May	yes
2246	4	galway	July	yes	2246	4	galway	October	yes
1449	66	dublin	November	no

thanks for your replies

RW9 · Posted 07-28-2017 10:35 AM

So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.

Will there only be a maximum of 2 rows per identifier? If so then:

Data have ;
input identifier house area $ visit $ Treatment $;
cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no
;
run;

proc sort data=have;
  by identifier;
run;

data want (drop=element h a v t);
  set have (rename=(house=h area=a visit=v treatment=t));
  by identifier;
  array house{2} 8.;
  array area{2} $20;
  array visit{2} $20;
  array treatment{2} $20;
  retain element house: area: visit: treatment:;
  if first.identifier then do;
    call missing(of house:,of area:,of visit:,of treatment:);
    element=1;
  end;
  else element=element+1;
  house{element}=h;
  area{element}=a;
  visit{element}=v;
  treatment{element}=t;
  if last.identifier then output;
run;

If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.

Lorraine22 · Posted 07-28-2017 10:57 AM

thanks so much for this - i will give it a go.

the number of visits will vary per person from 1 up to maybe 10

thanks again

Lorraine

RW9 · Posted 07-28-2017 09:00 AM

Please post test data in the form of a datastep, Excel neither conveys structure, nor is it safe for users to download. You can use this post;

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

To get a datastep version of your test data, or just type it in quickly.

Also a good idea to post what the output should look like. Proc transpose should be able to handle any normal to transposed assuming only one variable, if more than one variable then arrays is the way to go.

Lorraine22 · Posted 07-28-2017 10:09 AM

i've some across %multitranspose - which uses macros - anyone familiar with this? macros are not my best friend!

http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/multitranspose.html#sascode

ballardw · Posted 07-28-2017 11:48 AM

Please describe what you are going to do with that data after you have made it wide.

Since you are going to have different numbers of variables for each patient you'll likely need to adjust logic for the differing number of variables adding much difficulty to any analysis logic.

And if this is for a report it may be easier to use the data directly in a report that will handle repeated values as a column header such as adding a visit number per patient and then using that as a column group variable.

Lorraine22 · Posted 08-15-2017 06:49 AM

thansk for the reply - i've been on holidays so only getting back to this now

i am essentially converting hospital data - which is currently structured by visit - and i want to structure it by patient, so that all the visit information is on the one row. then i will carry out analysis on the patient in SAS

RW9 · Posted 08-15-2017 07:13 AM

Its unlikely that having visit data as columns would make your analysis any easier. The CDISC models, pretty standard for clinical data are normalised as this simplifies selection and analysis data. For instance, if you had:

USUBJID VISIT1 VISIT2 VISIT3 ...

Then your programming would need to know the number of visits, an early program may have 3, but full data may have more, therefor you have to program for all these eventualities. The same thing as

USUBJID VISIT

...

Uses just those two variables, and then runs on the data that can be as long or short as needed, and with by group processing, you can do analysis - such as means - per visit as:

proc means data=...;
  by visit;
...
run;

Which will produce results per visit.

ChrisBrooks · Posted 08-15-2017 08:09 AM

There are also going to be a lot of empty cells if most people only have 2 or 3 visits with a tiny number having the full 10. This means a much bigger file with a lot of wasted disk space and longer processing times. In short while I agree "wide" is best for Excel "long" is better for almost any other environment including SAS.

Lorraine22 · Posted 08-16-2017 03:53 AM

i agree Chris - higher powers want it set up like this though!!

Lorraine22 · Posted 08-16-2017 03:55 AM

thanks all for your help on this - the SAS skills are a bit rusty so looking forward to increase my learning 🙂

ChrisBrooks · Posted 08-16-2017 11:48 AM

I feel your pain - I've been there also 🙂

Registration is open

SAS Training: Just a Click Away