Hello
i have a dataset with 100k rows and 20 variables
Each row represents a hospital visit by a persons - but i would like to combine the visits onto 1 row. i have been looking at proc transpose and proc array but i am not getting what i would like.
has anyone come across this kind of data manipulation before - any advice would be great
thanks
Lorraine
So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.
Will there only be a maximum of 2 rows per identifier? If so then:
Data have ; input identifier house area $ visit $ Treatment $; cards; 66 33 dublin Jan yes 1300 22 cork March yes 1300 22 cork May yes 2246 4 galway July yes 2246 4 galway October yes 1449 66 dublin November no ; run; proc sort data=have; by identifier; run; data want (drop=element h a v t); set have (rename=(house=h area=a visit=v treatment=t)); by identifier; array house{2} 8.; array area{2} $20; array visit{2} $20; array treatment{2} $20; retain element house: area: visit: treatment:; if first.identifier then do; call missing(of house:,of area:,of visit:,of treatment:); element=1; end; else element=element+1; house{element}=h; area{element}=a; visit{element}=v; treatment{element}=t; if last.identifier then output; run;
If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.
Generally speaking it's much easier to program with a long data set than a wide one but if you must do it then Proc Transpose is the best way. Can you post a data step with some sample data (not an excel file) and let us know how you want it transposed.
Data have ;
input identifier house area $ visit $ Treatment $;
cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no
;
run;
data want | |||||||||
identifier | house | area | visit | Treatment | identifier2 | house2 | area2 | visit | Treatment2 |
66 | 33 | dublin | January | yes | |||||
1300 | 22 | cork | March | yes | 1300 | 22 | cork | May | yes |
2246 | 4 | galway | July | yes | 2246 | 4 | galway | October | yes |
1449 | 66 | dublin | November | no |
thanks for your replies
So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.
Will there only be a maximum of 2 rows per identifier? If so then:
Data have ; input identifier house area $ visit $ Treatment $; cards; 66 33 dublin Jan yes 1300 22 cork March yes 1300 22 cork May yes 2246 4 galway July yes 2246 4 galway October yes 1449 66 dublin November no ; run; proc sort data=have; by identifier; run; data want (drop=element h a v t); set have (rename=(house=h area=a visit=v treatment=t)); by identifier; array house{2} 8.; array area{2} $20; array visit{2} $20; array treatment{2} $20; retain element house: area: visit: treatment:; if first.identifier then do; call missing(of house:,of area:,of visit:,of treatment:); element=1; end; else element=element+1; house{element}=h; area{element}=a; visit{element}=v; treatment{element}=t; if last.identifier then output; run;
If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.
thanks so much for this - i will give it a go.
the number of visits will vary per person from 1 up to maybe 10
thanks again
Lorraine
Please post test data in the form of a datastep, Excel neither conveys structure, nor is it safe for users to download. You can use this post;
To get a datastep version of your test data, or just type it in quickly.
Also a good idea to post what the output should look like. Proc transpose should be able to handle any normal to transposed assuming only one variable, if more than one variable then arrays is the way to go.
i've some across %multitranspose - which uses macros - anyone familiar with this? macros are not my best friend!
http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/multitranspose.html#sascode
Please describe what you are going to do with that data after you have made it wide.
Since you are going to have different numbers of variables for each patient you'll likely need to adjust logic for the differing number of variables adding much difficulty to any analysis logic.
And if this is for a report it may be easier to use the data directly in a report that will handle repeated values as a column header such as adding a visit number per patient and then using that as a column group variable.
thansk for the reply - i've been on holidays so only getting back to this now
i am essentially converting hospital data - which is currently structured by visit - and i want to structure it by patient, so that all the visit information is on the one row. then i will carry out analysis on the patient in SAS
Its unlikely that having visit data as columns would make your analysis any easier. The CDISC models, pretty standard for clinical data are normalised as this simplifies selection and analysis data. For instance, if you had:
USUBJID VISIT1 VISIT2 VISIT3 ...
Then your programming would need to know the number of visits, an early program may have 3, but full data may have more, therefor you have to program for all these eventualities. The same thing as
USUBJID VISIT
...
Uses just those two variables, and then runs on the data that can be as long or short as needed, and with by group processing, you can do analysis - such as means - per visit as:
proc means data=...; by visit; ... run;
Which will produce results per visit.
There are also going to be a lot of empty cells if most people only have 2 or 3 visits with a tiny number having the full 10. This means a much bigger file with a lot of wasted disk space and longer processing times. In short while I agree "wide" is best for Excel "long" is better for almost any other environment including SAS.
i agree Chris - higher powers want it set up like this though!!
thanks all for your help on this - the SAS skills are a bit rusty so looking forward to increase my learning 🙂
I feel your pain - I've been there also 🙂
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.