BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Lorraine22
Obsidian | Level 7

Hello

 

i have a dataset with 100k rows and 20 variables

Each row represents a hospital visit by a persons - but i would like to combine the visits onto 1 row. i have been looking at proc transpose and proc array but i am not getting what i would like.

has anyone come across this kind of data manipulation before - any advice would be great

thanks

Lorraine

1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.

Will there only be a maximum of 2 rows per identifier?  If so then:

Data have ;
input identifier house area $ visit $ Treatment $;
cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no
;
run;

proc sort data=have;
  by identifier;
run;

data want (drop=element h a v t);
  set have (rename=(house=h area=a visit=v treatment=t));
  by identifier;
  array house{2} 8.;
  array area{2} $20;
  array visit{2} $20;
  array treatment{2} $20;
  retain element house: area: visit: treatment:;
  if first.identifier then do;
    call missing(of house:,of area:,of visit:,of treatment:);
    element=1;
  end;
  else element=element+1;
  house{element}=h;
  area{element}=a;
  visit{element}=v;
  treatment{element}=t;
  if last.identifier then output;
run;

If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.

 

View solution in original post

13 REPLIES 13
ChrisBrooks
Ammonite | Level 13

Generally speaking it's much easier to program with a long data set than a wide one but if you must do it then Proc Transpose is the best way. Can you post a data step with some sample data (not an excel file) and let us know how you want it transposed.

Lorraine22
Obsidian | Level 7


Data have ;

input identifier house area $ visit $ Treatment $;

cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no

;
run;

 

data want         
identifierhouseareavisitTreatmentidentifier2house2area2visitTreatment2
6633dublinJanuaryyes     
130022corkMarchyes130022corkMayyes
22464galwayJulyyes22464galwayOctoberyes
144966dublinNovemberno     

 

thanks for your replies

RW9
Diamond | Level 26 RW9
Diamond | Level 26

So you have multiple variables to transpose up, so proc transpose would be cumbersome, hence we go into arrays.

Will there only be a maximum of 2 rows per identifier?  If so then:

Data have ;
input identifier house area $ visit $ Treatment $;
cards;
66 33 dublin Jan yes
1300 22 cork March yes
1300 22 cork May yes
2246 4 galway July yes
2246 4 galway October yes
1449 66 dublin November no
;
run;

proc sort data=have;
  by identifier;
run;

data want (drop=element h a v t);
  set have (rename=(house=h area=a visit=v treatment=t));
  by identifier;
  array house{2} 8.;
  array area{2} $20;
  array visit{2} $20;
  array treatment{2} $20;
  retain element house: area: visit: treatment:;
  if first.identifier then do;
    call missing(of house:,of area:,of visit:,of treatment:);
    element=1;
  end;
  else element=element+1;
  house{element}=h;
  area{element}=a;
  visit{element}=v;
  treatment{element}=t;
  if last.identifier then output;
run;

If there are more than two, just increase 2 to whatever you need, or if you want extra glory, precalculate how many you need and drop that in a macro variable and use that.

 

Lorraine22
Obsidian | Level 7

thanks so much for this - i will give it a go.

the number of visits will vary per person from 1 up to maybe 10

 

thanks again

Lorraine

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Please post test data in the form of a datastep, Excel neither conveys structure, nor is it safe for users to download.  You can use this post;

https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

 

To get a datastep version of your test data, or just type it in quickly.

 

Also a good idea to post what the output should look like.  Proc transpose should be able to handle any normal to transposed assuming only one variable, if more than one variable then arrays is the way to go.

Lorraine22
Obsidian | Level 7

i've some across %multitranspose - which uses macros - anyone familiar with this? macros are not my best friend!

 

http://www.medicine.mcgill.ca/epidemiology/joseph/pbelisle/multitranspose.html#sascode

ballardw
Super User

Please describe what you are going to do with that data after you have made it wide.

 

Since you are going to have different numbers of variables for each patient you'll likely need to adjust logic for the differing number of variables adding much difficulty to any analysis logic.

 

And if this is for a report it may be easier to use the data directly in a report that will handle repeated values as a column header such as adding a visit number per patient and then using that as a column group variable.

Lorraine22
Obsidian | Level 7

thansk for the reply - i've been on holidays so only getting back to this now

 

i am essentially converting hospital data - which is currently structured by visit - and i want to structure it by patient, so that all the visit information is on the one row. then i will carry out analysis on the patient in SAS

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Its unlikely that having visit data as columns would make your analysis any easier.  The CDISC models, pretty standard for clinical data are normalised as this simplifies selection and analysis data.  For instance, if you had:

USUBJID VISIT1 VISIT2 VISIT3 ...

 

Then your programming would need to know the number of visits, an early program may have 3, but full data may have more, therefor you have to program for all these eventualities.  The same thing as

USUBJID  VISIT

...

Uses just those two variables, and then runs on the data that can be as long or short as needed, and with by group processing, you can do analysis - such as means - per visit as:

proc means data=...;
  by visit;
...
run;

Which will produce results per visit.

ChrisBrooks
Ammonite | Level 13

There are also going to be a lot of empty cells if most people only have 2 or 3 visits with a tiny number having the full 10. This means a much bigger file with a lot of wasted disk space and longer processing times. In short while I agree "wide" is best for Excel "long" is better for almost any other environment including SAS.

Lorraine22
Obsidian | Level 7

i agree Chris - higher powers want it set up like this though!!

Lorraine22
Obsidian | Level 7

thanks all for your help on this - the SAS skills are a bit rusty so looking forward to increase my learning 🙂

ChrisBrooks
Ammonite | Level 13

I feel your pain - I've been there also 🙂

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2849 views
  • 4 likes
  • 4 in conversation