Dear Experts
I have a question regarding making a single variable into multiple variables
I have a variable (diagnoses1) that is defined in the following manor
date#diagnosis#typeofdiagnosis#date#diagnosis#typeofdiagnosis#...... and these pairs can be replicated for a maximum of 100 times, but I dont know exactly how many times for the individual patient;
The dates has the following format dd/mm/yyyy; The diagnosis is a stringvariable and the type of diagnosis is also a string variable.
So in real life an example could be
01/01/2001#df100#H#01/02/2001#df200#H#01/03/2001#...
What I would like is the following variables
d1date1 d1diagnosis1 d1type1
d1date2 d1diagnosis2 d1type2
...... for as many as there are in the initial variable diagnoses1.
I hope this makes sence.
I will look forward to seeing your suggestions
Regards
Solvej
Are the data items always present? if so:
data want; set have; do i=1 to countw(diag,"#") by 3; date=input(scan(diag,i,"#"),ddmmyy10.); diagnosis=scan(diag,i+1,"#"); type=scan(diag,i+3,"#"); output; end; run;
This will loop over the text and for each three items create the variables and output. Note, if there isn't always 3 then you will get errors. Post test data in the form of a datastep and required output, otherwise we have to guess.
Are the data items always present? if so:
data want; set have; do i=1 to countw(diag,"#") by 3; date=input(scan(diag,i,"#"),ddmmyy10.); diagnosis=scan(diag,i+1,"#"); type=scan(diag,i+3,"#"); output; end; run;
This will loop over the text and for each three items create the variables and output. Note, if there isn't always 3 then you will get errors. Post test data in the form of a datastep and required output, otherwise we have to guess.
Dear RW9
thank you for your answer.
It seems to be working. It does however produce a multiple of lines for each set of three. Can this be fixed in the statement or will I have to reshape in order to only have one line per initial variable?
This is what I have now
Diagnoses1 | date1 | diagnosis | type | ||||||
01/01/2001#df100#H#01/02/2001#df200#H#01/03/2001#... | 01-01-2001 | df100 | H | ||||||
01/01/2001#df100#H#01/02/2001#df200#H#01/03/2001#... | 01-02-2001 | df200 | H | ||||||
01/01/2001#df100#H#01/02/2001#df200#H#01/03/2001#... | 01-03-2001 | .. | .. | ||||||
What I would like | D1date1 | D1diagnosis1 | D1type1 | D1date2 | D1diagnosis2 | D1type2 | D1date3 | D1diagnosis3 | 1type3 |
01/01/2001#df100#H#01/02/2001#df200#H#01/03/2001#... | 01-01-2001 | df100 | H | 01-02-2001 | df200 | H | 01-03-2001 | .. | .. |
Kind regards
Solvej
Don't do that. Wide datasets are inherently harder to handle than long ones. By-group processing is easier than handling arrays, especially when the number of items for a group is not known beforehand.
Dear KurtBremser
I agree however for this specific task it would be very helpful onwards to have one line per patient.
Regards
Solvej
@Solvej wrote:
Dear KurtBremser
I agree however for this specific task it would be very helpful onwards to have one line per patient.
Regards
Solvej
No, it really doesn't. Having worked with health care data for years...it really really does not help. There are ways to account for things in a long format that is MUCH easier - once you learn it. Otherwise you'll be stuck using a lot of array loops.
That is by design. It is never a good idea to work with transposed data within programming. Industry standards - CDISC - are normalised also. The reason is that any transposed dataset requires far more programming and messy code to work with than normalised, and at the end of the day if you need a transposed dataset, then it is just a proc transpose step away. So from my original code:
data want; set have; ord=0; do i=1 to countw(diag,"#") by 3; ord=ord+1; date=input(scan(diag,i,"#"),ddmmyy10.); diagnosis=scan(diag,i+1,"#"); type=scan(diag,i+3,"#"); output; end; run; proc transpose data=want out=tran1 prefix=d1date; by ...; var date; id ord; run; proc transpose data=want out=tran2 prefix=d1diagnosis; by ...; var date; id ord; run; proc transpose data=want out=tran3 prefix=d1type; by ...; var date; id ord; run; data final; merge tran1 tran2 tran3; by ...; run;
Note I have put elipses in the by clause, as you have not shown any other variables so I don't know these.
Again, I really advise you not to transpose any data unless it is really needed - i.e. in a final output report or something - no good will come of it.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.