I have a large dataset of patients with multiple diagnosis codes each. I want to create a new variable 'Final_Dx' that contains the most relevant diagnosis based on an ordered preference. Thank you!
Hierarchical Dx Order
1) Cancer
2) MI
3) Diabetes
4) Obesity
5) COPD
CaseID |
Dx1 |
Dx2 |
Dx3 |
FINALDX(Desired) |
123 |
Cancer |
Obesity |
|
Cancer |
456 |
COPD |
Obesity |
|
Obesity |
789 |
Diabetes |
MI |
COPD |
MI |
112 |
Other |
COPD |
|
COPD |
proc format;
invalue fmt(upcase)
CANCER=1
MI=2
DIABETES=3
OBESITY=4
COPD=5
;
run;
data have;
input caseid Dx1 $ Dx2 $ Dx3 $ ;
cards;
123 Cancer Obesity .
456 COPD Obesity .
789 Diabetes MI COPD
112 Other COPD .
;
run;
data want;
set have;
length FINALDX $ 40;
min=999999;
array x{*} Dx:;
do i=1 to dim(x);
n=input(x{i},?? fmt.);
if not missing(n) and n<min then do;min=n;FINALDX=x{i};end;
end;
drop i n min;
run;
Please provide example data in the form of data step code. The stuff you posted doesn't even like to paste into an editor to right a data step.
Untested because of no data:
data want;
set have;
array dx(*) dx1-dx3;
array f (5) $ 20 _temporary_ ('COPD','Obesity','Diabetes','MI','Cancer');
if cmiss(of dx(*)) < dim(dx) then do;
pos = max(whichc(dx1,of f(*)),whichc(dx2,of f(*)),whichc(dx3,of f(*)));
if pos then Finaldiagnosis = f[pos];
end;
drop pos;
run;
The WhichC, and numeric equivalent WhichN, return the number position in a list of values. The array F has the values with the most important (assuming 1 meant such) as the last position because the Whichc function returns a 0 if not found and using a Min function to attempt to return 1 wouldn't work with a lot of extra coding. The MAX function however has no problem so returns the largest value when found and if not zero can be used as the index into the temporary array holding the list desired diagnosis values. The CMISS is so the searchis only even attempted if at least one DX value is found.
In the absence of a working data step for sample data, here is untested code.
data want;
set have;
array _preferences {5} $8 _temporary_ ("Cancer","MI","Diabetes","Obesity","COPD");
do i=1 to 5 until ( whichc(_preferences{i}, of dx:));
end;
if i<6 then finaldx=_preferences{i};
run;
I figured it out. Here is the test data and syntax.
data have;
input caseid Dx1 $ Dx2 $ Dx3 $ ;
cards;
123 Cancer Obesity .
456 COPD Obesity .
789 Diabetes MI COPD
112 Other COPD .
;
run;
data want;
set have;
array dx (3) $20 Dx1 - Dx3;
do i = 1 to dim(dx);
if dx(i)= 'Cancer' then Cancer=1;
else if Cancer=. then Cancer=0;
if dx(i)= 'Obesity' then Obesity=1;
else if Obesity=. then Obesity=0;
if dx(i) = 'COPD' then COPD=1;
else if COPD=. then COPD=0;
if dx(i) = 'Diabetes' then Diabetes=1;
else if Diabetes=. then Diabetes=0;
if dx(i) ='MI' then MI=1;
else if MI=. then MI=0;
if dx(i) ='Other' then Other=1;
else if Other=. then Other=0;
end;
if Cancer=1 then FinalDx=1;
else if MI =1 then FinalDx=2;
else if Diabetes=1 then FinalDx=3;
else if Obesity=1 then FinalDx=4;
else if COPD=1 then FinalDx=5;
else FinalDx=6;
drop i;
format FinalDx commonhier.;
Run;
I will say "Good for you" for getting a solution.
I will add "I don't want to see the solution for 15 diagnosis fields and 10 in the priority list".
I agree - my solution worked but is a bit clunky. I love improving my SAS skills via this community!
proc format;
invalue fmt(upcase)
CANCER=1
MI=2
DIABETES=3
OBESITY=4
COPD=5
;
run;
data have;
input caseid Dx1 $ Dx2 $ Dx3 $ ;
cards;
123 Cancer Obesity .
456 COPD Obesity .
789 Diabetes MI COPD
112 Other COPD .
;
run;
data want;
set have;
length FINALDX $ 40;
min=999999;
array x{*} Dx:;
do i=1 to dim(x);
n=input(x{i},?? fmt.);
if not missing(n) and n<min then do;min=n;FINALDX=x{i};end;
end;
drop i n min;
run;
If you want to extend to any number of diagnoses, here is a way to do so with SQL :
data have;
input caseId (Dx1 - Dx3) ($) ;
datalines;
123 Cancer Obesity .
456 COPD Obesity .
789 Diabetes MI COPD
112 Other COPD .
;
data order;
input diag $ order;
datalines;
CANCER 1
MI 2
DIABETES 3
OBESITY 4
COPD 5
;
data haveLong;
set have;
array d $ dx:;
do IdDx = 1 to dim(d);
if not missing (d{IdDx}) then do;
Diag = d{IdDx};
output;
end;
end;
keep caseId IdDx Diag;
run;
proc sql;
create table want as
select
have.*, final_Dx
from
have left join
(select
caseId,
a.Diag as final_Dx
from
haveLong as a inner join
order as b on upcase(a.diag) = b.diag
group by caseId
having order = min(order) ) as c
on have.caseId = c.caseId;
quit;
Thanks PGStats! This works perfectly too but I was only able to accept one solution.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.