Hi,
I am working on a large dataset that lists all follow-up conditions in a range of columns using ICD10 codes. ICD10 codes thankfully have a similar pattern where there is a letter and then a range of number for similar disorders (e.g. Cancer is C000 to C969). I want to go through each column and create a new variable if the person has had one of those types of disorders. I have provided an example below. Disease_X_0 is the column with the ICD10 code, and Cancer is the new variable I would like to create.
ID | Disease_1_0 | Disease_2_0 | Disease_3_0 | Disease_4_0 | Cancer |
1 | C001 | B005 | 1 | ||
2 | D55 | C97 | 1 | ||
3 | C97 | 1 | |||
4 | K00 | 0 | |||
5 | D57 | 0 |
I feel like there is an easy way to do this but cannot seem to find it with all my googling! Thank you so much in advance for your time!
you’re welcome !
You can mention a range as follows in the prxmatch function:
/C[0-500]/
If the ICD code must have 3 digits following the C letter, you can add this check :
AND prxmatch(‘/C\d\d\d/‘ ...)
if prxmatch('/C\d\d\d/',_Disease(i))>0 and prxmatch('/C[0-500]/',_Disease(i))>0 then cancer=1;
Hi,
I am working on a large dataset that lists all follow-up conditions in a range of columns using ICD10 codes. ICD10 codes thankfully have a similar pattern where there is a letter and then a range of number for similar disorders (e.g. Cancer is C000 to C969). I want to go through each column and create a new variable if the person has had one of those types of disorders. I have provided an example below. Disease_X_0 is the column with the ICD10 code, and Cancer is the new variable I would like to create.
ID | Disease_1_0 | Disease_2_0 | Disease_3_0 | Disease_4_0 | Cancer |
1 | C001 | B005 | 1 | ||
2 | D55 | C97 | 1 | ||
3 | C97 | 1 | |||
4 | K00 | 0 | |||
5 | D57 | 0 |
I feel like there is an easy way to do this but cannot seem to find it with all my googling! Thank you so much in advance for your time!
Hi @RAGC
You can try the below code. Please avoid duplicate posts 😉
The array function enables to perform the same manipulation on multiple variables.
Hope this helps!
Best,
data have;
infile datalines dlm="," dsd missover;
input ID Disease_1_0 $ Disease_2_0 $ Disease_3_0 $ Disease_4_0 $;
datalines;
1,C001,B005,,
2,D55, ,C97,
3,,C97,,,
4,K00,,,,
5,,D57,,,
;
run;
data want;
set have;
array _Disease (*) Disease_:;
cancer=0;
do i=1 to dim(_Disease);
if prxmatch('/C\d+/',_Disease(i))>0 then cancer=1; /* \d+ mean 1 digit or more */
end;
drop i;
run;
Hi @RAGC
You can try the below code.
The array function enables to perform the same manipulation on multiple variables.
Hope this helps!
Best,
data have;
infile datalines dlm="," dsd missover;
input ID Disease_1_0 $ Disease_2_0 $ Disease_3_0 $ Disease_4_0 $;
datalines;
1,C001,B005,,
2,D55, ,C97,
3,,C97,,,
4,K00,,,,
5,,D57,,,
;
run;
data want;
set have;
array _Disease (*) Disease_:;
cancer=0;
do i=1 to dim(_Disease);
if prxmatch('/C\d+/',_Disease(i))>0 then cancer=1; /* \d+ mean 1 digit or more */
end;
drop i;
run;
Hi @ed_sas_member,
My sincerest apologies for the duplicate posting! I realized I posted in the wrong section, and thought I had deleted it before posting it in here (which I hope is the appropriate section).
Your code worked super well! Thank you so much! If you would indulge me for one step further that I am struggling with. I see that you mentioned the d+ is for anything greater than 1 digit. If there a way to specify a range of the values (e.g. C001 to C500) even though there are up to C969 available characters?.
Best,
Rebecca
you’re welcome !
You can mention a range as follows in the prxmatch function:
/C[0-500]/
If the ICD code must have 3 digits following the C letter, you can add this check :
AND prxmatch(‘/C\d\d\d/‘ ...)
if prxmatch('/C\d\d\d/',_Disease(i))>0 and prxmatch('/C[0-500]/',_Disease(i))>0 then cancer=1;
Sorry @RAGC Late to the party. Anyways, for what it's worth
data have;
infile datalines dlm="," dsd missover;
input ID Disease_1_0 $ Disease_2_0 $ Disease_3_0 $ Disease_4_0 $;
datalines;
1,C001,B005,,
2,D55, ,C97,
3,,C97,,,
4,K00,,,,
5,,D57,,,
;
run;
data want;
set have;
array c(969) _temporary_ (1:969);
array t Disease_1_0--Disease_4_0;
do _n_=1 to dim(t) until(cancer=1);
cancer= first(t(_n_))='C' and input(compress(t(_n_),,'kd'),best.) in c;
end;
run;
proc print noobs;run;
ID | Disease_1_0 | Disease_2_0 | Disease_3_0 | Disease_4_0 | cancer |
---|---|---|---|---|---|
1 | C001 | B005 | 1 | ||
2 | D55 | C97 | 1 | ||
3 | C97 | 1 | |||
4 | K00 | 0 | |||
5 | D57 | 0 |
Oops!, I took the list as 1:969 where I missed to start from 0 i.e. 0:969.
Correction:
data have;
infile datalines dlm="," dsd missover;
input ID Disease_1_0 $ Disease_2_0 $ Disease_3_0 $ Disease_4_0 $;
datalines;
1,C000,B005,,
2,D55, ,C97,
3,,C97,,,
4,K00,,,,
5,,D57,,,
;
run;
data want;
set have;
array c(000:969) _temporary_ (0:969);
array t Disease_1_0--Disease_4_0;
do _n_=1 to dim(t) until(cancer=1);
cancer= first(t(_n_))='C' and input(compress(t(_n_),,'kd'),best.) in c;
end;
run;
proc print noobs;run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.