How do I write a regular expression to identify cases coded using ICD-10-CM?
There are 9 diagnosis codes; as long as one diagnosis code meets the definition, then disease=1.
((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)
thanks
If the expression you want to use is the one you gave, just write:
DISEASE=prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',DIAGNOSIS_CODE);
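or, if DISEASE should be a 0/1 flag rather than the position of the match (prxmatch returns the position where the match starts, or 0 if there is none):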
DISEASE=prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',DIAGNOSIS_CODE)>0;
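To see how the pattern behaves on individual values, a small test step along these lines may help. It is only a sketch: the three sample codes are illustrative values written without periods (an assumption about how the data are stored), and the variable names code, pos and disease are made up for the example.

```sas
data _null_;
  length code $7;
  input code $;
  /* prxmatch returns the 1-based position of the first match, or 0 if there is no match */
  pos = prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/', code);
  /* collapse the position into a 0/1 flag */
  disease = (pos > 0);
  /* write the result for each sample code to the log */
  put code= pos= disease=;
  datalines;
T401X1A
T3691XA
S42201A
;
run;
```

Checking the log for a handful of codes whose expected result is known is a quick way to confirm the expression does what the definition intends before applying it to the full dataset.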
It is highly unlikely that many posters on the Community know anything about ICD-10-CM. I don't. If you could explain what the 9 diagnosis codes are and provide some more example data with expected outcomes, you will have a better chance of getting suitable answers.
thank you.
The data contain:
ID diagnosis1 diagnosis2 diagnosis3 diagnosis4 diagnosis5 diagnosis6 diagnosis7 diagnosis8 diagnosis9
1 T12340 T1235 S12400 S12340 T123 T1256 S12345 S13456 T567
The definition for disease is
((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)
As long as one of the diagnoses meets the definition, the case is disease=1.
As @SASKiwi already said, it is hard to build a regex if you don't know exactly what you are looking for (what the boundaries of the codes are and their general structure).
But even better than getting an answer is trying to build it yourself. Developing the pattern with this site is quite intuitive: https://regexr.com/38ed7
If you are not familiar with the basic concepts of regex I recommend this blog post: https://www.janmeppe.com/blog/regex-for-noobs/
In SAS, the function you would want to use is the prxmatch function (https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0bj9p4401w3n9n1gmv6tf**bleep**9...).
Also a great resource about RegEx in SAS is the regex tip sheet by the SAS Support Team: https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf
I hope this helps you!
thanks for the resources. very helpful.
thank you!
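Here is the code in case someone else needs it.
data want;
set datahave;
array injurydx[9] $7 DIAG1-DIAG9; /* diagnosis 1-9 variables */
DISEASE=0; /* start at 0 for every observation */
do i = 1 to 9;
/* set the flag when any diagnosis matches; assigning prxmatch() directly would let a later non-matching diagnosis overwrite an earlier match */
if prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',injurydx[i]) then DISEASE=1;
end;
drop i;
run;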
@xinyao2019 wrote:
How do I write a regular expression to identify cases coded using ICD-10-CM?
There are 9 diagnosis codes; as long as one diagnosis code meets the definition, then disease=1.
((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)
thanks
I know just enough about ICD-10 codes to be dangerous.
One of the recurring issues on this forum regarding ICD-10 and ICD-9 coding is that different organizations store the values slightly differently: some use periods, others use _ instead of periods, and I believe at least one organization uses custom informat / format pairs to create numeric ICD variables so that specific sort orders are used.
So you should show some of the complete values that you are searching for.
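If the stored values do turn out to contain periods or underscores, one option is to normalize them before matching. The step below is a minimal sketch, assuming the nine diagnosis variables are character variables named DIAG1-DIAG9 in a dataset called datahave (the names used elsewhere in this thread); the output name normalized is made up for the example.

```sas
data normalized;
  set datahave;
  array injurydx[9] $ DIAG1-DIAG9;   /* existing character diagnosis variables */
  do i = 1 to 9;
    /* remove periods, underscores and blanks, then convert to uppercase */
    injurydx[i] = upcase(compress(injurydx[i], '._ '));
  end;
  drop i;
run;
```

After this step a value such as T40.1X1A would be stored as T401X1A, which is the form the regular expression in the question expects.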
Another approach that has been used is a custom format for the specific disease, which allows something like
If put(icdvar,$customformatname.) = '1' then ...
moving the logic out to proc format.
If by any chance you actually have all of the codes in a data set, it is very easy to create a CNTLIN data set for Proc format to create the needed format.
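A rough sketch of that CNTLIN approach is shown here. Everything named in it is an assumption for illustration: the lookup dataset disease_codes, its two sample codes (which are not the full disease definition), the format name DISEASEFMT, and the input dataset datahave with variables DIAG1-DIAG9.

```sas
/* Hypothetical lookup table: one row per code that defines the disease */
data disease_codes;
  length code $7;
  input code $;
  datalines;
T401X1A
T3691XA
;
run;

/* CNTLIN dataset: each listed code maps to '1'; a final OTHER row maps everything else to '0' */
data cntlin;
  length fmtname $32 type $1 start $7 label $1 hlo $1;
  retain fmtname 'DISEASEFMT' type 'C';
  set disease_codes(rename=(code=start)) end=last;
  label = '1';
  hlo = ' ';
  output;
  if last then do;
    hlo = 'O';        /* HLO='O' marks the OTHER range; START is ignored for this row */
    label = '0';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=cntlin;
run;

/* Flag the case if any of the nine diagnoses formats to '1' */
data want;
  set datahave;
  array injurydx[9] $ DIAG1-DIAG9;
  DISEASE = 0;
  do i = 1 to 9;
    if put(injurydx[i], $diseasefmt.) = '1' then DISEASE = 1;
  end;
  drop i;
run;
```

The trade-off is that the format needs every qualifying code listed explicitly, whereas the regex describes them by pattern; in exchange, the list is easy to inspect and audit.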
One concern I have with regex and ICD-10 is that some code groups have so many levels that, as you are finding, it may not be easy to write or interpret an expression that catches all of the cases involved.
I did not know that different organizations implement ICD-10-CM slightly differently.
We have the ICD-10-CM codes in our dataset as character values.
Here is how it looks:
Obs DIAG1
1 I130
2 J189
3 S42201A
4 A047
5 J440
6 K5720
7 J189
8 O701
9 Z3800
10 Z3800