xinyao2019
Calcite | Level 5

How do I write a regular expression to identify cases coded using ICD-10-CM?

There are 9 diagnosis codes per case. As long as one diagnosis code meets the definition below, then disease=1.

((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)

 

thanks 

1 ACCEPTED SOLUTION

ChrisNZ
Tourmaline | Level 20

If the expression you want to use is the one you gave, just write:

 

DISEASE=prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',DIAGNOSIS_CODE);

or, if you want a 0/1 flag rather than the position of the match:

 

DISEASE=prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',DIAGNOSIS_CODE)>0;
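
To see what the pattern actually matches, it can help to run it against a few hand-picked codes first. The following is a minimal sketch, not part of the original answer: the five sample codes are chosen only for illustration, and `re`, `code`, and `flag` are hypothetical names. It compiles the pattern once with prxparse and reuses the compiled pattern for every row.

```sas
data _null_;
  length code $7;
  retain re;                     /* keep the compiled pattern id across rows */
  if _n_ = 1 then re = prxparse(
    '/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/');
  input code $;
  flag = prxmatch(re, code) > 0; /* 1 if the code matches anywhere, else 0 */
  put code= flag=;               /* write the result to the log */
  datalines;
T401X1A
T3691XA
T401X1D
T401X5A
S42201A
;
run;
```

With this pattern I would expect the first two codes to be flagged and the other three not: T401X1D ends in D rather than A, T401X5A has a 5 in the [1-4] position, and S42201A does not start with T36-T50.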

 

 


8 REPLIES
SASKiwi
PROC Star

It is highly unlikely that many posters on the Community know anything about ICD-10-CM. I don't. If you could explain what the 9 diagnosis codes are, and provide some more example data with expected outcomes, you will have a better chance of getting suitable answers.

xinyao2019
Calcite | Level 5

thank you.

The data contain:

ID  diagnosis1  diagnosis2  diagnosis3  diagnosis4  diagnosis5  diagnosis6  diagnosis7  diagnosis8  diagnosis9
1   T12340      T1235       S12400      S12340      T123        T1256       S12345      S13456      T567

The definition for disease is:

((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)

As long as one of the diagnoses meets the definition, the case is disease=1.

 

 

Criptic
Lapis Lazuli | Level 10

As @SASKiwi already said, it is hard to build a regex if you don't know exactly what you are looking for (what the boundaries of the codes are and what their general structure is).

 

But even better than getting an answer: try to build it yourself. Developing the pattern with this site is super intuitive: https://regexr.com/38ed7

 

If you are not familiar with the basic concepts of regex I recommend this blog post: https://www.janmeppe.com/blog/regex-for-noobs/

 

In SAS the function you would want to use is the prxmatch function(https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0bj9p4401w3n9n1gmv6tf**bleep**9...).

 

Also a great resource about RegEx in SAS is the regex tip sheet by the SAS Support Team: https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

 

I hope this helps you!

xinyao2019
Calcite | Level 5

thanks for the resources. very helpful. 

xinyao2019
Calcite | Level 5

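Thank you!

Here is the code in case someone else needs it.

data want;
set datahave;
array injurydx[9] $7 DIAG1-DIAG9; /* diagnosis 1-9 variables */
DISEASE=0; /* reset the flag for each observation */
do i = 1 to 9;
/* stop at the first match so that later diagnoses cannot overwrite the flag */
if prxmatch('/((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)/',injurydx[i]) > 0 then do;
DISEASE=1;
leave;
end;
end;
drop i;
run;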

ballardw
Super User

@xinyao2019 wrote:

how to write regular expression to identify cases coded using ICD-10-CM? 

there are 9 diagnosis codes, as long as one diagnosis code meets the definition, then disease=1

((T3[679]9|T414|T427|T4[3579]9)[1-4].|(?!(T3[679]9|T414|T427|T4[3579]9))(T3[6-9]|T4[0-9]|T50)..[1-4])(A|$|\b)

 

thanks 


I know just enough about ICD-10 codes to be dangerous.

 

One of the recurring issues on this forum regarding ICD-10 and ICD-9 coding is that different organizations store the values slightly differently: some use periods, others use _ instead of periods, and I believe at least one organization uses custom informat / format pairs to create numeric ICD variables so that specific sort orders are used.

So you should probably show some of the complete values that you are searching for.
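
If the codes might arrive with periods or underscores, one option is to normalize them before applying any pattern. This is only a sketch under that assumption: the three sample values and the names `raw` and `clean` are hypothetical, and it assumes that removing periods and underscores and upper-casing is enough to reach the compact form the pattern expects.

```sas
data _null_;
  length raw clean $8;
  input raw $;
  /* upper-case the value, then remove periods and underscores */
  clean = compress(upcase(raw), '._');
  put raw= clean=;               /* write both versions to the log */
  datalines;
T40.1X1A
t40_1x1a
T401X1A
;
run;
```

All three sample values should come out as T401X1A, which can then be passed to prxmatch.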

 

Another approach that has been used is a custom format for a specific disease, which allows such things as

 

If put(icdvar,$customformatname.) = '1' then ...

moving the logic out to proc format.

 

If by any chance you actually have all of the codes in a data set it is very easy to create a CNTLIN data set for Proc format to create the needed format.
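
As a rough sketch of that idea, assuming the qualifying codes are stored in a data set: `icd_list`, its variable `code`, the two sample codes, and the format name `$disease.` are all hypothetical names introduced for illustration, while `datahave` and `DIAG1` are taken from the code posted earlier in the thread. The CNTLIN data set maps each listed code to '1' and everything else to '0'.

```sas
/* hypothetical lookup table of codes that define the disease */
data icd_list;
  input code $7.;
  datalines;
T401X1A
T402X1A
;
run;

/* build the control data set that PROC FORMAT reads */
data cntlin;
  set icd_list end=last;
  retain fmtname 'disease' type 'C';  /* character format $disease. */
  start = code;
  label = '1';
  output;
  if last then do;
    hlo = 'O';                        /* "other": every code not listed */
    start = ' ';
    label = '0';
    output;
  end;
  keep fmtname type start label hlo;
run;

proc format cntlin=cntlin;
run;

data want;
  set datahave;
  DISEASE = (put(DIAG1, $disease.) = '1');  /* 1 if DIAG1 is in the list */
run;
```

This checks a single variable; to cover all nine diagnoses, the same put() test could be placed inside an array loop. The trade-off is that the list must enumerate every qualifying code explicitly instead of describing them with a pattern.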

 

One of my concerns with regex and ICD-10 is that some code groups have so many levels that, as you are finding, it may not be easy to read those expressions and confirm that they catch all of the cases involved.

xinyao2019
Calcite | Level 5

I did not know that different organizations implement ICD-10-CM slightly differently.

We have the ICD-10-CM codes in our dataset as character variables.

Here is how they look:

Obs  DIAG1
1    I130
2    J189
3    S42201A
4    A047
5    J440
6    K5720
7    J189
8    O701
9    Z3800
10   Z3800
