you are right.
However, I need to modify my variable from the original dataset, so those values like
a287.8 a286.3 a298.90
will be truncated as
a28 a28 a29
therefore I have duplicate records
you see, if I do not truncate, I cannot try your solution, since the values are not exactly what the reference tables are, for example, the reference list only list
a287, a287.0
but a287.8 should be of course considered as icd10 as well
and some are like
v23.34 v1.34
so i decide only to get the first three then match
Why can't you remove the duplicates?
If you truncate to 3 chars that's a category, if you try and merge on that you'll have multiple matches and your table will have multiple matches. Also, you could have incorrect matches if the values after aren't valid.
Why not expand to the full list of reference values. I'm sure there's a list on CMS website.
I'm fairly certain that most DBs remove the decimal and list the values as A2878.
I think @Reeza is right that a PROC FORMAT approach will be easiest. Formats allow simple look-ups like you need. And if you are working with ICD9 codes, you probably already have formats that labels them all (or if you don't, it's worth building the formats).
Below is a small example of using formats to look up each code in HAVE to assign it a label per ICD9, and a label per ICD10. Of course you can add logic to catch codes that have both ICD9 and ICD10 labels, or neither. But it's a starting point.
If you need to build the formats from existing data, then PROC FORMAT CNTLIN is the way to go (not shown below).
data have; input id $1. icd : $3.; cards; 1 E01 2 E02 3 E03 4 E33 5 E99 ; run; proc format; value $ICDnine "E33"="eye dise" "E03"="unspecific dis" other=" " ; value $ICDten "E01"="Iodine-deficiency related thyroid disorders and allied conditions" "E02"="Subclinical iodine-deficiency hypothyroidism" "E03"="Other hypothyroidism" other=" " ; run; data want; set have; length ICD9text $100 ICD10text $100; ICD9text=put(icd,$ICDnine.); ICD10text=put(icd,$ICDten.); put (icd icd9text icd10text)(=); run;
Returns in log:
199 data want; 200 set have; 201 length ICD9text $100 ICD10text $100; 202 ICD9text=put(icd,$ICDnine.); 203 ICD10text=put(icd,$ICDten.); 204 put (icd icd9text icd10text)(=); 205 run; icd=E01 ICD9text= ICD10text=Iodine-deficiency related thyroid disorders and allied conditions icd=E02 ICD9text= ICD10text=Subclinical iodine-deficiency hypothyroidism icd=E03 ICD9text=unspecific dis ICD10text=Other hypothyroidism icd=E33 ICD9text=eye dise ICD10text= icd=E99 ICD9text= ICD10text= NOTE: There were 5 observations read from the data set WORK.HAVE. NOTE: The data set WORK.WANT has 5 observations and 4 variables.
thank you
my original dataset contains values like A2.89, K299.47, 034.568,
They do not quite match the reference dataset, which already remove "."
i need some more flexible ways, which means, do I have to modify the value of reference dataset and my have dataset first, then use CNTLIN?
Yes, for the PROC FORMAT approach (and hash approach) the values in your HAVE data must match the values defined in the format (or hash table key) exactly. So you could clean the values to make them consistent. Or if you want FUZZY matching, would need a different approach.
@Bal23 wrote:
thank you
my original dataset contains values like A2.89, K299.47, 034.568,
They do not quite match the reference dataset, which already remove "."
i need some more flexible ways, which means, do I have to modify the value of reference dataset and my have dataset first, then use CNTLIN?
I do a very small amount of work with ICD codes but your comment about "already remove "."" makes me very concerned as those decimals are very significant and a remember some where it is possibly to have A2.8 and A28. type structures. I would question the decision about creating any type of reference table without the decimals.
And for formats I might also recommend, though it is more work, to investigate MULTILABEL formats. When used with the procedures that support them it makes reports of the form "I want all injuries to feet with a total AND the break out by subcategories of foot injuries as well" easier.
thanks
your are right
for the current project, i only need to tell whether it is icd9 or 10, so A2.8, OR A28.9 are both icd-10, therefore this is not my concern for this
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.