Hello,
Rationale: searching through patient records for specific ICD-9 codes (char) using prxmatch() and flagging for disease condition. Data looks like this:
id | icd9_1 | icd9_2 | icd9_3 |
---|---|---|---|
1 | 153.4 | 285.1 | 427.31 |
2 | 578.1 | 584.5 | 570 |
Each observation has a different number of populated icd9_{i} variables; the maximum is icd9_39. What I want is the following:
id | icd9_1... | ...icd9_39 | Cancer | AIDS |
---|---|---|---|---|
1 | 153.4 | | 1 | 0 |
2 | 578.1 | 276.2 | 0 | 0 |
My Code:
data want;
array icd9_{39} $ icd9_1-icd9_39;
cancer = 0;
do i = 1 to 39;
if prxmatch("/140/", icd9_{i}) or prxmatch("/141/", icd9_{i}) or ...
then cancer = 1;
end;
aids = 0;
do i = 1 to 39;
if prxmatch("/042/", icd9_{i}) or prxmatch("/043/", icd9_{i}) or ...
then aids = 1;
end;
set have;
run;
This works perfectly for me except that it skips the first observation and puts the icd9 codes from the first observation into the second observation's disease variables, then the 2nd into the 3rd, the 3rd into the 4th, and so on.
Question is: How do I fix this? I tried messing with my "i" value and haven't found a solution. I'll keep tinkering.
Have you tried moving your set statement to the top, right after the data statement?
Never mind, everyone, I figured it out. Apparently, if you put your set statement at the end instead of right below the data statement, you get the problem I've described. If anyone can explain why that is, I'd be grateful.
You'll have to switch gears related to how you think about the SET statement. It is not just a label that tells you where the data comes from. It is an executable statement. During the course of a DATA step, it executes many times and each time it reads in the next observation from the incoming SAS data set.
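To see that SET really is executable, a small experiment along these lines may help. It is only a sketch: the data set have_demo and the variable x are made up for illustration and are not part of the original question.

```sas
/* Hypothetical two-row data set, used only for this demonstration */
data have_demo;
    input x;
    datalines;
10
20
;
run;

data _null_;
    /* Runs BEFORE the read: x still holds whatever the previous
       iteration's SET left behind (missing on the first pass) */
    put "before SET: " _n_= x=;
    set have_demo;
    /* Runs AFTER the read: x now holds the current observation */
    put "after SET:  " _n_= x=;
run;
```

In the log, the "before SET" line on each pass should show the value of x read on the previous pass, because variables that come from a SET statement are retained rather than reset to missing at the top of each iteration.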
In that light, consider what happens on your first observation (in your original, uncorrected DATA step). You calculate AIDS and CANCER, then read in the first observation from the incoming data, and finally output the result. So the calculated values will be 0. Then the DATA step continues. It calculates AIDS and CANCER based on the current data values (which came from the first observation, and are retained in memory). Then it reads in the second observation, and outputs the final result. So AIDS and CANCER are, as you observed, based on the first observation but the final data values come from the second observation.
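For reference, a corrected version of the step might look roughly like the sketch below. It makes a few assumptions that are not in the original post: the two loops are merged into one, the separate prxmatch() calls are combined with `|` alternation, and `drop i;` is added to keep the loop counter out of the output. Only the codes shown in the question (140, 141, 042, 043) appear here; the rest of each list was elided in the post and would have to be added by the asker.

```sas
data want;
    set have;                          /* read the observation FIRST */
    array icd9_{39} $ icd9_1-icd9_39;

    cancer = 0;
    aids   = 0;
    do i = 1 to 39;
        /* Only the codes visible in the question are listed here;
           extend each pattern with the remaining codes as needed */
        if prxmatch("/140|141/", icd9_{i}) then cancer = 1;
        if prxmatch("/042|043/", icd9_{i}) then aids = 1;
    end;

    drop i;                            /* assumption: counter not wanted in output */
run;
```

Because SET now executes before the flags are calculated, CANCER and AIDS are computed from the same observation that is written out at the end of the iteration.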
It's a medium-complex process, and there are a host of related topics. For example, you could scour the documentation to study the difference between the compilation phase vs. the execution phase of the DATA step. But the description above is probably the most relevant to understanding the results you saw.
Good luck.
Thank You for Answering. I forget these details at times.