I'm working with medical claims data. For each healthcare encounter, up to 25 different diagnosis codes can be present. There are about 30 different health conditions that I need to flag where ever they show up within the data (it doesn't matter which of the 25 diagnosis variables it matches). Each health condition has several (usually 3-7) different diagnosis codes. The list of health condition codes were given to me in an Excel file with a separate tab for each health condition of interest. I then imported each of those tabs as a separate SAS file.
In the example file below, I only added 5 diagnosis variables and 2 health conditions for the sake of clarity. The initial data file doesn't have the flag variables, and the output file should have the flag variables.
HEALTHCARE_ENCOUNTER_ID | icd_dx1 | icd_dx2 | icd_dx3 | icd_dx4 | icd_dx5 | CANCER_FLAG | PREGNANCY_FLAG |
12345 | 9993 | 2255 | 3366 | 4477 | 5588 | 1 | 0 |
23456 | 2358 | 5899 | 8881 | 7777 | 8850 | 0 | 1 |
34567 | 8546 | 5468 | 4811 | 8883 | 9991 | 1 | 1 |
45678 | 8326 | . | . | . | . | 0 | 0 |
CODE | DESCRIPTION |
9991 | Cervical cancer |
9992 | Pelvic cancer |
9993 | Breast cancer |
CODE | DESCRIPTION |
8881 | Normal pregnancy |
8882 | Atopic pregnancy |
8883 | Preclampsia |
I haven't gotten very far with my syntax. What I have so far attempts to compare the range of diagnosis codes against the list of codes for each condition. This is within a macro (where &sheet. represents the condition name), so it will create a separate file for each condition, and then I will have to merge all the files together (along with all healthcare episodes that didn't match on any of the conditions). It's clunky and I'm not sure if a range will work in a Proc SQL step:
PROC SQL;
Create table temp.flag_&sheet.
as Select *
From temp.healthcare_encounters
WHERE icd_dx1-25 in (select HEDIS_DX_code from temp.Hyst_Codes_&sheet.);
QUIT;
In the syntax above, "HEDIS_DX_code" should have been changed to "Code".
Hint: Provide some example data in the form of a data step that behaves like yours as far as mixes of values and all the rules about what is kept. The data step so we don't have to guess about variable types.
Then work through that (should be a small enough set to do manually) and show use the desired result.
Note: multiple similar value like your DX variables and applying similar rules to each variable typically points to an ARRAY based solution which means Data step as SQL doesn't provide any short cut ways to handle list of variables.
Turning the Excel lists into Formats would be one way approach something like this.
data have; input HEALTHCARE_ENCOUNTER_ID $ icd_dx1 icd_dx2 icd_dx3 icd_dx4 icd_dx5; datalines; 12345 9993 2255 3366 4477 5588 23456 2358 5899 8881 7777 8850 34567 8546 5468 4811 8883 9991 45678 8326 . . . . 0 0 ; proc format library=work; value iscancer 9991, 9992, 9993='Is Cancer' other= 'Not Cancer' ; value ispregnant 8881, 8882,8883 = 'Is Pregnant' other= 'Not pregnant' ; run; data want; set have; array dx (*) Icd_dx: ; do i=1 to dim(dx); IsCancer = max(IsCancer, put(dx[i],iscancer.)='Is Cancer'); IsPregnant = max(IsPregnant, put(dx[i],IsPregnant.)='Is Pregnant'); end; drop i; run;
Datasets can create formats if the list of values and/or number of lists is "large" and don't want to type out all the formats.
The biggest trick involved once you have the formats is the Max() statement. SAS will return 1 for true or 0 for false for the "put(variable,format.)="some text") result. The max means that largest value created gets kept as the variables get parsed.
Note that the only thing need to process more variables is to have them on the ARRAY statement. So if you have a hundred variables named with the Icd_dx convention they all get processed with the Icd_dx: list. All variable starting with Icd_dx would be included. Or if you want a subset of them Icd_dx1 - Icd_dx30 for example. These variable lists coupled with the Array are why I would not attempt to use SQL.
If you place the format into a permanent library, possibly where you are working with this data and that that library to the FMTSEARCH path then the formats are permanent and you don't need to run the code every time to use them.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.