Hi everyone,
I have a programing question regarding how to identify a condition by two claims and capture the earlest date of diagnosis here. I'm hopeing someone could help me come up with an efficient way of programming to achive the goal, because i have 21 conditions need to be captured using the same claim data.
In my dataset (please see the attachement for example), each patient might have different number of claims and each row represents a single claim with a claim id. I need to use the diagnosis code, dx1-dx12 to identify whether a patient has, for example, hypertension (if any of the dx = : ('401', '402', '403') ). If the patient has hypertension, then i need to pick up the earlest date of diagnosis, using the corresponding claim date. In the final dataset, each patient will be flaged as hypertension (0/1), depression(0/1), and so forth, with the earlest date of diagnosis for each, if they have the target condition.
I will be greatly appreciated if someone could share your thoughts on how to program this efficiently.
Thank you!
Someone in this wonderful SAS community can easily help you, If you show a sample have and want datasets.
Many users here don't want to download Word files because of virus potential, others have such things blocked by security software. Also if you give us Word we have to create a SAS data set and due to the non-existent constraints on data cells the result we end up with may not have variables of the same type (numeric or character) and even values.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the {i} icon or attached as text to show exactly what you have and that we can test code against. Desired output datasets should be provided in a similar fashion.
data TEMP.SAMPLE1;
infile datalines dsd truncover;
input ICD_DGNS_CD1:$7. ICD_DGNS_CD2:$7. ICD_DGNS_CD3:$7. ICD_DGNS_CD4:$7. ICD_DGNS_CD5:$7. ICD_DGNS_CD6:$7. ICD_DGNS_CD7:$7. ICD_DGNS_CD8:$7. ICD_DGNS_CD9:$7. ICD_DGNS_CD10:$7. ICD_DGNS_CD11:$7. ICD_DGNS_CD12:$7. claim_dt:YYMMDDN8. id:32. clm_id:32.;
datalines4;
7812,53081,7243,4011,,,,,,,,,20111226,1,50
7812,53081,7243,4011,,,,,,,,,20111226,1,51
7812,53081,7243,4011,,,,,,,,,20111226,1,52
7812,53081,7243,4011,,,,,,,,,20111226,1,54
;;;;
You can try something like this:
1. convert your array to a long form (one obs. for each diagnosis and claim):
data long; set have; array diagnoses dx1-dx12; do _N_=1 to dim(diagnoses); if missing(diagnoses(_N_)) then continue; dx=diagnoses(_N_); output; end; drop dx1-dx12; run;
If the diagnoses are always "left-aligned" (meaning that once you find a missing diagnosis, the rest of the array will be missing too), you can safely replace the CONTINUE statement with a LEAVE statement, getting some improvement in performance.
2. Sort the data and find the earliest date for each diagnosis by patient:
proc sort data=long; by patient_id dx claim_date; run; data want; set long; by patient_id dx; if first.dx; run;
Thank you for sharing your valuable thoughts. I like your idea of improving efficiency by using long data instead. I'd like to appliy your logic to my programming when a target condition is defined by capturing any of the specified dignosis code (icd9) in one claim. But in this particular project, I would need to have at least two claims with any of the target icd9 code, let's say, any icd9 code starts with '401','402','403', '404', '405'. The same rule applies to the definition of the rest of conditions of interest. The dataset that i have is at the claim level, meaning each patient could have multiple claims. The dataset that i want will be at the patient level, with an indicator of each condition of interst, along with the earliest indication of dignosis data (which is the claim data in the original dataset). Thank you for your help, which is greatly appreciated!
You could use a format to group your codes, like
proc format; value $diaggrp '401','402','403', '404', '405'='Severe dyslexia' '102','103'='Constipation' '803','335'='Uncommon cold' ; run; data long; set have; array diagnoses dx1-dx12; do _N_=1 to dim(diagnoses); if missing(diagnoses(_N_)) then continue; dx=diagnoses(_N_); group=put(dx,$diaggrp.); output; end; drop dx1-dx12; run;
Then, use the groups instead of the original diagnosis numbers. To make sure that you have two claims in a group, just check that the observation is not the last of its kind:
proc sort data=long nodupkey;
by patient_id group claim_id;
run;
proc sort data=long; by patient_id group claim_date; run; data want; set long; by patient_id group; if first.group and not last.group; run;
I put the first proc sort with the NODUPKEY in to make sure that if you had two DX values in the same group on the same claim, that would only count as one claim of this type. The second sort may be dropped if you are sure that CLAIM_ID values are always in chronological order.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.