Hello,
I am stuck with categorizing patients based on their first diagnosis. Here is some background information about my project.
I am interested in long-term risk of cardiovascular diseases (CVDs) among women.
I will utilize 4 data sources to identify women with CVD diagnosis: Emergency department, Hospital discharge, Death certificate, and Medical claims. I pulled ICD codes and service dates from each data source and flagged them. For example, flag_isch_dc indicates an ischemic heart disease diagnosis from the death certificate while flag_isch_hds indicates an ischemic heart disease diagnosis from the hospital discharge data.
The data is at the claims level, and I eventually want to bring it to the patient level. Before doing so, I will need to create a single CVD category. Given that I am using 4 different data sources and 4 different CVD diagnosis flags and dates, I am stuck with correctly ordering them because some women have diagnoses captured in one data but not in another.
The patient level id is Row_id2. However, if I remove duplicates, I am losing a lot of women with their diagnosis.
So, I want to correctly categorize them without losing any participants.
I initially wanted to categorize women whichever the earliest diagnosis was. And I want to keep that date for the first diagnosis as I will use Cox-proportional hazard regression.
To simplify further, I have 4 different flags with 4 different dates for ischemic heart disease. Example:
Date of Diagnosis | Ischemic | Cerebrovascular | Hypertension | Other heart | Other CVD | Final CVD grouping | Date of diagnosis from final CVD grouping | |
Medical Claims | MC059_MC | flag_isch_mc | flag_cero_mc | flag_hypt_mc | flag_ohrt_mc | flag_ocvd_mc | ||
Emergency Department | Service_from_Dt_ER | flag_isch_er | flag_cero_er | flag_hypt_er | flag_ohrt_er | flag_ocvd_er | ||
HDS | Service_from_Dt_HDS | flag_isch_hds | flag_cero_hds | flag_hypt_hds | flag_ohrt_hds | flag_ocvd_hds | ||
Death Certificate | death_date_DC | flag_isch__dc | flag_cero__dc | flag_hypt_dc | flag_ohrt_dc | flag_ocvd_dc |
I want to group them (Final CVD grouping) in a way that woman will be classified based on her earliest diagnosis in these 4 files. Then, I should be able to bring it to patient-level data.
I attached a SAS file with 60 observations with duplicates.
I greatly appreciate it if you have any suggestions on how to achieve the correct classification and then bring it back to patient-level data.
Thank you for your time and help.
That means the reshaping should only include output when your diagnosis occurs. Again as I said, your example data doesn't show any so it hard to deal with.
That being the case I would probably reshape to include a separate field for diagnosis and only keep those. The dates and type of diagnosis would be relatively easy to deal with. But the rules would need to be expanded upon.
Something like this would have ONLY the diagnosis, the TYPE variable holding the information as to which.
Then a sort by Id date type would have the positive diagnosis in order.
data reshape; set dl.sas_try; array d (*)mc059_mc death_date_dc service_from_date_er service_from_date_hds; array fi (*) flag_isch_mc flag_isch_dc flag_isch_er flag_isch_hds; array fc (*) flag_cero_mc flag_cero_dc flag_cero_er flag_cero_hds; array fh (*) flag_hypt_mc flag_hypt_dc flag_hypt_er flag_hypt_hds; array fo (*) flag_ohrt_mc flag_ohrt_dc flag_ohrt_er flag_ohrt_hds; array fv (*) flag_ocvd_mc flag_ocvd_dc flag_ocvd_er flag_ocvd_hds; array s (4) $3 _temporary_ ('MC','DC','ER','HDS'); do i=1 to dim(d); source = s[i]; date = input(d[i],yymmdd10.); flag_isch =fi[i]; flag_cero =fc[i]; flag_hypt =fh[i]; flag_ohrt =fo[i]; flag_ocvd =fv[i]; if not missing (date) then do; /* dummy logic as not clear from your data*/ /* if f[i] indicates diagnosis then do; type='ISCH'; output; end; /* /* repead for each flag*/ end; end; format date yymmdd10.; keep Row_id2 source date type; flag_ohrt flag_ocvd; run;
Depending on your complete set of rules the reshaped data may need more manipulation.
Part of the moral of this story is example data should actually be representative and have the information to actually apply any algorithm for creating other values.
First thing I would do is make sure the DATE variables are actual SAS dates. While you can sort character values in the form of yyyy-mm-dd and get an order it is very hard to actually compare such values.
Your attached example data has no values of any of the flag other than 0 or missing as in all values of each flag have exactly one value either 0 or missing. So it is going to be very hard to do anything with that example.
You also do not describe in any way how you expect to assign a single CVD category.
Are you wanting the first date of diagnosis by data source or by diagnosis?
This is how I would reshape the data get all the flags by data source:
data reshape; set dl.sas_try; array d (*)mc059_mc death_date_dc service_from_date_er service_from_date_hds; array fi (*) flag_isch_mc flag_isch_dc flag_isch_er flag_isch_hds; array fc (*) flag_cero_mc flag_cero_dc flag_cero_er flag_cero_hds; array fh (*) flag_hypt_mc flag_hypt_dc flag_hypt_er flag_hypt_hds; array fo (*) flag_ohrt_mc flag_ohrt_dc flag_ohrt_er flag_ohrt_hds; array fv (*) flag_ocvd_mc flag_ocvd_dc flag_ocvd_er flag_ocvd_hds; array s (4) $3 _temporary_ ('MC','DC','ER','HDS'); do i=1 to dim(d); source = s[i]; date = input(d[i],yymmdd10.); flag_isch =fi[i]; flag_cero =fc[i]; flag_hypt =fh[i]; flag_ohrt =fo[i]; flag_ocvd =fv[i]; if not missing (date) then output; end; format date yymmdd10.; keep Row_id2 source date flag_isch flag_cero flag_hypt flag_ohrt flag_ocvd; run;
and to get the first by the source:
Proc sort data=reshape; by Row_id2 source date; run; data maybe; set reshape; by Row_id2 source date; if first.source; run;
I suspect that you may actually need a different reshaping and only including in the output something where one or more of those flags indicate a diagnosis but your example data, since none of the flag values vary, doesn't have anything that seems likely.
Thank you for your prompt response @ballardw .
Unfortunately, CVD (both overall and subgroups) is rare in my study population, which is why I have so many missing and 0's.
I want to categorize them by their earliest diagnosis. Let's say a woman appears to have a flag for ischemic heart disease (flag_isch_ER) in emergency department data (ER) and also has flag for cerebrovascular disease (flag_cero_hds) in hospital discharge data (HDS). I want this woman to be categorized based on "the first date of diagnosis by diagnosis". So, if the first date of diagnosis for "flag_isch_ER" is earlier than "flag_cero_hds" then I want to have her to be categorized under "ischemic heart disease group."
I am trying the code right now and will update you shortly.
That means the reshaping should only include output when your diagnosis occurs. Again as I said, your example data doesn't show any so it hard to deal with.
That being the case I would probably reshape to include a separate field for diagnosis and only keep those. The dates and type of diagnosis would be relatively easy to deal with. But the rules would need to be expanded upon.
Something like this would have ONLY the diagnosis, the TYPE variable holding the information as to which.
Then a sort by Id date type would have the positive diagnosis in order.
data reshape; set dl.sas_try; array d (*)mc059_mc death_date_dc service_from_date_er service_from_date_hds; array fi (*) flag_isch_mc flag_isch_dc flag_isch_er flag_isch_hds; array fc (*) flag_cero_mc flag_cero_dc flag_cero_er flag_cero_hds; array fh (*) flag_hypt_mc flag_hypt_dc flag_hypt_er flag_hypt_hds; array fo (*) flag_ohrt_mc flag_ohrt_dc flag_ohrt_er flag_ohrt_hds; array fv (*) flag_ocvd_mc flag_ocvd_dc flag_ocvd_er flag_ocvd_hds; array s (4) $3 _temporary_ ('MC','DC','ER','HDS'); do i=1 to dim(d); source = s[i]; date = input(d[i],yymmdd10.); flag_isch =fi[i]; flag_cero =fc[i]; flag_hypt =fh[i]; flag_ohrt =fo[i]; flag_ocvd =fv[i]; if not missing (date) then do; /* dummy logic as not clear from your data*/ /* if f[i] indicates diagnosis then do; type='ISCH'; output; end; /* /* repead for each flag*/ end; end; format date yymmdd10.; keep Row_id2 source date type; flag_ohrt flag_ocvd; run;
Depending on your complete set of rules the reshaped data may need more manipulation.
Part of the moral of this story is example data should actually be representative and have the information to actually apply any algorithm for creating other values.
Thank you. The latter code was helpful.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.