I'm working with medical claims data, and I'm trying to create a file that will show the medical histories for a select group of kids. For each claim date for each kid, I need a list of diagnosis, procedure, and revenue codes. But many of the records do not include any new information. In each of the 2 examples below, the middle line contains information that is already present in the other lines. So the middle line is redundant and could be deleted.
encrypted_id | DATE_OF_BIRTH | GENDER | CLAIM_SERVICE_FROM_DATE | DIAGNOSIS_CODE_t | PRCDR_CODE | REVENUE_CODE |
02D957E3EDF5 | 23-Aug-05 | F | 26-Apr-16 | H52223 | V2715 | 0611 |
02D957E3EDF5 | 23-Aug-05 | F | 26-Apr-16 | H5203 | V2715 | |
02D957E3EDF5 | 23-Aug-05 | F | 26-Apr-16 | H5203 | V2020 | |
encrypted_id | DATE_OF_BIRTH | GENDER | CLAIM_SERVICE_FROM_DATE | DIAGNOSIS_CODE_t | PRCDR_CODE | REVENUE_CODE |
14A17629B53944 | 31-Jul-07 | F | 20-Apr-17 | R480 | 99174 | 0610 |
14A17629B53944 | 31-Jul-07 | F | 20-Apr-17 | Z6852 | 99174 | |
14A17629B53944 | 31-Jul-07 | F | 20-Apr-17 | Z6852 | 99214 |
Based on an approach I found in another post, I tried the code below. But for some reason, the rev_new variable contained diagnosis codes and DX_new contained rev codes. And there was still redundant information in the output file.
DATA temp.numer_ID_codesort; SET temp.numer_ID_list_dd;
LENGTH DX_new proc_new rev_new $10;
DX_new = DIAGNOSIS_CODE_t;
proc_new = PRCDR_CODE;
rev_new = REVENUE_CODE;
CALL SORTC(rev_new,proc_new,DX_new);
RUN;
PROC SORT DATA = temp.numer_ID_codesort OUT = temp.numer_ID_final nodupkey;
BY DX_new proc_new rev_new;
RUN;
In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue. So it should have 2 lines. The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code. The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.
proc freq data=have noprint;
tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
tables encrypted_id*prcdr_code / out=freqs2 (drop=percent count where=(prcdr_code^=.));
tables encrypted_id*revenue_code / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
merge have (drop=diagnosis_code_t prcdr_code revenue_code)
freqs1 (in=in1) freqs2 (in=in2) freqs3 (in=in3);
by id;
if max(in1,in2,in3)=1 ;
output;
call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;
True, this program requires 2 passes through the data set, but it has a very simple structure. One could do it in one pass, using hash objects, but that would be a good deal more programming. I wouldn't consider the hash object approach unless the data set was very large, and not sorted.
So, the only variables of interest are the three last variables? And the middle observation is redundant because obs 1 and 3 combined provides the variable values of the second observation, correct?
I need all of the variables in the example records.
Obs 1 and 3 combined provides the values of Obs 2 in these particular examples. But the redundant records are not always Obs 2. Also, there may be any number of records for a particular kid on a particular date.
Conceptually, one potential solution would be to have SAS group all records that have the same ID and claim date. Then, within each of those groups, have SAS look at the DX codes and assign a flag variable. If a given code appears only once, it would be flagged as 0. Other codes appearing more than once would be flagged as 1. Repeat this process for proc and rev codes.
Then an IF statement could use those flags to delete records that don't contain any new information:
IF DX_flag > 0 AND Proc_flag > 0 AND Rev_flag > 0 THEN delete;
So that's the idea... can anyone help me figure out how to actually program that?
You want to remove any observation that is a subset of the union of all the other observations. So which one do you remove if you data are:
A B C
1 11 .
1 . 111
. 11 111
Any of these three observations could be removed. Which one does your rule require?
And also note this has to be a recursive process. Because if you find two observations that qualify for deletion, once you remove the first one, you have to recheck that the other can still be removed.
Perhaps what you should do instead is to consolidate records = i.e. synthetically create a minimal set of records that contain the entire collection of observed values. For instance, from the above 3 records, instead of deleting one of them (leaving 2 records), make a single new record with
A B C
1 11 111
even though this record doesn't exist in your original data. And of course the same principle can be applied if some variables have multiple values.
I see your point. If I apply my rule to your example, it won't delete any records. I certainly like the idea of consolidating your example to a single line... now how do I actually program that?
In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue. So it should have 2 lines. The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code. The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.
proc freq data=have noprint;
tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
tables encrypted_id*prcdr_code / out=freqs2 (drop=percent count where=(prcdr_code^=.));
tables encrypted_id*revenue_code / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
merge have (drop=diagnosis_code_t prcdr_code revenue_code)
freqs1 (in=in1) freqs2 (in=in2) freqs3 (in=in3);
by id;
if max(in1,in2,in3)=1 ;
output;
call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;
True, this program requires 2 passes through the data set, but it has a very simple structure. One could do it in one pass, using hash objects, but that would be a good deal more programming. I wouldn't consider the hash object approach unless the data set was very large, and not sorted.
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.