Solved: Eliminate redunant information across 3 variables

Wolverine · Posted 01-15-2019 02:14 PM

I'm working with medical claims data, and I'm trying to create a file that will show the medical histories for a select group of kids. For each claim date for each kid, I need a list of diagnosis, procedure, and revenue codes. But many of the records do not include any new information. In each of the 2 examples below, the middle line contains information that is already present in the other lines. So the middle line is redundant and could be deleted.

encrypted_id	DATE_OF_BIRTH	GENDER	CLAIM_SERVICE_FROM_DATE	DIAGNOSIS_CODE_t	PRCDR_CODE	REVENUE_CODE
02D957E3EDF5	23-Aug-05	F	26-Apr-16	H52223	V2715	0611
02D957E3EDF5	23-Aug-05	F	26-Apr-16	H5203	V2715
02D957E3EDF5	23-Aug-05	F	26-Apr-16	H5203	V2020

encrypted_id	DATE_OF_BIRTH	GENDER	CLAIM_SERVICE_FROM_DATE	DIAGNOSIS_CODE_t	PRCDR_CODE	REVENUE_CODE
14A17629B53944	31-Jul-07	F	20-Apr-17	R480	99174	0610
14A17629B53944	31-Jul-07	F	20-Apr-17	Z6852	99174
14A17629B53944	31-Jul-07	F	20-Apr-17	Z6852	99214

Based on an approach I found in another post, I tried the code below. But for some reason, the rev_new variable contained diagnosis codes and DX_new contained rev codes. And there was still redundant information in the output file.

DATA temp.numer_ID_codesort; SET temp.numer_ID_list_dd;

LENGTH DX_new proc_new rev_new $10;

DX_new = DIAGNOSIS_CODE_t;
proc_new = PRCDR_CODE;
rev_new = REVENUE_CODE;

CALL SORTC(rev_new,proc_new,DX_new);
RUN;

PROC SORT DATA = temp.numer_ID_codesort OUT = temp.numer_ID_final nodupkey;
BY DX_new proc_new rev_new;
RUN;

mkeintz · Posted 01-16-2019 11:17 AM

In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue. So it should have 2 lines. The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code. The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.

proc freq data=have noprint;
  tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
  tables encrypted_id*prcdr_code       / out=freqs2 (drop=percent count where=(prcdr_code^=.));
  tables encrypted_id*revenue_code     / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
  merge have (drop=diagnosis_code_t prcdr_code revenue_code) 
        freqs1 (in=in1)  freqs2 (in=in2) freqs3 (in=in3);
  by id;
  if max(in1,in2,in3)=1 ;
  output;
  call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;

The proc freq tabulates the cross-tabulation of ID by each of the variables of interest and outputs the frequencies to FREQS1, FREQS2, and FREQS3, respectively. Ordinarily the output dataset would have 4 variables: the ID var, the var of interest, the variable COUNT (actually frequency of the ID*var-of-interest), and PERCENT. WE don't need the last 2, so they are dropped, leaving only ID and var-of-interest.
1. The FREQS files will be ordered by ID and the var-of-interest.
2. The where-filter eliminates frequencies of missing values.
The DATA WANT step
1. Use the "match merge", which requires all data sets to be sorted by id.
2. There may be 3 records in have for a given ID, even though none of the FREQS data sets has more than 2. That's why there is the "if max(in1,in2,ine)=1;" subseting IF statement.
3. The call missing (after the output statement) assures that none of the variable of interest will have values retained in the next observations. This could otherwise happen if FREQS1 and FREQS2 have 2 obs, but FREQS3 has only 1.
The resulting dataset will have each of the variables-of-interest in sorted order, because that is how proc freq would produce the FREQS data sets.

True, this program requires 2 passes through the data set, but it has a very simple structure. One could do it in one pass, using hash objects, but that would be a good deal more programming. I wouldn't consider the hash object approach unless the data set was very large, and not sorted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PeterClemmensen · Posted 01-15-2019 02:33 PM

So, the only variables of interest are the three last variables? And the middle observation is redundant because obs 1 and 3 combined provides the variable values of the second observation, correct?

The DATA to DATA Step Macro
Blog: SASnrd

Wolverine · Posted 01-15-2019 03:02 PM

I need all of the variables in the example records.

Obs 1 and 3 combined provides the values of Obs 2 in these particular examples. But the redundant records are not always Obs 2. Also, there may be any number of records for a particular kid on a particular date.

Wolverine · Posted 01-16-2019 07:42 AM

Conceptually, one potential solution would be to have SAS group all records that have the same ID and claim date. Then, within each of those groups, have SAS look at the DX codes and assign a flag variable. If a given code appears only once, it would be flagged as 0. Other codes appearing more than once would be flagged as 1. Repeat this process for proc and rev codes.

Then an IF statement could use those flags to delete records that don't contain any new information:

IF DX_flag > 0 AND Proc_flag > 0 AND Rev_flag > 0 THEN delete;

So that's the idea... can anyone help me figure out how to actually program that?

mkeintz · Posted 01-16-2019 10:13 AM

You want to remove any observation that is a subset of the union of all the other observations. So which one do you remove if you data are:

A B C

1 11 .

1 . 111

. 11 111

Any of these three observations could be removed. Which one does your rule require?

And also note this has to be a recursive process. Because if you find two observations that qualify for deletion, once you remove the first one, you have to recheck that the other can still be removed.

Perhaps what you should do instead is to consolidate records = i.e. synthetically create a minimal set of records that contain the entire collection of observed values. For instance, from the above 3 records, instead of deleting one of them (leaving 2 records), make a single new record with

A B C

1 11 111

even though this record doesn't exist in your original data. And of course the same principle can be applied if some variables have multiple values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Wolverine · Posted 01-16-2019 10:37 AM

I see your point. If I apply my rule to your example, it won't delete any records. I certainly like the idea of consolidating your example to a single line... now how do I actually program that?

mkeintz · Posted 01-16-2019 11:17 AM

In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue. So it should have 2 lines. The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code. The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.

proc freq data=have noprint;
  tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
  tables encrypted_id*prcdr_code       / out=freqs2 (drop=percent count where=(prcdr_code^=.));
  tables encrypted_id*revenue_code     / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
  merge have (drop=diagnosis_code_t prcdr_code revenue_code) 
        freqs1 (in=in1)  freqs2 (in=in2) freqs3 (in=in3);
  by id;
  if max(in1,in2,in3)=1 ;
  output;
  call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;

The proc freq tabulates the cross-tabulation of ID by each of the variables of interest and outputs the frequencies to FREQS1, FREQS2, and FREQS3, respectively. Ordinarily the output dataset would have 4 variables: the ID var, the var of interest, the variable COUNT (actually frequency of the ID*var-of-interest), and PERCENT. WE don't need the last 2, so they are dropped, leaving only ID and var-of-interest.
1. The FREQS files will be ordered by ID and the var-of-interest.
2. The where-filter eliminates frequencies of missing values.
The DATA WANT step
1. Use the "match merge", which requires all data sets to be sorted by id.
2. There may be 3 records in have for a given ID, even though none of the FREQS data sets has more than 2. That's why there is the "if max(in1,in2,ine)=1;" subseting IF statement.
3. The call missing (after the output statement) assures that none of the variable of interest will have values retained in the next observations. This could otherwise happen if FREQS1 and FREQS2 have 2 obs, but FREQS3 has only 1.
The resulting dataset will have each of the variables-of-interest in sorted order, because that is how proc freq would produce the FREQS data sets.

True, this program requires 2 passes through the data set, but it has a very simple structure. One could do it in one pass, using hash objects, but that would be a good deal more programming. I wouldn't consider the hash object approach unless the data set was very large, and not sorted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Wolverine · Posted 01-24-2019 09:17 AM

Thank u not only for the code, but also for the explanation of what the code does. This is a great approach that was simple to program and provided exactly what I was looking for!

Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Re: Eliminate redunant information across 3 variables

Registration is open

Registration is open

SAS Training: Just a Click Away