BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Wolverine
Pyrite | Level 9

I'm working with medical claims data, and I'm trying to create a file that will show the medical histories for a select group of kids.  For each claim date for each kid, I need a list of diagnosis, procedure, and revenue codes.  But many of the records do not include any new information.  In each of the 2 examples below, the middle line contains information that is already present in the other lines.  So the middle line is redundant and could be deleted.

encrypted_id DATE_OF_BIRTH GENDER CLAIM_SERVICE_FROM_DATE DIAGNOSIS_CODE_t PRCDR_CODE REVENUE_CODE
02D957E3EDF5 23-Aug-05 F 26-Apr-16 H52223 V2715 0611
02D957E3EDF5 23-Aug-05 F 26-Apr-16 H5203 V2715  
02D957E3EDF5 23-Aug-05 F 26-Apr-16 H5203 V2020  
             
encrypted_id DATE_OF_BIRTH GENDER CLAIM_SERVICE_FROM_DATE DIAGNOSIS_CODE_t PRCDR_CODE REVENUE_CODE
14A17629B53944 31-Jul-07 F 20-Apr-17 R480 99174 0610
14A17629B53944 31-Jul-07 F 20-Apr-17 Z6852 99174  
14A17629B53944 31-Jul-07 F 20-Apr-17 Z6852 99214  

 

Based on an approach I found in another post, I tried the code below.  But for some reason, the rev_new variable contained diagnosis codes and DX_new contained rev codes.  And there was still redundant information in the output file.

 

DATA temp.numer_ID_codesort; SET temp.numer_ID_list_dd;

LENGTH DX_new proc_new rev_new $10;

DX_new = DIAGNOSIS_CODE_t;
proc_new = PRCDR_CODE;
rev_new = REVENUE_CODE;

CALL SORTC(rev_new,proc_new,DX_new);
RUN;

PROC SORT DATA = temp.numer_ID_codesort OUT = temp.numer_ID_final nodupkey;
BY DX_new proc_new rev_new;
RUN;

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue.  So it should have 2 lines.  The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code.  The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.

 

proc freq data=have noprint;
  tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
  tables encrypted_id*prcdr_code       / out=freqs2 (drop=percent count where=(prcdr_code^=.));
  tables encrypted_id*revenue_code     / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
  merge have (drop=diagnosis_code_t prcdr_code revenue_code) 
        freqs1 (in=in1)  freqs2 (in=in2) freqs3 (in=in3);
  by id;
  if max(in1,in2,in3)=1 ;
  output;
  call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;

 

  1. The proc freq tabulates the cross-tabulation of ID by each of the variables of interest and outputs the frequencies to FREQS1, FREQS2, and FREQS3, respectively.  Ordinarily the output dataset would have 4 variables: the ID var, the var of interest, the variable COUNT (actually frequency of the ID*var-of-interest), and PERCENT.  WE don't need the last 2, so they are dropped, leaving only ID and var-of-interest.
    1. The FREQS files will be ordered by ID and the var-of-interest.
    2. The where-filter eliminates frequencies of missing values.
  2. The DATA WANT step
    1. Use the "match merge", which requires all data sets to be sorted by id.
    2. There may be 3 records in have for a given ID, even though none of the FREQS data sets has more than 2.  That's why there is the "if max(in1,in2,ine)=1;"  subseting IF statement.
    3. The call missing (after the output statement) assures that none of the variable of interest will have values retained in the next observations.  This could otherwise happen if FREQS1 and FREQS2 have 2 obs, but FREQS3 has only 1.
  3. The resulting dataset  will have each of the variables-of-interest in sorted order, because that is how proc freq would produce the FREQS data sets.

True, this program requires 2 passes through the data set, but it has a very simple structure.  One could do it in one pass, using hash objects, but that would be a good deal more programming.  I wouldn't consider the hash object approach unless the data set was very large, and not sorted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
PeterClemmensen
Tourmaline | Level 20

So, the only variables of interest are the three last variables? And the middle observation is redundant because obs 1 and 3 combined provides the variable values of the second observation, correct?

Wolverine
Pyrite | Level 9

I need all of the variables in the example records.

 

Obs 1 and 3 combined provides the values of Obs 2 in these particular examples.  But the redundant records are not always Obs 2.  Also, there may be any number of records for a particular kid on a particular date.

Wolverine
Pyrite | Level 9

Conceptually, one potential solution would be to have SAS group all records that have the same ID and claim date.  Then, within each of those groups, have SAS look at the DX codes and assign a flag variable.  If a given code appears only once, it would be flagged as 0.  Other codes appearing more than once would be flagged as 1.  Repeat this process for proc and rev codes.

 

Then an IF statement could use those flags to delete records that don't contain any new information:

 

IF DX_flag > 0 AND Proc_flag > 0 AND Rev_flag > 0 THEN delete;

 

So that's the idea... can anyone help me figure out how to actually program that?

mkeintz
PROC Star

You want to remove any observation that is a subset of the union of all the other observations.  So which one do you remove if you data are:

 

      A    B      C

      1   11      .

      1    .    111

      .   11    111

 

Any of these three observations could be removed.  Which one does your rule require? 

 

And also note this has to be a recursive process.  Because if you find two observations that qualify for deletion, once you remove the first one, you have to recheck that the other can still be removed. 

 

Perhaps what you should do instead is to consolidate records = i.e. synthetically create a minimal set of records that contain the entire collection of observed values.  For instance, from the above 3 records, instead of deleting one of them (leaving 2 records), make a single new record with

 

      A    B      C

      1   11    111

 

even though this record doesn't exist in your original data.  And of course the same principle can be applied if some variables have multiple values.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Wolverine
Pyrite | Level 9

I see your point.  If I apply my rule to your example, it won't delete any records.  I certainly like the idea of consolidating your example to a single line... now how do I actually program that?

mkeintz
PROC Star

In your sample data, it won't actually be a single line, since your first set has 2 values for diagnosis, 2 for prcdr, and 1 for revenue.  So it should have 2 lines.  The first line would have one of the diagnosis codes, one of the prcdr codes, and the only value for revenue code.  The second would have the alternative value for diagnosis code, the alternative for prcdr code, and a missing value for revenue.

 

proc freq data=have noprint;
  tables encrypted_id*diagnosis_code_t / out=freqs1 (drop=percent count where=(diagnosis_code_t^=.));
  tables encrypted_id*prcdr_code       / out=freqs2 (drop=percent count where=(prcdr_code^=.));
  tables encrypted_id*revenue_code     / out=freqs3 (drop=percent count where=(revenue_code^=.));
run;
data want;
  merge have (drop=diagnosis_code_t prcdr_code revenue_code) 
        freqs1 (in=in1)  freqs2 (in=in2) freqs3 (in=in3);
  by id;
  if max(in1,in2,in3)=1 ;
  output;
  call missing (diagnosis_code_t,prcdr_code,revenue_code);
run;

 

  1. The proc freq tabulates the cross-tabulation of ID by each of the variables of interest and outputs the frequencies to FREQS1, FREQS2, and FREQS3, respectively.  Ordinarily the output dataset would have 4 variables: the ID var, the var of interest, the variable COUNT (actually frequency of the ID*var-of-interest), and PERCENT.  WE don't need the last 2, so they are dropped, leaving only ID and var-of-interest.
    1. The FREQS files will be ordered by ID and the var-of-interest.
    2. The where-filter eliminates frequencies of missing values.
  2. The DATA WANT step
    1. Use the "match merge", which requires all data sets to be sorted by id.
    2. There may be 3 records in have for a given ID, even though none of the FREQS data sets has more than 2.  That's why there is the "if max(in1,in2,ine)=1;"  subseting IF statement.
    3. The call missing (after the output statement) assures that none of the variable of interest will have values retained in the next observations.  This could otherwise happen if FREQS1 and FREQS2 have 2 obs, but FREQS3 has only 1.
  3. The resulting dataset  will have each of the variables-of-interest in sorted order, because that is how proc freq would produce the FREQS data sets.

True, this program requires 2 passes through the data set, but it has a very simple structure.  One could do it in one pass, using hash objects, but that would be a good deal more programming.  I wouldn't consider the hash object approach unless the data set was very large, and not sorted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Wolverine
Pyrite | Level 9
Thank u not only for the code, but also for the explanation of what the code does. This is a great approach that was simple to program and provided exactly what I was looking for!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1605 views
  • 0 likes
  • 3 in conversation