BookmarkSubscribeRSS Feed
crunchit
Obsidian | Level 7

Does anyone have experience with combining/joining/merging (or whatever the best term is) different claims data? I have created 3 files containing information on treatments received by patients in 3 different settings (inpatient, outpatient, and partD) across a period of time hence patients have multiple observations depicting the treatment sequence.

 

It is possible that a patient could have received treatment in different settings at some different points in time and thus could have records in all of the three, hence I'm not sure merging/joining will be the best approach since I may lose information in the patient journey. I don't know if appending is the best either.

 

The point is I want to combine these 3 files into one overarching treatment file and still preserve the information in these separate files even if the variables are not common in all 3. That being said there are some common variables among the files.

 

I also want to be able to create an indicator variable in the merged file so I know where the event occurred. (source of file e.g Outpat, INp or Partd)

Any help will greatly be appreciated.

 

Thanks Note: the code presented here is the idea but not outputting the results I want and I know something is wrong hence my cry for help. I hope this makes sense.

SAS_help.jpg

4 REPLIES 4
ballardw
Super User

I suspect that you do not want a join at all.

Here is what I think might be more useful

 

data drgs_merged;
    set table1
         table2
         table3
         indsname=dsn
   ;
   SourceFile=dsn;
run;

This appends (or if you must think in SQL terms, unions) the datasets and adds a variable SourceFile that will have the name of the individual dataset. The Indsname= option creates a temporary variable that I have named DSN above that holds the name of the data set contributing the current observation. Temporary means that it will disappear so to keep the value(s) assign it to a variable.

One reason I suspect very strongly that you do not want the join is because each value of ID is going to end up matched with every value of ID in the other sets and create multiple records. If ID=123, for example, appears in Table2 1 three times, Table2 two times and Table3 four times you would have ID=123 3*2*4=24 times in the output data set.

The proposed solution I show will have each ID from each table only appear as many times total, i.e. 3+2+4=9.

 

Warning: If any of variables of the same name have different characteristics you may need more work. The append will not work if any variable of the same name has a different type in different sets (numeric in some, character in other set). Also if the same named variables are character but different lengths you might have data truncated. If you get that warning the solution to avoid truncation is to add a LENGTH statement before the SET statement with the name of the variable and a length equal to (or greater than) the longest defined value. Suppose you get a warning about the variable of DX has multiple lengths. Then add something like:   Length dx $25. ; The number you use depends on the length needed.  If you have any questions use PROC CONTENTS for each of the data sets to find out the Length assigned to all the variables.

crunchit
Obsidian | Level 7

Thanks @ballardw I suspected the basic approach as well but wasn't very confident that was the right approach. I agree with the Cartesian issue. I did run into that and had like 10x the data size. I will give this simple approach a shot and see how the data turns out. I appreciate it.

mkeintz
PROC Star

There is a possibly useful alternative to the unconditional SET coding offered by @ballardw - namely conditional SETs.

 

Assuming you might want a date by date history  (i.e. one record per date) containing the most recent data from each of the three sources (INPAT, OUTPAT, and PARTD), you can use a technique like the below (for sources sorted by ID/DATE).

 

data want;
  set inpat  (keep=id date in=_in)
      outpat (keep=id date in=_out)
      partd  (keep=id date in=_partd) ;
  by id date;
  if _in    then set inpat  (rename=(date=date_inpat  dx=dx_inpat  hc=hc_inpat));
  if _out   then set outpat (rename=(date=date_outpat dx=dx_outpat hc=hc_outpat));
  if _partd then set partd  (rename=(date=date_partd  dx=dx_partd  hc=hc_partd));

  if last.date then output;
  if last.id then call missing(of _all_);
run;

Unlike the unconditional SET, variables from one source are NOT removed when data from another source are encountered, even if they are not from the same date.  Of course, it requires renaming all variables common to multiple sources, so you have no variable collisions.  The code above will generate a dataset with one record per ID/DATE, containing

  1. ID and DATE (the most recent date among the 3 sources)
  2. DATE_INPAT  DX_INPAT HC_INPAT: the most recent DATE, DX, and HC from INPAT.
  3. DATE_OUTPAT  DX_OUTPAT HC_OUTPAT: the most recent DATE, DX, and HC from OUTPAT.
  4. DATE_PARTD  DX_PARTD HC_PARTD: the most recent DATE, DX, and HC from INPAT.
  5. Plus all the uniquely named variables from each source.

 

The advantage of this you have simultaneous information on most recent data from all sources, which your analyst might want to see.  You can also modify this code to remove "stale" data - (say the most recent PARTD is over 6 months old - so you could set just the PARTD vars to missing).

 

Caveat: this program assumes each source has no more than one record per ID/DATE.  You could even make a minor tweak to accommodate cases where a given date has multiple obs from just one of the three source datasets (it need not be the same source for every multi-record date).    Of course, if instead of simple dates, you have DATETIME stamps, you're less likely to encounter instances of multiple records from a given source with the same datetime stamp.

 

You can see more about this approach, if interested, in my 2020 paper:  History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
crunchit
Obsidian | Level 7

thanks, @mkeintz, and for the reference. Good stuff

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 837 views
  • 2 likes
  • 3 in conversation