How to merge two datasets and create three seperate datasets.

sas_student1 · Posted 11-24-2017 11:31 PM

Hello SAS community!!!

I have been working on a SAS query all day today and am stuck. Thought to ask the much smarter people here for help.

I have two datasets, let’s call them dataA and DataB. Among many other variables that are different in the two-data set, they both have the same 4 common variables, Client ID (client_id), service date (srdate), Hospital id (hosp_id) and hospital Location (hosp_loc).

DataA has about 600,000 rows and DataB has about 800,000 rows.

There are 2 issues that I have: I want to match common rows by the above 4 elements but in 4 different series. That is. any rows that have the same client ID, hospital Location, Hospital Id and Service date in both dataset I want to match and move to one dataset while the rows that do not match I want to separate them down to two separate non-matched datasets. That is, non-match dataA is saved in a dataA1 table and non-match dataB is saved in a dataB1 table. Now I can do, via either a simple merge with a by statement of the 4 variables or a join via proc SQL to create a matched table (table match1), but am having trouble (issue 1) separating the two non-matched tables into two separate tables.

Issue 2: Another issue is that after the full merge above for the remaining rows that did not match I want to have two separate tables (i.e. DataA1 and DataB1). Here what I want to do is match again by client Id, hospital Id, hospital location, but this time if the service date in dataA1 is either 7 days before or 7 days after the service data in dataB1 then I want that to match and create a separate match2 table from above and then create two more non-matched tables for the remaining dataA1 and DataB1 tables (we call them DataA2 and Data B2). At the end I will union all the possible matched tables into one.

Suggestions on how to tackle the two issues would be greatly appreciated.

Thank you!

Reeza · Posted 11-25-2017 12:12 AM

You can use a data step merge with the IN data set option.

http://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1i8w2bwu1fn5kn1gpxj18xttbb0.htm&docs...

probably looks something like this:

Data Matched dataA_Only dataB_Only;
Merge ....;
By .... ;

If inA  and not inB then output DataA_only;
Else if inB and not inA then output DataB_only;
Else output matched;

Issue 2:

A SQL join is better in this case. The join condition can include dates, but may also generate multiple matches.

sas_student1 · Posted 11-25-2017 01:17 AM

Of course sometimes the simplest solution is the answer.

However for issue two how do I create separate tables.

So I join by the client ID, Hospital ID, Hospital Location and have the rule for the dates. But I then again want to create three databases so that if the two tables match by the above rule they are created in one table and if they do not match two separate tables are created.

How do we put that rule in an SQL statement?

Thanks

Tom · Posted 11-25-2017 01:42 AM

You cannot make more than one output from an SQL statement.

You could generate one dataset and create a flag variable with three levels. (or two binary variables) and then split them later using a data step.

data A B C ;
  set sql_output ;
  if flag='A' then output A;
  else if flag='B' then output B;
  else output C;
run;

If you really need to save space then perhaps the SQL code is just a view so that only the output of the data step is saved.

Tom · Posted 11-25-2017 01:48 AM

Sample data would help a lot. Show both the inputs and results that you want to get. Make the sample data so that it has examples of all of the different results that you want to create.

Best is if you post the data in the form of a datastep that others can run.

data A;
  input key1 key2 key3 date A1-A3 ;
  informat date yymmdd.;
  format date yymmdd10.;
cards;
1 1 2 2017-01-01 1 2 3
1 1 2 2017-02-01 4 5 6
.....

sas_student1 · Posted 11-25-2017 02:32 AM

Sure,

So here are the two original data set example:

dataA;
input client_id hosp_id hosp_loc date;
informat date mmddyy10.;
format date mmddyy10.;
cards;

1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017;

run;

dataB;
input client_id hosp_id hosp_loc date;
informat date mmddyy10.;
format date mmddyy10.;
cards;

1 123 1234 10/1/2015
2 123 2345 10/5/2015
3 123 4325 2/1/2015
4 156 5678 5/20/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016;

run;

using the merge statement with the different In statement we would get the following table:

Match 1
client_id	Hospital_id	Hospital_loc	Service_date
1	123	1234	10/1/2015
5	167	2345	10/11/2016
6	167	8907	1/4/2017
7	178	3456	1/31/2017
9	145	5467	7/23/2017
10	134	2398	8/24/2017
11	967	2897	9/13/2017

Leaving us with the remaining two tables:

Table DataA1

client_id	Hospital_id	Hospital_loc	Service_date
2	123	2345	10/1/2015
3	123	4325	2/3/2015
4	156	5678	5/6/2016
8	178	4321	1/31/2017

And Table DataB1

client_id	Hospital_id	Hospital_loc	Service_date
2	123	2345	10/5/2015
3	123	4325	2/1/2015
4	156	5678	5/20/2016
8	178	4321	2/5/2017
13	167	4567	12/31/2016
14	145	3345	11/23/2016
15	123	1234	11/12/2016
16	999	9999	3/24/2016
17	178	4321	2/3/2016

Now what I would want to do in this round is to take table A1 and B1 and match by the time condition so I would get the following matched table:

client_id	Hospital_id	Hospital_loc	Service_date
2	123	2345	10/1/2015
3	123	4325	2/1/2015
8	178	4321	1/31/2017

And should have two more tables that show the rows that have not been matched.

so DataA2:

client_id	Hospital_id	Hospital_loc	Service_date
4	156	5678	5/6/2016

and DataB2:

client_id	Hospital_id	Hospital_loc	Service_date
4	156	5678	5/20/2016
13	167	4567	12/31/2016
14	145	3345	11/23/2016
15	123	1234	11/12/2016
16	999	9999	3/24/2016
17	178	4321	2/3/2016

Thanks again!

PGStats · Posted 11-25-2017 02:18 AM

It could be done this way, using a view, but it will break if consecutive dates are less than 7 days apart for a given id:

data a;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 23feb2017
1 08mar2017
;

data b;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 25feb2017
1 23mar2017
2 01apr2017
;

data c / view=c;
set b;
dateb = date;
format dateb yymmdd10.;
do shift = -7 to 7;
    date = intnx("day", dateb, shift);
    output;
    end;
run;

data aa(drop=dateb) bb(drop=date) match shifted;
retain found;
merge a (in=ina) c(in=inc);
by id date;
if shift=-7 then found = 0;
if ina and inc then do;
    if shift = 0 
        then output match;
        else output shifted; 
    found = 1;
    end;
else if ina and not inc then output aa;
else if shift=7 and not found then output bb;
drop found shift;
run;

PG

sas_student1 · Posted 11-25-2017 02:34 AM

Thank you PG,

There is a possibility that consecutive dates are less than 7 days. Since I am dealing with thousands of rows.

How do I not run into that problem?

Thanks

PGStats · Posted 11-25-2017 03:20 PM

If two observations are less than 7 days apart it is very likely that they will both match the same date from the other dataset. Let's say that you have dates April 1, 2, and 3 in dataA and April 5, 6, 7 and 8 in dataB. You need a rule to choose 3 out of 12 possible matches. The problem can't be solved unless it is completely defined.

PG

mkeintz · Posted 11-27-2017 08:41 PM

Since both datasets are sorted by four keys and you are looking for exact match on the first 3 keys, and a difference <=7 on the last key, you can do this in a single DATA step. But first note that since you want cases where the DATE can differ between a and b, you need to name the variables as DATE_A and DATE_B:

data A;
  input client_id hosp_id hosp_loc date_A ;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
run;

data B;
  input client_id hosp_id hosp_loc date_B XB;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;

1 123 1234 10/1/2015 
2 123 2345 10/5/2015 
3 123 4325 2/1/2015
4 156 5678 5/20/2016 
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017 
9 145 5467 7/23/2017 
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016
run;


proc sql noprint;
  select name into :avars separated by ' '
    from dictionary.columns where libname='WORK' and memname='A';
  select name into :bvars separated by ' '
    from dictionary.columns where libname='WORK' and memname='B';
quit;

data match (keep=&avars &bvars) aonly (keep=&avars);
  set a end=end_of_a;

  if _n_=1 then do;
    if 0 then set b;
    declare hash bdata (dataset:'b',ordered:'A');
      bdata.definekey ('client_id','hosp_id','hosp_loc','date_b');
      bdata.definedata(all:'Y');
      bdata.definedone();
  end;

  date_b=date_a;

  rc=bdata.find();
  if rc^=0 then do delta=1 to 7 until(rc=0);  /*If no exact match, start at one day apart*/
    date_b=date_a-delta;    /*Check for date_b PRECEDING date_a */
    rc=bdata.find();
    if rc=0 then leave;
    date_b=date_a+delta;    /*Otherwise check for date_b FOLLOWING date_a*/
    rc=bdata.find();
  end;

  if rc=0 then do;
    output match;
    rc=bdata.remove();
  end;
  else output aonly;
  if end_of_a then rc=bdata.output(dataset:'bonly');
run;

The principle here is to bulk load B into a hash object. Then read data set A one record at a time, and look for B dataitems (i.e. "observations" from zero to seven days away. If such a b data item is found, output MATCH and remove the data item from the hash object. If none if found, output AONLY. At the end of the data set output the remaining hash object to BONLY.
I use proc sql to create macrovar AVARS which lists only the vars in data set A. This allows the AONLY data set to keep only the relevant variables.
You'll want dataset A to be sorted to get consistent results. It doesn't matter if dataset B is pre-sorted. The hash object is set to sort order, so B_ONLY will be sorted.
As a general rule, choose the smaller dataset to put into the hash object - saves memory.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas_student1 · Posted 11-28-2017 01:12 PM

Thank you!

I will try this and thank you everyone for your suggestions and help.

I did try a method that required a few steps.

That is, in the two tables that I wanted to match I created a primary key of sorts for each row i.e id=_n_ which I would carry as I run an SQL query by the conditions I wanted to match the two data sets along with the time frame. This SQL creates a table with only the matched rows. Since I had the column with the primary key in this matched table I can go back to the original tables and identify which two rows were matched. I then run another SQL where I take out the matched rows from the original table. This method is not perfect as there are some duplicates, but it gets to what I need.

Thank you all for your assistance!!

How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Re: How to merge two datasets and create three seperate datasets.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!