BookmarkSubscribeRSS Feed
sas_student1
Quartz | Level 8

Hello SAS community!!!

 

I have been working on a SAS query all day today and am stuck. Thought to ask the much smarter people here for help.

 

I have two datasets, let’s call them dataA and DataB. Among many other variables that are different in the two-data set, they both have the same 4 common variables, Client ID (client_id), service date (srdate), Hospital id (hosp_id) and hospital Location (hosp_loc).

 

DataA has about 600,000 rows and DataB has about 800,000 rows.

 

There are 2 issues that I have: I want to match common rows by the above 4 elements but in 4 different series. That is. any rows that have the same client ID, hospital Location, Hospital Id and Service date in both dataset I want to match and move to one dataset while the rows that do not match I want to separate them down to two separate non-matched datasets. That is, non-match dataA is saved in a dataA1 table and non-match dataB is saved in a dataB1 table. Now I can do, via either a simple merge with a by statement of the 4 variables or a join via proc SQL to create a matched table (table match1), but am having trouble (issue 1) separating the two non-matched tables into two separate tables.

 

Issue 2: Another issue is that after the full merge above for the remaining rows that did not match I want to have two separate tables (i.e. DataA1 and DataB1). Here what I want to do is match again by client Id, hospital Id, hospital location, but this time if the service date in dataA1 is either 7 days before or 7 days after the service data in dataB1 then I want that to match and create a separate match2 table from above and then create two more non-matched tables for the remaining dataA1 and DataB1 tables (we call them DataA2 and Data B2). At the end I will union all the possible matched tables into one.

 

Suggestions on how to tackle the two issues would be greatly appreciated.

 

Thank you!

10 REPLIES 10
Reeza
Super User

You can use a data step merge with the IN data set option. 

 

http://documentation.sas.com/?docsetId=lestmtsref&docsetTarget=n1i8w2bwu1fn5kn1gpxj18xttbb0.htm&docs...

 

probably looks something like this:

 

Data Matched dataA_Only dataB_Only;
Merge ....;
By .... ;

If inA  and not inB then output DataA_only;
Else if inB and not inA then output DataB_only;
Else output matched;

Issue 2:

A SQL join is better in this case. The join condition can include dates, but may also generate multiple matches. 

sas_student1
Quartz | Level 8

Of course sometimes the simplest solution is the answer.

 

However for issue two how do I create separate tables.

So I join by the client ID, Hospital ID, Hospital Location and have the rule for the dates. But I then again want to create three databases so that if the two tables match by the above rule they are created in one table and if they do not match two separate tables are created.

How do we put that rule in an SQL statement?

 

Thanks

 

Tom
Super User Tom
Super User

You cannot make more than one output from an SQL statement.

You could generate one dataset and create a flag variable with three levels.  (or two binary variables) and then split them later using a data step. 

data A B C ;
  set sql_output ;
  if flag='A' then output A;
  else if flag='B' then output B;
  else output C;
run;

If you really need to save space then perhaps the SQL code is just a view so that only the output of the data step is saved.

Tom
Super User Tom
Super User

Sample data would help a lot.  Show both the inputs and results that you want to get. Make the sample data so that it has examples of all of the different results that you want to create.

Best is if you post the data in the form of a datastep that others can run.

data A;
  input key1 key2 key3 date A1-A3 ;
  informat date yymmdd.;
  format date yymmdd10.;
cards;
1 1 2 2017-01-01 1 2 3
1 1 2 2017-02-01 4 5 6
.....
sas_student1
Quartz | Level 8

Sure,

 

So here are the two original data set example:

 

dataA;
  input client_id hosp_id hosp_loc date;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;

1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017;

run;

 

dataB;
  input client_id hosp_id hosp_loc date;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;

1 123 1234 10/1/2015
2 123 2345 10/5/2015
3 123 4325 2/1/2015
4 156 5678 5/20/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016;

run;

 

using the merge statement with the different In statement we would get the following table:

 

Match 1   
client_idHospital_idHospital_locService_date
1123123410/1/2015
5167234510/11/2016
616789071/4/2017
717834561/31/2017
914554677/23/2017
1013423988/24/2017
119672897

9/13/2017

 

Leaving us with the remaining two tables:

Table DataA1

client_idHospital_idHospital_locService_date
2123234510/1/2015
312343252/3/2015
415656785/6/2016
817843211/31/2017

 

And Table DataB1

client_idHospital_idHospital_locService_date
2123234510/5/2015
312343252/1/2015
415656785/20/2016
817843212/5/2017
13167456712/31/2016
14145334511/23/2016
15123123411/12/2016
1699999993/24/2016
1717843212/3/2016

 

 

Now what I would want to do in this round is to take table A1 and B1 and match by the time condition so I would get the following matched table:

client_idHospital_idHospital_locService_date
2123234510/1/2015
312343252/1/2015
817843211/31/2017

 

And should have two more tables that show the rows that have not been matched.

so DataA2:

client_idHospital_idHospital_locService_date
415656785/6/2016

 

and DataB2:

 

client_idHospital_idHospital_locService_date
415656785/20/2016
13167456712/31/2016
14145334511/23/2016
15123123411/12/2016
1699999993/24/2016
1717843212/3/2016

 

Thanks again!

 

PGStats
Opal | Level 21

It could be done this way, using a view, but it will break if consecutive dates are less than 7 days apart for a given id:

 

data a;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 23feb2017
1 08mar2017
;

data b;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 25feb2017
1 23mar2017
2 01apr2017
;

data c / view=c;
set b;
dateb = date;
format dateb yymmdd10.;
do shift = -7 to 7;
    date = intnx("day", dateb, shift);
    output;
    end;
run;

data aa(drop=dateb) bb(drop=date) match shifted;
retain found;
merge a (in=ina) c(in=inc);
by id date;
if shift=-7 then found = 0;
if ina and inc then do;
    if shift = 0 
        then output match;
        else output shifted; 
    found = 1;
    end;
else if ina and not inc then output aa;
else if shift=7 and not found then output bb;
drop found shift;
run;
PG
sas_student1
Quartz | Level 8

Thank you PG,

There is a possibility that consecutive dates are less than 7 days. Since I am dealing with thousands of rows.

How do I not run into that problem?

 

Thanks

PGStats
Opal | Level 21

If two observations are less than 7 days apart it is very likely that they will both match the same date from the other dataset. Let's say that you have dates April 1,  2, and 3 in dataA and April 5, 6, 7 and 8 in dataB. You need a rule to choose 3 out of 12 possible matches. The problem can't be solved unless it is completely defined.

PG
mkeintz
PROC Star

Since both datasets are sorted by four keys and you are looking for exact match on the first 3 keys, and a difference <=7 on the last key, you can do this in a single DATA step.  But first note that since you want cases where the DATE can differ between a and b, you need to name the variables as DATE_A and DATE_B:

 

data A;
  input client_id hosp_id hosp_loc date_A ;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
run;

data B;
  input client_id hosp_id hosp_loc date_B XB;
  informat date mmddyy10.;
  format date mmddyy10.;
cards;

1 123 1234 10/1/2015 
2 123 2345 10/5/2015 
3 123 4325 2/1/2015
4 156 5678 5/20/2016 
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017 
9 145 5467 7/23/2017 
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016
run;


proc sql noprint;
  select name into :avars separated by ' '
    from dictionary.columns where libname='WORK' and memname='A';
  select name into :bvars separated by ' '
    from dictionary.columns where libname='WORK' and memname='B';
quit;

data match (keep=&avars &bvars) aonly (keep=&avars);
  set a end=end_of_a;

  if _n_=1 then do;
    if 0 then set b;
    declare hash bdata (dataset:'b',ordered:'A');
      bdata.definekey ('client_id','hosp_id','hosp_loc','date_b');
      bdata.definedata(all:'Y');
      bdata.definedone();
  end;

  date_b=date_a;

  rc=bdata.find();
  if rc^=0 then do delta=1 to 7 until(rc=0);  /*If no exact match, start at one day apart*/
    date_b=date_a-delta;    /*Check for date_b PRECEDING date_a */
    rc=bdata.find();
    if rc=0 then leave;
    date_b=date_a+delta;    /*Otherwise check for date_b FOLLOWING date_a*/
    rc=bdata.find();
  end;

  if rc=0 then do;
    output match;
    rc=bdata.remove();
  end;
  else output aonly;
  if end_of_a then rc=bdata.output(dataset:'bonly');
run;

 

 

  1. The principle here is to bulk load B into a hash object.  Then read data set A one record at a time, and look for B dataitems (i.e. "observations" from zero to seven days away.  If such a b data item is found, output MATCH and remove the data item from the hash object.  If none if found, output AONLY.  At the end of the data set output the remaining hash object to BONLY.
  2. I use proc sql to create macrovar AVARS which lists only the vars in data set A.  This allows the AONLY data set to keep only the relevant variables.
  3. You'll want dataset A to be sorted to get consistent results.  It doesn't matter if dataset B is pre-sorted.  The hash object is set to sort order, so B_ONLY will be sorted.
  4. As a general rule, choose the smaller dataset to put into the hash object - saves memory.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
sas_student1
Quartz | Level 8

Thank you!

I will try this and thank you everyone for your suggestions and help.

 

I did try a method that required a few steps.

 

That is, in the two tables that I wanted to match I created a primary key of sorts for each row i.e id=_n_ which I would carry as I run an SQL query by the conditions I wanted to match the two data sets along with the time frame. This SQL creates a table with only the matched rows. Since I had the column with the primary key in this matched table I can go back to the original tables and identify which two rows were matched. I then run another SQL where I take out the matched rows from the original table. This method is not perfect as there are some duplicates, but it gets to what I need.

 

Thank you all for your assistance!!

 

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 5652 views
  • 0 likes
  • 5 in conversation