Hello SAS community!!!
I have been working on a SAS query all day today and am stuck. Thought to ask the much smarter people here for help.
I have two datasets, let’s call them dataA and DataB. Among many other variables that are different in the two-data set, they both have the same 4 common variables, Client ID (client_id), service date (srdate), Hospital id (hosp_id) and hospital Location (hosp_loc).
DataA has about 600,000 rows and DataB has about 800,000 rows.
There are 2 issues that I have: I want to match common rows by the above 4 elements but in 4 different series. That is. any rows that have the same client ID, hospital Location, Hospital Id and Service date in both dataset I want to match and move to one dataset while the rows that do not match I want to separate them down to two separate non-matched datasets. That is, non-match dataA is saved in a dataA1 table and non-match dataB is saved in a dataB1 table. Now I can do, via either a simple merge with a by statement of the 4 variables or a join via proc SQL to create a matched table (table match1), but am having trouble (issue 1) separating the two non-matched tables into two separate tables.
Issue 2: Another issue is that after the full merge above for the remaining rows that did not match I want to have two separate tables (i.e. DataA1 and DataB1). Here what I want to do is match again by client Id, hospital Id, hospital location, but this time if the service date in dataA1 is either 7 days before or 7 days after the service data in dataB1 then I want that to match and create a separate match2 table from above and then create two more non-matched tables for the remaining dataA1 and DataB1 tables (we call them DataA2 and Data B2). At the end I will union all the possible matched tables into one.
Suggestions on how to tackle the two issues would be greatly appreciated.
Thank you!
You can use a data step merge with the IN data set option.
probably looks something like this:
Data Matched dataA_Only dataB_Only;
Merge ....;
By .... ;
If inA and not inB then output DataA_only;
Else if inB and not inA then output DataB_only;
Else output matched;
Issue 2:
A SQL join is better in this case. The join condition can include dates, but may also generate multiple matches.
Of course sometimes the simplest solution is the answer.
However for issue two how do I create separate tables.
So I join by the client ID, Hospital ID, Hospital Location and have the rule for the dates. But I then again want to create three databases so that if the two tables match by the above rule they are created in one table and if they do not match two separate tables are created.
How do we put that rule in an SQL statement?
Thanks
You cannot make more than one output from an SQL statement.
You could generate one dataset and create a flag variable with three levels. (or two binary variables) and then split them later using a data step.
data A B C ;
set sql_output ;
if flag='A' then output A;
else if flag='B' then output B;
else output C;
run;
If you really need to save space then perhaps the SQL code is just a view so that only the output of the data step is saved.
Sample data would help a lot. Show both the inputs and results that you want to get. Make the sample data so that it has examples of all of the different results that you want to create.
Best is if you post the data in the form of a datastep that others can run.
data A;
input key1 key2 key3 date A1-A3 ;
informat date yymmdd.;
format date yymmdd10.;
cards;
1 1 2 2017-01-01 1 2 3
1 1 2 2017-02-01 4 5 6
.....
Sure,
So here are the two original data set example:
dataA;
input client_id hosp_id hosp_loc date;
informat date mmddyy10.;
format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017;
run;
dataB;
input client_id hosp_id hosp_loc date;
informat date mmddyy10.;
format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/5/2015
3 123 4325 2/1/2015
4 156 5678 5/20/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016;
run;
using the merge statement with the different In statement we would get the following table:
Match 1 | |||
client_id | Hospital_id | Hospital_loc | Service_date |
1 | 123 | 1234 | 10/1/2015 |
5 | 167 | 2345 | 10/11/2016 |
6 | 167 | 8907 | 1/4/2017 |
7 | 178 | 3456 | 1/31/2017 |
9 | 145 | 5467 | 7/23/2017 |
10 | 134 | 2398 | 8/24/2017 |
11 | 967 | 2897 | 9/13/2017 |
Leaving us with the remaining two tables:
Table DataA1
client_id | Hospital_id | Hospital_loc | Service_date |
2 | 123 | 2345 | 10/1/2015 |
3 | 123 | 4325 | 2/3/2015 |
4 | 156 | 5678 | 5/6/2016 |
8 | 178 | 4321 | 1/31/2017 |
And Table DataB1
client_id | Hospital_id | Hospital_loc | Service_date |
2 | 123 | 2345 | 10/5/2015 |
3 | 123 | 4325 | 2/1/2015 |
4 | 156 | 5678 | 5/20/2016 |
8 | 178 | 4321 | 2/5/2017 |
13 | 167 | 4567 | 12/31/2016 |
14 | 145 | 3345 | 11/23/2016 |
15 | 123 | 1234 | 11/12/2016 |
16 | 999 | 9999 | 3/24/2016 |
17 | 178 | 4321 | 2/3/2016 |
Now what I would want to do in this round is to take table A1 and B1 and match by the time condition so I would get the following matched table:
client_id | Hospital_id | Hospital_loc | Service_date |
2 | 123 | 2345 | 10/1/2015 |
3 | 123 | 4325 | 2/1/2015 |
8 | 178 | 4321 | 1/31/2017 |
And should have two more tables that show the rows that have not been matched.
so DataA2:
client_id | Hospital_id | Hospital_loc | Service_date |
4 | 156 | 5678 | 5/6/2016 |
and DataB2:
client_id | Hospital_id | Hospital_loc | Service_date |
4 | 156 | 5678 | 5/20/2016 |
13 | 167 | 4567 | 12/31/2016 |
14 | 145 | 3345 | 11/23/2016 |
15 | 123 | 1234 | 11/12/2016 |
16 | 999 | 9999 | 3/24/2016 |
17 | 178 | 4321 | 2/3/2016 |
Thanks again!
It could be done this way, using a view, but it will break if consecutive dates are less than 7 days apart for a given id:
data a;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 23feb2017
1 08mar2017
;
data b;
input id date date9.;
format date yymmdd10.;
datalines;
1 15jan2017
1 25feb2017
1 23mar2017
2 01apr2017
;
data c / view=c;
set b;
dateb = date;
format dateb yymmdd10.;
do shift = -7 to 7;
date = intnx("day", dateb, shift);
output;
end;
run;
data aa(drop=dateb) bb(drop=date) match shifted;
retain found;
merge a (in=ina) c(in=inc);
by id date;
if shift=-7 then found = 0;
if ina and inc then do;
if shift = 0
then output match;
else output shifted;
found = 1;
end;
else if ina and not inc then output aa;
else if shift=7 and not found then output bb;
drop found shift;
run;
Thank you PG,
There is a possibility that consecutive dates are less than 7 days. Since I am dealing with thousands of rows.
How do I not run into that problem?
Thanks
If two observations are less than 7 days apart it is very likely that they will both match the same date from the other dataset. Let's say that you have dates April 1, 2, and 3 in dataA and April 5, 6, 7 and 8 in dataB. You need a rule to choose 3 out of 12 possible matches. The problem can't be solved unless it is completely defined.
Since both datasets are sorted by four keys and you are looking for exact match on the first 3 keys, and a difference <=7 on the last key, you can do this in a single DATA step. But first note that since you want cases where the DATE can differ between a and b, you need to name the variables as DATE_A and DATE_B:
data A;
input client_id hosp_id hosp_loc date_A ;
informat date mmddyy10.;
format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/1/2015
3 123 4325 2/3/2015
4 156 5678 5/6/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 178 3456 1/31/2017
8 178 4321 1/31/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
run;
data B;
input client_id hosp_id hosp_loc date_B XB;
informat date mmddyy10.;
format date mmddyy10.;
cards;
1 123 1234 10/1/2015
2 123 2345 10/5/2015
3 123 4325 2/1/2015
4 156 5678 5/20/2016
5 167 2345 10/11/2016
6 167 8907 1/4/2017
7 138 8643 1/31/2017
8 178 4321 2/5/2017
9 145 5467 7/23/2017
10 134 2398 8/24/2017
11 967 2897 9/13/2017
13 167 4567 12/31/2016
14 145 3345 11/23/2016
15 123 1234 11/12/2016
16 999 9999 3/24/2016
17 178 4321 2/3/2016
run;
proc sql noprint;
select name into :avars separated by ' '
from dictionary.columns where libname='WORK' and memname='A';
select name into :bvars separated by ' '
from dictionary.columns where libname='WORK' and memname='B';
quit;
data match (keep=&avars &bvars) aonly (keep=&avars);
set a end=end_of_a;
if _n_=1 then do;
if 0 then set b;
declare hash bdata (dataset:'b',ordered:'A');
bdata.definekey ('client_id','hosp_id','hosp_loc','date_b');
bdata.definedata(all:'Y');
bdata.definedone();
end;
date_b=date_a;
rc=bdata.find();
if rc^=0 then do delta=1 to 7 until(rc=0); /*If no exact match, start at one day apart*/
date_b=date_a-delta; /*Check for date_b PRECEDING date_a */
rc=bdata.find();
if rc=0 then leave;
date_b=date_a+delta; /*Otherwise check for date_b FOLLOWING date_a*/
rc=bdata.find();
end;
if rc=0 then do;
output match;
rc=bdata.remove();
end;
else output aonly;
if end_of_a then rc=bdata.output(dataset:'bonly');
run;
Thank you!
I will try this and thank you everyone for your suggestions and help.
I did try a method that required a few steps.
That is, in the two tables that I wanted to match I created a primary key of sorts for each row i.e id=_n_ which I would carry as I run an SQL query by the conditions I wanted to match the two data sets along with the time frame. This SQL creates a table with only the matched rows. Since I had the column with the primary key in this matched table I can go back to the original tables and identify which two rows were matched. I then run another SQL where I take out the matched rows from the original table. This method is not perfect as there are some duplicates, but it gets to what I need.
Thank you all for your assistance!!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.