Hi All,
So I am working with two different datasets which have data coming in from multiple facilities. Both datasets report much of the same information, but have some differences (including entries that are available in one, but not the other). We are currently working on switching from using one of these datasets to the other for a number of products, and I am having to perform some QA on them.
I want to create two subsets of the data, where I only pull entries that occur in both datasets (and would also like to do the opposite, to determine which data is appearing in one but not the other and why).
Each entry has a "unique" identifier number, but unfortunately they can appear more than once because each of the facilities generates this number indipendantly of any other facility, and sometimes they end up being the same by chance. Because if this, I can't simply match based on the "unique" identifier alone.
I used the following code to match based on the identifier, but need to tell it to also match the facility WITHIN THE SAME ROW:
proc sql noprint; create table SET1_MATCHED as select * from ATDED.SET1 where EPISODE_NR in (select EPISODE_NR from SET2) by facility ; quit; proc sql noprint; create table SET2_MATCHED as select * from ATDED.SET2 where EPISODE_NR in (select EPISODE_NR from SET1) by facility ; quit;
Unfortunately, I can't simply add "and FACILITY in(select FACILITY from SET1/2)" because it won't search within the same row, it will just check if the facility is within the other data set at all, and therefore produces identical subsets of the data to ust excluding this line.
Any suggestions?
(Note: I could approach this a little differently and merge the datasets, but would prefer to keep them separate if possible. I could also add in a "and FACILITY = facility_name" and create a subset for each facility and append the sets back together, but there are a lot of facilities and I figured there had to be a more efficient way. ).
Just in case anyone is actually following this (probably not), here is how I solved both the matching and the non-matching data.
Matching:
proc sql noprint; create table SET1_MATCHED as select distinct a. * from SET1 a, SET2 b where a.EPISODE_NR = b.EPISODE_NR and a.FACILITY = b.FACILITY ; quit; (REPEATED FOR SET2)
Non-Matching: It involves 3 steps for each data set. It isn't the most elegent and there is probably a more efficient way to do it, but it works.
/*Create Identifier Datasets*/ proc sql noprint; create table SET1_IDENTIFIERS as select EPISODE_NR, FACILITY from SET1 ; quit; (REPEAT FOR SET2) /*Create Identifier Datasets for NON MATCHED DATA*/ proc sql noprint; create table SET1_NONMATCHED_IDENTIFIERS as select EPISODE_NR, FACILITY from SET1_IDENTIFIERS except select EPISODE_NR, FACILITY from SET2_IDENTIFIERS ; quit; (REPEAT FOR SET2) Proc sql noprint; create table SET1_NOMMATCHED as select * from SET1_NONMATCHED_IDENTIFIERS as a, SET1 as b where a.EPISODE_NR=b.EPISODE_NR and a.FACILITY=b.FACILITY ; quit; (REPEAT FOR SET2)
Please post a sample of both datasets that demostarate the problems and use as test data.
You might be looking at an Inner Join on EPISODE_NR and Facility to start.
proc sql;
create table both as
select a.EPISODE_NR, a.Facility
from (select EPISODE_NR, Facility from SET1) as a
inner join
(select EPISODE_NR, Facility from SET2) as b
on a.EPISODE_NR=b.EPISODE_NR and a.Facility=b.facility
;
quit;
This would give you the Episode_nr and Facility that match in both sets.
Then something like this would yeild the ones that are only in the first set but not the second, and in the second but not the first:
proc sql; create table OnlyIn1 as select * from(select EPISODE_NR, Facility from SET1) except select * from both; create table OnlyIn2 as select * from(select EPISODE_NR, Facility from SET2) except select * from both; quit;
If your data contains duplicates of EPISODE_NR for some facilities then you may want to consider adding DISTINCT to most of those selects.
You would now have 3 non-overlapping sets of your identification combinations and you could extract the specific sets as needed (hopefully) using Left or Inner Joins.
The code just pulls identifiers. What I was thinking about was if the Episode_nr occurend many times, like account_number within a bank. There would be the potential to generate lots of duplicates for each match, which is redundant information.
Once you determine that episode XX23 occurs in facility ABC from just data set 1 then you can use that infomation to select ALL the records from data set 1 that have the combination XX23 and ABC.
This does not yield the records you want but
might be useful for debugging. You may be able to alter it to get records.
Venn diagram analysis of the non unique identifiers in two datasets
Up to 5 dimensional ven diagrams are availiable along with Ian Whitlocks
sets of overlaps algorithm.
May only work with relatively small datasets (less that a million identifiers)
inspired by
https://goo.gl/XC0SKq
https://communities.sas.com/t5/SAS-Data-Management/Creating-subset-of-data-by-matching-observations-to-second-data/m-p/318359
HAVE two datasets with a common but non unique identifiers
======================================================
Up to 40 obs WORK.SET1 total obs=84,151
EPISODE_
Obs NR
1 250000
2 250003
3 250006
4 250009
5 250012
6 250012
....
Up to 40 obs WORK.SET2 total obs=200,818
EPISODE_
Obs NR
1 1000
2 1005
3 1005
4 1015
5 1020
6 1025
7 1030
8 1035
....
WANT the venn diagram showing number of
Set A is Element=EPISODE_NR
Set B is Element=EPISODE_NR
A Union B
284,969 Total
266,468 Distincts
SET1 SET2
84,151 200,818 Totals
83,334 199,801 Distincts
*8888888888* *88888888*
8 8 8 8
8 8 8 8
8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 Distincts 8 Distincts 8 Distincts 8
8 66,667 8 16,667 8 183,134 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8 8
8 8 8
8 8 8 8
*8888888888* *8888888888*
%macro utl_ven2
(
utitle=Venn diagram for two tables,
uinmem1=work.EpoNspMqs, /* first input SAS table */
uinmem2=work.EpoNspM_f, /* second input SAS table */
uvar1=qprikey, /* field names do not have to be the same */
uvar2=mprikey
) / des="VENN diagram two datasets";
%put %sysfunc(ifc(%sysevalf(%superq(uinmem1)=,boolean),**** Please Provide 1st SAS dataset ****,));
%put %sysfunc(ifc(%sysevalf(%superq(uinmem2)=,boolean),**** Please Provide 2nd SAS dataset ****,));
%put %sysfunc(ifc(%sysevalf(%superq(uvar1)=,boolean),**** Please Provide 1st Keys ****,));
%put %sysfunc(ifc(%sysevalf(%superq(uvar2)=,boolean),**** Please Provide 2nd Keys ****,));
%if %eval(
%sysfunc(ifc(%sysevalf(%superq(uinmem1)=,boolean),1,0)) +
%sysfunc(ifc(%sysevalf(%superq(uinmem2)=,boolean),1,0)) +
%sysfunc(ifc(%sysevalf(%superq(uvar1)=,boolean),1,0)) +
%sysfunc(ifc(%sysevalf(%superq(uvar2)=,boolean),1,0))
) eq 0 %then %do;
/* for testing without macro
data class1;set sashelp.class;if mod(_n_,2)=0; if _n_=11 then name='Joyce';run;
data class2;set sashelp.class;if mod(_n_,3)=0; if _n_=1 then name='Major';run;
%let uinmem1=class1;
%let uinmem2=class2;
%let uvar1=name;
%let uvar2=name;
*/
%let utl_lvlonea =%upcase(%sysfunc(ifc(%index(&uinmem1,%str(.))=0,WORK,%upcase(%scan(&uinmem1,1,%str(.))))));
%let utl_lvltwoa =%upcase(%sysfunc(ifc(%index(&uinmem1,%str(.))=0,&uinmem1,%upcase(%scan(&uinmem1,2,%str(.))))));
%let utl_lvlonetwoa= &utl_lvlone..&utl_lvltwo;
%let utl_lvloneb =%upcase(%sysfunc(ifc(%index(&uinmem2,%str(.))=0,WORK,%upcase(%scan(&uinmem2,1,%str(.))))));
%let utl_lvltwob =%upcase(%sysfunc(ifc(%index(&uinmem2,%str(.))=0,&uinmem2,%upcase(%scan(&uinmem2,2,%str(.))))));
%let utl_lvlonetwob= &utl_lvlone..&utl_lvltwo;
%let uinmema = %upcase( &utl_lvltwoa);
%let uinmemb = %upcase( &utl_lvltwob);
%let uvara = %upcase( &uvar1);
%let uvarb = %upcase( &uvar2);
/*----------------------------------------------*\
| SQL code to get counts |
| udsta = distinct values is set a |
| udstb = distinct values is set b |
| |
| unota = values not in a |
| unotb = values not in b |
| |
| uaib = values in intersection |
\*----------------------------------------------*/
proc sql noprint;
select memlabel , nobs
into :ulaba , :uina
from dictionary.tables
where libname="&utl_lvlonea" and memname="&uinmema";
select memlabel , nobs
into :ulabb , :uinb
from dictionary.tables
where libname="&utl_lvloneb" and memname="&uinmemb";
select count(distinct &uvara) into :udsta
from &uinmema;
select count(distinct &uvarb) into :udstb
from &uinmemb;
select count(distinct &uvara) into :unotb
from &uinmema
where &uvara not in
(select &uvarb as &uvara
from &uinmemb);
select count(distinct &uvarb) into :unota
from &uinmemb
where &uvarb not in
(select &uvara as &uvarb
from &uinmema);
select count(distinct &uvara) into :uaib
from &uinmema
where &uvara in
(select &uvarb as &uvara
from &uinmemb);
quit;
run;
%let udstab=%eval(&unota + &unotb + &uaib); /* total distinct*/
%let uaub=%eval(&uina + &uinb); /* a union b */
%put unota=&unota; /* not in a */
%put unotb=&unotb; /* not in b */
%put uaib=&uaib; /* a intersect b */
%put uina=&uina; /* in a */
%put uinb=&uinb; /* in b */
data _null_;
file print;
uaub=%eval(&uina + &uinb);
unota=&unota;
unotb=&unotb;
udsta=&udsta;
udstb=&udstb;
udstab=&udstab;
uaib=&uaib;
uina=&uina;
uinb=&uinb;
put #02 @10 "Set A is %trim(%left(&ulaba))" @53 "Element=%trim(%left(&uvara))";
put #03 @10 "Set B is %trim(%left(&ulabb))" @53 "Element=%trim(%left(&uvarb))";
put #05 @26 " A Union B "@;
put #06 @22 uaub comma12. @;
put #07 @22 udstab comma12.@;
put #08 @20 "&uinmema" @35 "&uinmemb"@;
put #09 @16 uina comma12. @31 uinb comma12.;
put #10 @16 udsta comma12. @31 udstb comma12.;
put #12 @3 " *8888888888* *88888888* "@;
put #13 @3 " 8 8 8 8 "@;
put #14 @3 " 8 8 8 8 "@;
put #15 @3 " 8 8 8 "@;
put #16 @3 " 8 8 8 8 "@;
put #17 @3 " 8 8 8 8 "@;
put #18 @3 " 8 8 8 8 "@;
put #19 @3 " 8 8 8 8 "@;
put #20 @3 " 8 8 8 8 "@;
put #21 @3 " 8 8 8 8 "@;
put #22 @3 " 8 8 8 8 "@;
put #23 @3 " 8 8 8 8 "@;
put #23 @10 unotb comma12. @22 uaib comma12. @39 unota comma12. @;
put #23 @3 " 8"@;
put #24 @3 " 8 8 8 8 "@;
put #25 @3 " 8 8 8 8 "@;
put #26 @3 " 8 8 8 8 "@;
put #27 @3 " 8 8 8 8 "@;
put #28 @3 " 8 8 8 8 "@;
put #29 @3 " 8 8 8 8 "@;
put #30 @3 " 8 8 8 8 "@;
put #31 @3 " 8 8 8 8 "@;
put #32 @3 " 8 8 8 8 "@;
put #33 @3 " 8 8 8 "@;
put #34 @3 " 8 8 8 8 "@;
put #35 @3 " *8888888888* *8888888888* "@;
put #36 @3 " "@;
stop;
run;
%end;
%mend utl_ven2;
data set1;
do episode_nr = 250000 to 500000 by 3;
if uniform(-1)<.01 then output; /*inject an dup error */
output;
end;
run;quit;
data set2;
do episode_nr = 1000 to 1000000 by 5;
if uniform(-1)<.005 then output; /*inject an dup error */
output;
end;
run;quit;
%utl_ven2
(
uinmem1=set1,
uinmem2=set2,
uvar1=episode_nr,
uvar2=episode_nr
);
Alright, so unfortunately @ballardw this code didn't work out quite as I think you or I had expected it to.
It ended up producing a dataset with a little over 33 times as many entries as either of the two datasets from which it was being drawn (over 100 million entries). As you can imagine, the vast majority of these were duplicates.
I tried running a Proc SQL command to remove duplicates, but the set is so large it failed because the SAS work directory ran out of space (meaning the temporary file size got above 30GB).
I created a subset of the data to continue playing around with this until it is figured out.
Currently running the following code (plus the same thing with the data sets reversed):
proc sql noprint; create table MATCHED1 as select a. * from SET1 a, SET2 b where a.EPISODE_NR = b.EPISODE_NR and a.FACILITY = b.FACILITY ; quit;
But it's the end of the day so I'll have to wait to find out if it worked when I get back in tomorrow morning.
Post some small examples of the data, or something that's similar and what you expect.
Note that I did mention that you may want to use DISTINCT in the selects to identify unique combinations of values. The muliplication of records indicates multiple matches..
Alright, so I made up two very simplified dummy data sets and attached them here.
I tested the code that I posted yesterday (shown again below), and it works (the matched dataset attached was made using this code). I ended up having to cancel it on the full dataset because it was still running after 18 hours and I had to free up system resources to run other code. But at least I now know that it works.
proc sql noprint; create table DUMMY_MATCHED as select distinct a. * from DUMMY_SET_1 a, DUMMY_SET_2 b where a.EPISODE_NR = b.EPISODE_NR and a.FACILITY = b.FACILITY ; quit;
Now I just need to figure out how to get non-matching entries (one data set for each input set).
I thought the following code might work, but nope .It pulls most but not all of the non-matching entries, and it also pulls in several matching entries and I can't figure out why. I have attached the resulting non-matched data sets produced by this code.
proc sql noprint; create table DUMMY_NON_MATCHED as select DISTINCT a. * from DUMMY_SET_1 a, DUMMY_SET_2 b where (a.EPISODE_NR = b.EPISODE_NR and a.FACILITY NE b.FACILITY) or a.EPISODE_NR NOT IN (select EPISODE_NR from DUMMY_SET_2) ; quit;
I tried another method to find entries that occur in one data set but not the other, and it appears to function exactly as I need it to with the dummy dataset. Unfortunately, it doesn't when I use the full datasets because the placement, name, type, and number of variables differ, so it finds no matches and once again just creates a new dataset containing all the same entries as the original (or fails entirely).
proc sql noprint; create table DUMMY_NON_MATCHED_EXC as select * from DUMMY_SET_1 except select * from DUMMY_SET_2 ; quit;
Just in case anyone is actually following this (probably not), here is how I solved both the matching and the non-matching data.
Matching:
proc sql noprint; create table SET1_MATCHED as select distinct a. * from SET1 a, SET2 b where a.EPISODE_NR = b.EPISODE_NR and a.FACILITY = b.FACILITY ; quit; (REPEATED FOR SET2)
Non-Matching: It involves 3 steps for each data set. It isn't the most elegent and there is probably a more efficient way to do it, but it works.
/*Create Identifier Datasets*/ proc sql noprint; create table SET1_IDENTIFIERS as select EPISODE_NR, FACILITY from SET1 ; quit; (REPEAT FOR SET2) /*Create Identifier Datasets for NON MATCHED DATA*/ proc sql noprint; create table SET1_NONMATCHED_IDENTIFIERS as select EPISODE_NR, FACILITY from SET1_IDENTIFIERS except select EPISODE_NR, FACILITY from SET2_IDENTIFIERS ; quit; (REPEAT FOR SET2) Proc sql noprint; create table SET1_NOMMATCHED as select * from SET1_NONMATCHED_IDENTIFIERS as a, SET1 as b where a.EPISODE_NR=b.EPISODE_NR and a.FACILITY=b.FACILITY ; quit; (REPEAT FOR SET2)
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.