I have 2 datasets that I need to compare records to see if variable values from each match.
If they match on the specific variables I assign a Rule number 1 - 6.
There is a hierarchy so if the record matches on rule 1 then it does not need to be evaluated again.
There will also be records that do not match that are contained in the main dataset and need to be retained.
MATCHING HIERACHY:
1) pendingno = refno and lastname = last_name then rule = 1 and add recevdate (rule and recevdate are added to the data from the main_data record)
2) pendingno = refno then rule = 2 and add recevdate (rule and recevdate are added to the data from the main_data record)
3) acct = acctno and lastname = last_name then rule = 3 and add recevdate (rule and recevdate are added to the data from the main_data record)
4) acct = acctno then rule = 4 and add recevdate (rule and recevdate are added to the data from the main_data record)
5) id = sid then rule = 5 and add recevdate (rule and recevdate are added to the data from the main_data record)
6) lastname = last_name and firstname = first_name then rule = 6 and add recevdate (rule and recevdate are added to the data from the main_data record)
There will be records that don't have any matches from the main_data they will just have blanks for rule and recevdate.
Have:
Main Data:
data work.main_data;
infile datalines dlm=',' dsd;
length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
input lastname firstname pendingno id acct $ assigndate :date9.;
format assigndate date9.;
datalines;
JONES, JACK, 05522, A123456B85946, 079458678, 4569874589375158, 04JUN2021
LYONS, LOIS, 04437, A123357B85946, 129458646, 6559874589375321, 22JAN2022
WILLIAMS, JOHN, 05522, A123478C25946, 379429678, 5429874589374758, 07AUG2021
BETZINA, ANN, 04437, D983456B85946, 469457578, 4569874589375894, 15OCT2021
JOYES, JACK, 00301, A123456B85946, 279258678, 3729874589375158, 04SEP2021
SAMS, MACK, 02747, A128547B85946, 654458646, 455874589375321, 27MAY2022
WILLIS, JANE, 00301, A123458X25946, 379447878, 5412345589374758, 13AUG2021
BETXINA, RUTH, 04437, D983426B85946, 269453578, 4569874589387651, 25DEC2021
DAVIS, DAVID, 05522, B123458X25136, 379442778, 5418445589374758, 13APR2021
BATHE, ROBERT, 04437, D983716B85946, 264853578, 4569891589387657, 25MAR2022
;
run;
Match to Data:
data work.match_data;
infile datalines dlm=',' dsd;
length last_name first_name $25 refno $13 sid $10 acctno $16;
input last_name first_name refno sid acctno $ recevdate :date9.;
format recevdate date9.;
datalines;
JONES, JACK, A123456B85946, 079458678, 4569874589375158, 05JUN2021
LYONS, LOIS, A123357B85946, , 6559874589375321, 23JAN2022
WILLIAMS, JAMES, , 379429678, 5429874589374758, 09AUG2021
BETZINA, ANN, D983456B85946, 469457578, 4569874589375158, 18OCT2021
JOYES, JACK, A123456B85946, 279258678, 3729874589375158, 07SEP2021
SAMS, MACK, A128547B85946, 654458646, , 28MAY2022
WILLS, JADE, A123458X25946, 379447878, 5412345589374758, 15AUG2021
BETXINA, RUTH, D983426B85946, 269453578, 4569874589387651, 27DEC2021
WILLIAMS, JOHN, A123478C25946, 379429678, 5429874589374758, 08AUG2021
;
run;
I am at a loss as to how to do the matching with the hierarchy and get the output I need to summarize all the records from the main_data.
I have tried data steps and proc sql but there is not just 1 field in both that can join the records so I cannot seem to wrap my head around the matching hierarchy and how to get the right result.
Then I have to take the data and summarize based on store with counts for how many records in each rule or how many mismatched.
Any assistance will be greatly appreciated.
Thanks,
Elliott
It is not clear what you want to compare: each obs in "main_data" with each obs in "match_data", or just first obs with first obs, second with second ... ?
If you want the later, use merge without by, rename the variables in the second dataset and use if-then-else to set rule_number.
I want to evaluate every record in main_data for matches to match_data if there is a match then set rule value based on the hierarchy, so if the match is the #1 match then that is the highest level, if it is not matched to 1 then evaluate for a match to #2 etc. so the highest level match should be retained in the rule variable. The only variable out of the match_data that needs to be captured on the record is the date value.
Hello @Elliott,
For each of the six rules you could create an index on dataset MATCH_DATA and then use SET statements with the KEY= option until a match is found or all rules have been applied unsuccessfully:
proc datasets lib=work nolist;
modify match_data;
index create reflast=(refno last_name);
index create refno;
index create acclast=(acctno last_name);
index create acctno;
index create sid;
index create name=(last_name first_name);
quit;
data want(drop=last_name first_name refno sid acctno);
set main_data;
refno=pendingno;
last_name=lastname;
set match_data key=reflast/unique;
if _iorc_=0 then rule=1;
else do;
set match_data key=refno/unique;
if _iorc_=0 then rule=2;
else do;
acctno=acct;
set match_data key=acclast/unique;
if _iorc_=0 then rule=3;
else do;
set match_data key=acctno/unique;
if _iorc_=0 then rule=4;
else do;
sid=id;
set match_data key=sid/unique;
if _iorc_=0 then rule=5;
else do;
first_name=firstname;
set match_data key=name/unique;
if _iorc_=0 then rule=6;
else recevdate=.;
end;
end;
end;
end;
end;
_error_=0;
run;
I'm not sure, though, of the performance and disk space requirements if MATCH_DATA is very large.
I tried this query on my sample data and it was taking so long to run on just 12 sample records I had to stop it, so I don't want to try with 10000+ records. But Thank you, I have learned something new.
@Elliott wrote:
I tried this query on my sample data and it was taking so long to run on just 12 sample records I had to stop it, so I don't want to try with 10000+ records. But Thank you, I have learned something new.
If the match table got only some 10000+ records then a hash approach as below should work.
data work.main_data;
infile datalines dlm=',' dsd truncover;
length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
input lastname firstname store pendingno id acct $ assigndate :date9.;
format assigndate date9.;
datalines;
JONES,JACK,05522,A123456B85946,079458678,4569874589375158,04JUN2021
LYONS,LOIS,04437,A123357B85946,129458646,6559874589375321,22JAN2022
WILLIAMS,JOHN,05522,A123478C25946,379429678,5429874589374758,07AUG2021
BETZINA,ANN,04437,D983456B85946,469457578,4569874589375894,15OCT2021
JOYES,JACK,00301,A123456B85946,279258678,3729874589375158,04SEP2021
SAMS,MACK,02747,A128547B85946,654458646,455874589375321,27MAY2022
WILLIS,JANE,00301,A123458X25946,379447878,5412345589374758,13AUG2021
BETXINA,RUTH,04437,D983426B85946,269453578,4569874589387651,25DEC2021
DAVIS,DAVID,05522,B123458X25136,379442778,5418445589374758,13APR2021
BATHE,ROBERT,04437,D983716B85946,264853578,4569891589387657,25MAR2022
;
data work.match_data;
infile datalines dlm=',' dsd truncover;
length last_name first_name $25 refno $13 sid $10 acctno $16;
input last_name first_name refno sid acctno $ recevdate :date9.;
format recevdate date9.;
datalines;
JONES,JACK,A123456B85946,079458678,4569874589375158,05JUN2021
LYONS,LOIS,A123357B85946,,6559874589375321,23JAN2022
WILLIAMS,JAMES,,379429678,5429874589374758,09AUG2021
BETZINA,ANN,D983456B85946,469457578,4569874589375158,18OCT2021
JOYES,JACK,A123456B85946,279258678,3729874589375158,07SEP2021
SAMS,MACK,A128547B85946,654458646,,28MAY2022
WILLS,JADE,A123458X25946,379447878,5412345589374758,15AUG2021
BETXINA,RUTH,D983426B85946,269453578,4569874589387651,27DEC2021
WILLIAMS,JOHN,A123478C25946,379429678,5429874589374758,08AUG2021
;
data work.v_match_data /view=work.v_match_data;
set work.match_data;
rename
refno=pendingno
last_name=lastname
first_name=firstname
acctno = acct
sid = id
;
run;
data want;
if _n_=1 then
do;
if 0 then set work.v_match_data(keep=recevdate);
dcl hash h_rule1(dataset:'work.v_match_data');
h_rule1.defineKey('pendingno','lastname');
h_rule1.defineData('recevdate');
h_rule1.defineDone();
dcl hash h_rule2(dataset:'work.v_match_data');
h_rule2.defineKey('pendingno');
h_rule2.defineData('recevdate');
h_rule2.defineDone();
/*** and so on ***/
end;
call missing(of _all_);
set work.main_data;
if h_rule1.find()=0 then;
else if h_rule2.find()=0 then;
/*** and so on ***/
run;
Hi, this code works but I also need it to assign a value in a new variable called Rule, which is a 1. numeric that populates with 1 - 6 based on the matching rule. Then I also need any that do not match from the main file to be retained and assign Rule=0;
I tried a few ways to add to the code you provided but was not able to create the new variable. What modifications are necessary to create rule assignment and retain the unmatched records?
Thanks,
Like this?
Hi Patrick,
I have a question... how do I avoid having matches to missing acct /acctno from rule #4?
Thanks,
it looks like the matching on missing values is also happening with acct/lastname and id. So my results are showing higher matches then there actually are.
I have been researching hash objects for several hours and I have not been able to figure out how to exclude the matching on missing values.
Thanks,
Just exclude missings from the lookup and you will never get a match to missings.
data want;
if _n_=1 then
do;
if 0 then set work.v_match_data(keep=recevdate);
dcl hash h_rule1(dataset:'work.v_match_data');
h_rule1.defineKey('pendingno','lastname');
h_rule1.defineData('recevdate');
h_rule1.defineDone();
dcl hash h_rule2(dataset:'work.v_match_data');
h_rule2.defineKey('pendingno');
h_rule2.defineData('recevdate');
h_rule2.defineDone();
dcl hash h_rule3(dataset:'work.v_match_data');
h_rule3.defineKey('acct','lastname');
h_rule3.defineData('recevdate');
h_rule3.defineDone();
dcl hash h_rule4(dataset:'work.v_match_data(where=(not missing(acct))');
h_rule4.defineKey('acct');
h_rule4.defineData('recevdate');
h_rule4.defineDone();
/*** and so on ***/
end;
call missing(of _all_);
set work.main_data;
if h_rule1.find()=0 then rule=1;
else if h_rule2.find()=0 then rule=2;
else if h_rule3.find()=0 then rule=3;
else if h_rule4.find()=0 then rule=4;
/*** and so on ***/
else rule=0;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.