Solved: I need to match data from 2 separate data sets based on multiple varia...

Elliott · Posted 07-15-2022 12:31 AM

I have 2 datasets that I need to compare records to see if variable values from each match.

If they match on the specific variables I assign a Rule number 1 - 6.

There is a hierarchy so if the record matches on rule 1 then it does not need to be evaluated again.

There will also be records that do not match that are contained in the main dataset and need to be retained.

MATCHING HIERACHY:
1) pendingno = refno and lastname = last_name then rule = 1 and add recevdate (rule and recevdate are added to the data from the main_data record)
2) pendingno = refno then rule = 2 and add recevdate (rule and recevdate are added to the data from the main_data record)
3) acct = acctno and lastname = last_name then rule = 3 and add recevdate (rule and recevdate are added to the data from the main_data record)
4) acct = acctno then rule = 4 and add recevdate (rule and recevdate are added to the data from the main_data record)
5) id = sid then rule = 5 and add recevdate (rule and recevdate are added to the data from the main_data record)
6) lastname = last_name and firstname = first_name then rule = 6 and add recevdate (rule and recevdate are added to the data from the main_data record)

There will be records that don't have any matches from the main_data they will just have blanks for rule and recevdate.

Have:

Main Data:

data work.main_data;
infile datalines dlm=',' dsd;
length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
input lastname firstname pendingno id acct $ assigndate :date9.;
format assigndate date9.;
datalines;

JONES, JACK, 05522, A123456B85946, 079458678, 4569874589375158, 04JUN2021
LYONS, LOIS, 04437, A123357B85946, 129458646, 6559874589375321, 22JAN2022
WILLIAMS, JOHN, 05522, A123478C25946, 379429678, 5429874589374758, 07AUG2021
BETZINA, ANN, 04437, D983456B85946, 469457578, 4569874589375894, 15OCT2021
JOYES, JACK, 00301, A123456B85946, 279258678, 3729874589375158, 04SEP2021
SAMS, MACK, 02747, A128547B85946, 654458646, 455874589375321, 27MAY2022
WILLIS, JANE, 00301, A123458X25946, 379447878, 5412345589374758, 13AUG2021
BETXINA, RUTH, 04437, D983426B85946, 269453578, 4569874589387651, 25DEC2021
DAVIS, DAVID, 05522, B123458X25136, 379442778, 5418445589374758, 13APR2021
BATHE, ROBERT, 04437, D983716B85946, 264853578, 4569891589387657, 25MAR2022
;
run;

Match to Data:

data work.match_data;
infile datalines dlm=',' dsd;
length last_name first_name $25 refno $13 sid $10 acctno $16;
input last_name first_name refno sid acctno $ recevdate :date9.;
format recevdate date9.;
datalines;

JONES, JACK, A123456B85946, 079458678, 4569874589375158, 05JUN2021
LYONS, LOIS, A123357B85946, , 6559874589375321, 23JAN2022
WILLIAMS, JAMES, , 379429678, 5429874589374758, 09AUG2021
BETZINA, ANN, D983456B85946, 469457578, 4569874589375158, 18OCT2021
JOYES, JACK, A123456B85946, 279258678, 3729874589375158, 07SEP2021
SAMS, MACK, A128547B85946, 654458646, , 28MAY2022
WILLS, JADE, A123458X25946, 379447878, 5412345589374758, 15AUG2021
BETXINA, RUTH, D983426B85946, 269453578, 4569874589387651, 27DEC2021
WILLIAMS, JOHN, A123478C25946, 379429678, 5429874589374758, 08AUG2021

;

run;

I am at a loss as to how to do the matching with the hierarchy and get the output I need to summarize all the records from the main_data.

I have tried data steps and proc sql but there is not just 1 field in both that can join the records so I cannot seem to wrap my head around the matching hierarchy and how to get the right result.

Then I have to take the data and summarize based on store with counts for how many records in each rule or how many mismatched.

Any assistance will be greatly appreciated.

Thanks,

Elliott

Patrick · Posted 07-17-2022 11:06 AM

Like this?

View solution in original post

andreas_lds · Posted 07-15-2022 02:02 AM

It is not clear what you want to compare: each obs in "main_data" with each obs in "match_data", or just first obs with first obs, second with second ... ?

If you want the later, use merge without by, rename the variables in the second dataset and use if-then-else to set rule_number.

Elliott · Posted 07-15-2022 06:12 AM

I want to evaluate every record in main_data for matches to match_data if there is a match then set rule value based on the hierarchy, so if the match is the #1 match then that is the highest level, if it is not matched to 1 then evaluate for a match to #2 etc. so the highest level match should be retained in the rule variable. The only variable out of the match_data that needs to be captured on the record is the date value.

FreelanceReinh · Posted 07-15-2022 01:33 PM

Hello @Elliott,

For each of the six rules you could create an index on dataset MATCH_DATA and then use SET statements with the KEY= option until a match is found or all rules have been applied unsuccessfully:

proc datasets lib=work nolist;
modify match_data;
  index create reflast=(refno last_name);
  index create refno;
  index create acclast=(acctno last_name);
  index create acctno;
  index create sid;
  index create name=(last_name first_name);
quit;

data want(drop=last_name first_name refno sid acctno);
set main_data;
refno=pendingno;
last_name=lastname;
set match_data key=reflast/unique;
if _iorc_=0 then rule=1;
else do;
  set match_data key=refno/unique;
  if _iorc_=0 then rule=2;
  else do;
    acctno=acct;
    set match_data key=acclast/unique;
    if _iorc_=0 then rule=3;
    else do;
      set match_data key=acctno/unique;
      if _iorc_=0 then rule=4;
      else do;
        sid=id;
        set match_data key=sid/unique;
        if _iorc_=0 then rule=5;
        else do;
          first_name=firstname;
          set match_data key=name/unique;
          if _iorc_=0 then rule=6;
          else recevdate=.;
        end;
      end;
    end;
  end;
end;
_error_=0;
run;

I'm not sure, though, of the performance and disk space requirements if MATCH_DATA is very large.

Elliott · Posted 07-15-2022 07:33 PM

I tried this query on my sample data and it was taking so long to run on just 12 sample records I had to stop it, so I don't want to try with 10000+ records. But Thank you, I have learned something new.

Patrick · Posted 07-15-2022 11:40 PM

@Elliott wrote:

I tried this query on my sample data and it was taking so long to run on just 12 sample records I had to stop it, so I don't want to try with 10000+ records. But Thank you, I have learned something new.

If the match table got only some 10000+ records then a hash approach as below should work.

data work.main_data;
  infile datalines dlm=',' dsd truncover;
  length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
  input lastname firstname store pendingno id acct $ assigndate :date9.;
  format assigndate date9.;
  datalines;
JONES,JACK,05522,A123456B85946,079458678,4569874589375158,04JUN2021
LYONS,LOIS,04437,A123357B85946,129458646,6559874589375321,22JAN2022
WILLIAMS,JOHN,05522,A123478C25946,379429678,5429874589374758,07AUG2021
BETZINA,ANN,04437,D983456B85946,469457578,4569874589375894,15OCT2021
JOYES,JACK,00301,A123456B85946,279258678,3729874589375158,04SEP2021
SAMS,MACK,02747,A128547B85946,654458646,455874589375321,27MAY2022
WILLIS,JANE,00301,A123458X25946,379447878,5412345589374758,13AUG2021
BETXINA,RUTH,04437,D983426B85946,269453578,4569874589387651,25DEC2021
DAVIS,DAVID,05522,B123458X25136,379442778,5418445589374758,13APR2021
BATHE,ROBERT,04437,D983716B85946,264853578,4569891589387657,25MAR2022
;

data work.match_data;
  infile datalines dlm=',' dsd truncover;
  length last_name first_name $25 refno $13 sid $10 acctno $16;
  input last_name first_name refno sid acctno $ recevdate :date9.;
  format recevdate date9.;
  datalines;
JONES,JACK,A123456B85946,079458678,4569874589375158,05JUN2021
LYONS,LOIS,A123357B85946,,6559874589375321,23JAN2022
WILLIAMS,JAMES,,379429678,5429874589374758,09AUG2021
BETZINA,ANN,D983456B85946,469457578,4569874589375158,18OCT2021
JOYES,JACK,A123456B85946,279258678,3729874589375158,07SEP2021
SAMS,MACK,A128547B85946,654458646,,28MAY2022
WILLS,JADE,A123458X25946,379447878,5412345589374758,15AUG2021
BETXINA,RUTH,D983426B85946,269453578,4569874589387651,27DEC2021
WILLIAMS,JOHN,A123478C25946,379429678,5429874589374758,08AUG2021
;

data work.v_match_data /view=work.v_match_data;
  set work.match_data;
  rename 
    refno=pendingno
    last_name=lastname
    first_name=firstname
    acctno = acct
    sid = id
    ;
run;

data want;
  if _n_=1 then
    do;
      if 0 then set work.v_match_data(keep=recevdate);
      dcl hash h_rule1(dataset:'work.v_match_data');
      h_rule1.defineKey('pendingno','lastname');
      h_rule1.defineData('recevdate');
      h_rule1.defineDone();

      dcl hash h_rule2(dataset:'work.v_match_data');
      h_rule2.defineKey('pendingno');
      h_rule2.defineData('recevdate');
      h_rule2.defineDone();
      /*** and so on ***/

    end;
  call missing(of _all_);

  set work.main_data;

  if h_rule1.find()=0 then;
  else if h_rule2.find()=0 then;
  /*** and so on ***/
run;

Elliott · Posted 07-17-2022 10:56 AM

Hi, this code works but I also need it to assign a value in a new variable called Rule, which is a 1. numeric that populates with 1 - 6 based on the matching rule. Then I also need any that do not match from the main file to be retained and assign Rule=0;

I tried a few ways to add to the code you provided but was not able to create the new variable. What modifications are necessary to create rule assignment and retain the unmatched records?

Thanks,

Patrick · Posted 07-17-2022 11:06 AM

Like this?

Elliott · Posted 07-17-2022 03:19 PM

Thank you very much, it works perfectly. I had tried that this morning but I must have had something else wrong because it did not work.
Thanks!

Elliott · Posted 07-18-2022 01:20 PM

Hi Patrick,

I have a question... how do I avoid having matches to missing acct /acctno from rule #4?

Thanks,

Elliott · Posted 07-18-2022 03:15 PM

it looks like the matching on missing values is also happening with acct/lastname and id. So my results are showing higher matches then there actually are.

I have been researching hash objects for several hours and I have not been able to figure out how to exclude the matching on missing values.

Thanks,

Patrick · Posted 07-18-2022 10:00 PM

Just exclude missings from the lookup and you will never get a match to missings.

data want;
  if _n_=1 then
    do;
      if 0 then set work.v_match_data(keep=recevdate);
      dcl hash h_rule1(dataset:'work.v_match_data');
      h_rule1.defineKey('pendingno','lastname');
      h_rule1.defineData('recevdate');
      h_rule1.defineDone();

      dcl hash h_rule2(dataset:'work.v_match_data');
      h_rule2.defineKey('pendingno');
      h_rule2.defineData('recevdate');
      h_rule2.defineDone();

      dcl hash h_rule3(dataset:'work.v_match_data');
      h_rule3.defineKey('acct','lastname');
      h_rule3.defineData('recevdate');
      h_rule3.defineDone();

      dcl hash h_rule4(dataset:'work.v_match_data(where=(not missing(acct))');
      h_rule4.defineKey('acct');
      h_rule4.defineData('recevdate');
      h_rule4.defineDone();
      /*** and so on ***/

    end;
  call missing(of _all_);

  set work.main_data;

  if h_rule1.find()=0 then rule=1;
  else if h_rule2.find()=0 then rule=2;
  else if h_rule3.find()=0 then rule=3;
  else if h_rule4.find()=0 then rule=4;
  /*** and so on ***/
  else rule=0;
run;

I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rules

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Re: I need to match data from 2 separate data sets based on multiple variables in a hierarchy of rul

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away