BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Elliott
Obsidian | Level 7

I have 2 datasets that I need to compare records to see if variable values from each match.

If they match on the specific variables I assign a Rule number 1 - 6.

There is a hierarchy so if the record matches on rule 1 then it does not need to be evaluated again.

There will also be records that do not match that are contained in the main dataset and need to be retained.

 

MATCHING HIERACHY:
1) pendingno = refno and lastname = last_name then rule = 1 and add recevdate (rule and recevdate are added to the data from the main_data record)
2) pendingno = refno then rule = 2 and add recevdate (rule and recevdate are added to the data from the main_data record)
3) acct = acctno and lastname = last_name then rule = 3 and add recevdate (rule and recevdate are added to the data from the main_data record)
4) acct = acctno then rule = 4 and add recevdate (rule and recevdate are added to the data from the main_data record)
5) id = sid then rule = 5 and add recevdate (rule and recevdate are added to the data from the main_data record)
6) lastname = last_name and firstname = first_name then rule = 6 and add recevdate (rule and recevdate are added to the data from the main_data record)

There will be records that don't have any matches from the main_data they will just have blanks for rule and recevdate.

 

Have:

Main Data:

data work.main_data;
infile datalines dlm=',' dsd;
length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
input lastname firstname pendingno id acct $ assigndate :date9.;
format assigndate date9.;
datalines;

JONES, JACK, 05522, A123456B85946, 079458678, 4569874589375158, 04JUN2021
LYONS, LOIS,  04437, A123357B85946, 129458646, 6559874589375321, 22JAN2022
WILLIAMS, JOHN, 05522, A123478C25946, 379429678, 5429874589374758, 07AUG2021
BETZINA, ANN,  04437, D983456B85946, 469457578, 4569874589375894, 15OCT2021
JOYES, JACK,   00301, A123456B85946, 279258678, 3729874589375158, 04SEP2021
SAMS, MACK,   02747, A128547B85946, 654458646, 455874589375321, 27MAY2022
WILLIS, JANE,   00301, A123458X25946, 379447878, 5412345589374758, 13AUG2021
BETXINA, RUTH,   04437, D983426B85946, 269453578, 4569874589387651, 25DEC2021
DAVIS, DAVID,   05522, B123458X25136, 379442778, 5418445589374758, 13APR2021
BATHE, ROBERT,   04437, D983716B85946, 264853578, 4569891589387657, 25MAR2022
;
run;

Match to Data:

data work.match_data;
infile datalines dlm=',' dsd;
length last_name first_name $25 refno $13 sid $10 acctno $16;
input last_name first_name refno sid acctno $ recevdate :date9.;
format recevdate date9.;
datalines;

JONES, JACK, A123456B85946, 079458678, 4569874589375158, 05JUN2021
LYONS, LOIS, A123357B85946, , 6559874589375321, 23JAN2022
WILLIAMS, JAMES, , 379429678, 5429874589374758, 09AUG2021
BETZINA, ANN, D983456B85946, 469457578, 4569874589375158, 18OCT2021
JOYES, JACK, A123456B85946, 279258678, 3729874589375158, 07SEP2021
SAMS, MACK, A128547B85946, 654458646, , 28MAY2022
WILLS, JADE, A123458X25946, 379447878, 5412345589374758, 15AUG2021
BETXINA, RUTH, D983426B85946, 269453578, 4569874589387651, 27DEC2021
WILLIAMS, JOHN, A123478C25946, 379429678, 5429874589374758, 08AUG2021

;

run;

 

I am at a loss as to how to do the matching with the hierarchy and get the output I need to summarize all the records from the main_data.

 

I have tried data steps and proc sql  but there is not just 1 field in both that can join the records so I cannot seem to wrap my head around the matching hierarchy and how to get the right result.

 

Then I have to take the data and summarize based on store with counts for how many records in each rule or how many mismatched.

 

Any assistance will be greatly appreciated.

Thanks,

Elliott

 

1 ACCEPTED SOLUTION
11 REPLIES 11
andreas_lds
Jade | Level 19

It is not clear what you want to compare: each obs in "main_data" with each obs in "match_data", or just first obs with first obs, second with second ... ?

 

If you want the later, use merge without by, rename the variables in the second dataset  and use if-then-else to set rule_number.

Elliott
Obsidian | Level 7

I want to evaluate every record in main_data for matches to match_data if there is a match then set rule value based on the hierarchy, so if the match is the #1 match then that is the highest level, if it is not matched to 1 then evaluate for a match to #2 etc.  so the highest level match should be retained in the rule variable.   The only variable out of the match_data that needs to be captured on the record is the date value.

FreelanceReinh
Jade | Level 19

Hello @Elliott,

 

For each of the six rules you could create an index on dataset MATCH_DATA and then use SET statements with the KEY= option until a match is found or all rules have been applied unsuccessfully:

proc datasets lib=work nolist;
modify match_data;
  index create reflast=(refno last_name);
  index create refno;
  index create acclast=(acctno last_name);
  index create acctno;
  index create sid;
  index create name=(last_name first_name);
quit;

data want(drop=last_name first_name refno sid acctno);
set main_data;
refno=pendingno;
last_name=lastname;
set match_data key=reflast/unique;
if _iorc_=0 then rule=1;
else do;
  set match_data key=refno/unique;
  if _iorc_=0 then rule=2;
  else do;
    acctno=acct;
    set match_data key=acclast/unique;
    if _iorc_=0 then rule=3;
    else do;
      set match_data key=acctno/unique;
      if _iorc_=0 then rule=4;
      else do;
        sid=id;
        set match_data key=sid/unique;
        if _iorc_=0 then rule=5;
        else do;
          first_name=firstname;
          set match_data key=name/unique;
          if _iorc_=0 then rule=6;
          else recevdate=.;
        end;
      end;
    end;
  end;
end;
_error_=0;
run;

I'm not sure, though, of the performance and disk space requirements if MATCH_DATA is very large.

 

Elliott
Obsidian | Level 7

I tried this query on my sample data and it was taking so long to run on just 12  sample records I had to stop it, so I don't want to try with 10000+ records.  But Thank you, I have learned something new.

Patrick
Opal | Level 21

@Elliott wrote:

I tried this query on my sample data and it was taking so long to run on just 12  sample records I had to stop it, so I don't want to try with 10000+ records.  But Thank you, I have learned something new.


If the match table got only some 10000+ records then a hash approach as below should work.

data work.main_data;
  infile datalines dlm=',' dsd truncover;
  length lastname firstname $25 store $5 pendingno $13 id $10 acct $16;
  input lastname firstname store pendingno id acct $ assigndate :date9.;
  format assigndate date9.;
  datalines;
JONES,JACK,05522,A123456B85946,079458678,4569874589375158,04JUN2021
LYONS,LOIS,04437,A123357B85946,129458646,6559874589375321,22JAN2022
WILLIAMS,JOHN,05522,A123478C25946,379429678,5429874589374758,07AUG2021
BETZINA,ANN,04437,D983456B85946,469457578,4569874589375894,15OCT2021
JOYES,JACK,00301,A123456B85946,279258678,3729874589375158,04SEP2021
SAMS,MACK,02747,A128547B85946,654458646,455874589375321,27MAY2022
WILLIS,JANE,00301,A123458X25946,379447878,5412345589374758,13AUG2021
BETXINA,RUTH,04437,D983426B85946,269453578,4569874589387651,25DEC2021
DAVIS,DAVID,05522,B123458X25136,379442778,5418445589374758,13APR2021
BATHE,ROBERT,04437,D983716B85946,264853578,4569891589387657,25MAR2022
;

data work.match_data;
  infile datalines dlm=',' dsd truncover;
  length last_name first_name $25 refno $13 sid $10 acctno $16;
  input last_name first_name refno sid acctno $ recevdate :date9.;
  format recevdate date9.;
  datalines;
JONES,JACK,A123456B85946,079458678,4569874589375158,05JUN2021
LYONS,LOIS,A123357B85946,,6559874589375321,23JAN2022
WILLIAMS,JAMES,,379429678,5429874589374758,09AUG2021
BETZINA,ANN,D983456B85946,469457578,4569874589375158,18OCT2021
JOYES,JACK,A123456B85946,279258678,3729874589375158,07SEP2021
SAMS,MACK,A128547B85946,654458646,,28MAY2022
WILLS,JADE,A123458X25946,379447878,5412345589374758,15AUG2021
BETXINA,RUTH,D983426B85946,269453578,4569874589387651,27DEC2021
WILLIAMS,JOHN,A123478C25946,379429678,5429874589374758,08AUG2021
;

data work.v_match_data /view=work.v_match_data;
  set work.match_data;
  rename 
    refno=pendingno
    last_name=lastname
    first_name=firstname
    acctno = acct
    sid = id
    ;
run;

data want;
  if _n_=1 then
    do;
      if 0 then set work.v_match_data(keep=recevdate);
      dcl hash h_rule1(dataset:'work.v_match_data');
      h_rule1.defineKey('pendingno','lastname');
      h_rule1.defineData('recevdate');
      h_rule1.defineDone();

      dcl hash h_rule2(dataset:'work.v_match_data');
      h_rule2.defineKey('pendingno');
      h_rule2.defineData('recevdate');
      h_rule2.defineDone();
      /*** and so on ***/

    end;
  call missing(of _all_);

  set work.main_data;

  if h_rule1.find()=0 then;
  else if h_rule2.find()=0 then;
  /*** and so on ***/
run;
Elliott
Obsidian | Level 7

Hi, this code works but I also need it to assign a value in a new variable called Rule, which is a 1. numeric that populates with 1 - 6 based on the matching rule.  Then I also need any that do not match from the main file to be retained and assign Rule=0;

I tried a few ways to add to the code you provided but was not able to create the new variable.  What modifications are necessary to create rule assignment and retain the unmatched records?

 

Thanks,

Elliott
Obsidian | Level 7
Thank you very much, it works perfectly. I had tried that this morning but I must have had something else wrong because it did not work.
Thanks!
Elliott
Obsidian | Level 7

Hi Patrick,

 

I have a question... how do I avoid having matches to missing acct /acctno from rule #4?

 

Thanks,

 

Elliott
Obsidian | Level 7

it looks like the matching on missing values is also happening with acct/lastname and id.  So my results are showing higher matches then there actually are.

 

I have been researching hash objects for several hours and I have not been able to figure out how to exclude the matching on missing values.

 

Thanks,

Patrick
Opal | Level 21

Just exclude missings from the lookup and you will never get a match to missings.

data want;
  if _n_=1 then
    do;
      if 0 then set work.v_match_data(keep=recevdate);
      dcl hash h_rule1(dataset:'work.v_match_data');
      h_rule1.defineKey('pendingno','lastname');
      h_rule1.defineData('recevdate');
      h_rule1.defineDone();

      dcl hash h_rule2(dataset:'work.v_match_data');
      h_rule2.defineKey('pendingno');
      h_rule2.defineData('recevdate');
      h_rule2.defineDone();

      dcl hash h_rule3(dataset:'work.v_match_data');
      h_rule3.defineKey('acct','lastname');
      h_rule3.defineData('recevdate');
      h_rule3.defineDone();

      dcl hash h_rule4(dataset:'work.v_match_data(where=(not missing(acct))');
      h_rule4.defineKey('acct');
      h_rule4.defineData('recevdate');
      h_rule4.defineDone();
      /*** and so on ***/

    end;
  call missing(of _all_);

  set work.main_data;

  if h_rule1.find()=0 then rule=1;
  else if h_rule2.find()=0 then rule=2;
  else if h_rule3.find()=0 then rule=3;
  else if h_rule4.find()=0 then rule=4;
  /*** and so on ***/
  else rule=0;
run;

Patrick_0-1658196003157.png

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 2305 views
  • 0 likes
  • 4 in conversation