BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
constliv
Obsidian | Level 7

I have two datasets on SAS EG version 5.1 (putting on Base SAS board because I assume the readership overlaps):

 

Dataset A: 

ID Customer_number    Prod_code    Balance 

1 0001-000111              HOL                100  

2 001-000111                HOL                 90 

3 001-999999               HOL                  26 

4 001-000111               HOL                  90 

5 001-000111               KLX                  90 

 

Dataset B : 

ID    Customer_number   Prod_code   Balance    Pre_CCF_Balance  EAD_Balance 

100  0001-000111         HOL             100           10                             20

200  001-000111           HOL             20             20                             10 

300  001-000111           HOL             90             90                              45 

400  001-999999           HOL             26             26                              10 

500  001-999999           HOL            90             90                               45 

600  001-000111           HOL            90            90                                45 

I want the program to take record id=1 from A and check each record in B, and if customer_number/ prod_code and balance matches with customer_number/ prod_code and either balance / pre_ccf_balance / ead_balance in B, then at the first match, regardless of which column matches, at the identification of first match to tag record in B as matched and record in A as matched. Then select id=2 from A and proceed again, but don't use any already matched records in B. 

 

So, for the example, output will be as follows (I also want a new column in both datasets telling which records matched with which). 

Dataset A: 

ID  Customer_number   Prod_code  Balance   Matched 

1    0001-000111           HOL            100          Matched with ID 100 in B 

2    001-000111             HOL             90           Matched with ID 300 in B 

3    001-999999             HOL             26           Matched with ID 400 in B 

4    001-000111             HOL            90            Matched with ID 600 in B (since 300 already used for id 2 above). 

5    001-000111             KLX           90              NOT MATCHED 

 

Dataset B : 

ID    Customer_number  Prod_code  Balance  Pre_CCF_Balance   EAD_Balance  Matched 

100  0001-000111          HOL            100          10                           20                      Matched with id 1 in A 

200  001-000111            HOL              20          20                           10                      NOT MATCHED 

300  001-000111            HOL              90          90                           45                      Matched with id 2 in A 

400  001-999999            HOL              26           26                          10                       Matched with id 3 in A 

500  001-999999            HOL             90         90                             45                      NOT MATCHED 

600  001-000111            HOL             90        90                              45                      Matched with id 4 in A

 

I tried approaching this using 2 set statements, but though it seems promising I am not able to nail the need to stop searching after first match, and excldue an observation in B from being matched once it is matched.

 

data t;

set survey;

put _n_ =;

length rem1 $100. rem2 $100. rem3 $100. matched_survey_2 $15.;

retain matched_survey_2;

do I = 1 TO NOBS;

set survey_2 NOBS=NOBS POINT=I;

if id_1 = id_2 and age_1 = r1 then

do;

rem1 = 'Age matched with r1 !!!';

output t;

end;

else if id_1 = id_2 and age_1 = r2 then

do;

point_c = put (i , z1.);

rem2 = 'Age matched with r2';

output t;

end;

end;

run;

 

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@constliv

Right - I've missed balance in table B. Below tweak should do.

data want;

  if _n_=1 then 
    do;
      if 0 then set a(keep=id balance rename=(id=id_a balance=balance_a));
      dcl hash h1(dataset:'a(rename=(id=id_a balance=balance_a))', multidata:'y');
      h1.defineKey('customer_number', 'Prod_code', 'balance_a');
      h1.defineData('id_a');
      h1.defineDone();
    end;
  call missing(id_a, balance_a);

  set b;

  do balance_a= balance, Pre_CCF_Balance, EAD_Balance;
    if h1.find()=0 then
      do;
        h1.removedup();
        leave;
      end;
  end;

run;

As for hash tables: They are basically nothing else than lookup tables in memory. You will need to learn the syntax as it's object oriented.

One of the great things about SAS hash tables: You can create and modify records during run time (=during data step iteration). So as soon as you have requirements where you want to only read once from a matching record or you want to create something like blacklist/whitelists during runtime ...consider using SAS hash tables. It's really worth learning how to use them. It's all in the SAS documentation.

View solution in original post

16 REPLIES 16
Astounding
PROC Star

Good decision to post just once instead of twice.

 

I think MERGE would make this easier.  It's still a little clumsy, but consider this approach.  Note that you have many common variables, and it's not clear what you need to merge by.  So you will need to take this approach and apply a lot of renaming to avoid conflicts and overwriting.  Since I know you need to pull out the matching ID, I'm illustrating the renaming just for ID:

 

data new_a (keep=just needed variables) new_b (keep=just needed variables);

in_a = 0;

in_b = 0;

merge a (in=in_a) b (in=in_b rename=(id=ID_B));

by customer_number;

if in_b then do;

   if in_a then matched = catx(' ', 'Matched with ID', ID, 'in A');

   else matched = 'Not Matched';

   output new_b;

end;

if in_a then do;

   if in_b then matched = catx(' ', 'Matched with ID', ID_B, 'in B');

   else matched = 'Not Matched';

   output new_a;

end;

call missing(ID, Prod_code, Balance, Pre_CCF_Balance, EAD_Balance);

run;

 

You will get notes about a many-to-many match, but this is the rare case in which those notes can be ignored.

constliv
Obsidian | Level 7
" it's not clear what you need to merge by" : merging will be based on customer_number and prod_code ; however the challenge is if the field balance in dataset A matches with either field balance or field "pre_ccf" or field "ead_balance" in dataset B, then it should be considered a match. The record thus matched in B should not be considered for any further matches at all. Move to the next record in dataset A and look for 'matches' in the remaining records of B.

Will explore your merge suggestion for my data ; meanwhile please see if my clarifications help. Thank you.
Astounding
PROC Star

Unfortunately, MERGE is not viable under those conditions.  It would be possible that the first observation in A to matches the third observation in B (regardless of the sorted order, that's a possibility).  MERGE would not let you re-use the 1st and 2nd observations from B in that case, it would merely find them unmatched.

Patrick
Opal | Level 21

@constliv

Please post sample data if possible via a working data step so we don't have to do this work for you in order to actually test what we propose.

 

Does below return the result you're after?

data a;
  infile datalines truncover dlm=' ';
  input ID $ Customer_number:$20. Prod_code $ Balance;
  datalines;
1 0001-000111 HOL 100 
2 001-000111 HOL 90 
3 001-999999 HOL 26 
4 001-000111 HOL 90 
5 001-000111 KLX 90 
;
run;

Data B;
  input ID $ Customer_number:$20. Prod_code $ Balance Pre_CCF_Balance EAD_Balance;
  infile datalines truncover dlm=' ';
  datalines;
100 0001-000111 HOL 100 10 20
200 001-000111 HOL 20 20 10 
300 001-000111 HOL 90 90 45 
400 001-999999 HOL 26 26 10 
500 001-999999 HOL 90 90 45 
600 001-000111 HOL 90 90 45 
;
run;


data want(drop=Balance);
  if _n_=1 then 
    do;
      if 0 then set a(rename=(id=id_a));
      dcl hash h1(dataset:'a(rename=(id=id_a))', multidata:'y');
      h1.defineKey('customer_number', 'Prod_code', 'balance');
      h1.defineData('id_a');
      h1.defineDone();
    end;
  call missing(id_a);

  set b;

  balance=Pre_CCF_Balance;
  if h1.find()=0 then
    do;
      h1.removedup();
    end;
  else
    do;
      balance=EAD_Balance;
      if h1.find()=0 then h1.removedup();
    end;
    
run;

constliv
Obsidian | Level 7

The first record in dataset A should have matched with the first record in dataset B as the customer number and product code are the same, and field balance in A matches with field balance in B for these records. However, the code is not showing this match (variable id_a is blank for record with ID = 100 in dataset want).

 

Secondly, I would have expected ID=500 to match with id_a = 4, because sequentially when looking for a match for id=4 from dataset A in dataset B, the record with id-500 in dataset B is the top most match for id=4. But in dataset want, id_a = 600 is reported. Not sure if this will cause erroenous results.

 

Finally, I prefer the output to be in terms of dataset A as the base, with the equivalent of id_a added as a column to dataset A. Right now , it is the other way round. So id_a will have values like 100 ,200 etc in the solution I prefer.

 

(regarding posting sample datastep, apologies I couldn't get that right. I tried copying the code from SAS EG and it seemed to go out of alignment and I inserted spaces between columns which seems to have defeated the purpose...will figure out how to get it right next time).

constliv
Obsidian | Level 7
Please disregard my below comment. This is not an error in the suggested program, the customer numbers are different, so no match

"
Secondly, I would have expected ID=500 to match with id_a = 4, because sequentially when looking for a match for id=4 from dataset A in dataset B, the record with id-500 in dataset B is the top most match for id=4. But in dataset want, id_a = 600 is reported. Not sure if this will cause erroenous results."
Patrick
Opal | Level 21

@constliv

"Secondly, I would have expected ID=500 to match with id_a = 4"

Why? In the data you've provided and which I've used in the code the customer_numbers are different between these records.

id_a=4 Customer_number=001-000111

id=500 Customer_number=001-999999

 

Aren't you only looking for matches with identical Customer_Number AND Prod_Code? That was my understanding of your requirement.

....or is it also a match for you if only one of these columns matches - Customer_Number OR Prod_Code?

constliv
Obsidian | Level 7

Patrick, you are perfectly correct. Please disregard that part of my reply, ""Secondly, I would have expected ID=500 to match with id_a = 4"".

Patrick
Opal | Level 21

@constliv

Soo... any remainder or does the code I've posted return the expected result?

 

...oh, I see, that's what you're talking about:

"Finally, I prefer the output to be in terms of dataset A as the base, with the equivalent of id_a added as a column to dataset A. Right now , it is the other way round. So id_a will have values like 100 ,200 etc in the solution I prefer."

 

All the code logic is there. I suggest you give this a go first and come back if you can't manage to make it work.

constliv
Obsidian | Level 7
One outstanding issue is that the first record in dataset A should have matched first record in dataset B, and that is not happening through your code.

In terms of requirement, this means that if customer_number and prod_code match and then the balance in dataset A matches with field any of the fields balance, pre_ccf_balance or ead_balance then it should be considered a match. No issue if the balance field is renamed in dataset b.

Hash programming is all Greek and Latin to me...still I frustrating your code by copying balance = balance_b , but got unexpected results.

If this can be fixed , I can use your code as is and would be delighted.
constliv
Obsidian | Level 7
Please read as, "still I tried tweaking your code by copying the find block for balance = balance_b , but got unexpected results."
Patrick
Opal | Level 21

@constliv

Right - I've missed balance in table B. Below tweak should do.

data want;

  if _n_=1 then 
    do;
      if 0 then set a(keep=id balance rename=(id=id_a balance=balance_a));
      dcl hash h1(dataset:'a(rename=(id=id_a balance=balance_a))', multidata:'y');
      h1.defineKey('customer_number', 'Prod_code', 'balance_a');
      h1.defineData('id_a');
      h1.defineDone();
    end;
  call missing(id_a, balance_a);

  set b;

  do balance_a= balance, Pre_CCF_Balance, EAD_Balance;
    if h1.find()=0 then
      do;
        h1.removedup();
        leave;
      end;
  end;

run;

As for hash tables: They are basically nothing else than lookup tables in memory. You will need to learn the syntax as it's object oriented.

One of the great things about SAS hash tables: You can create and modify records during run time (=during data step iteration). So as soon as you have requirements where you want to only read once from a matching record or you want to create something like blacklist/whitelists during runtime ...consider using SAS hash tables. It's really worth learning how to use them. It's all in the SAS documentation.

constliv
Obsidian | Level 7

@Patrick

Your tweaked code works perfectly for my requirements. Thank you.

 

I noticed the structure of the tweaked code is quite different from the original version. Will study more about hash tables given their power.

 

@Astounding

 

Thank you for your suggestions and important clarification on the limitations of match merge.

Patrick
Opal | Level 21

@constliv

The code might look quite different but it's logically pretty close to what I've posted initially.

 

Just to give you a head start in regards of getting up-to-speed with hash tables.

 

In order to use a variable in a hash table you must define it in the data step. Below line of code defines variables which don't get defined already via "set B".

Because a SAS hash table gets only created during run time SAS statements defining variables used in a hash can be anywhere in the code (also after the hash declaration). Below code serves no other purpose than to map variables not defined anywhere else.

      if 0 then set a(keep=id balance rename=(id=id_a balance=balance_a));

The next code snippet defines the hash table. Because hash tables get defined during run time and can get defined anytime during data step iteration, we need to ensure that this happens only once (-> if _N_=1...)

  if _n_=1 then 
    do;
      if 0 then set a(keep=id balance rename=(id=id_a balance=balance_a));
      dcl hash h1(dataset:'a(rename=(id=id_a balance=balance_a))', multidata:'y');
      h1.defineKey('customer_number', 'Prod_code', 'balance_a');
      h1.defineData('id_a');
      h1.defineDone();
    end;

To look up values in a hash table there are two methods: FIND() and CHECK()   They are called methods and not functions as it's object oriented 5th GL programming language syntax. It's <hash>.<method()>

If a method is successful then it returns zero, else it returns some other value (Remark: I find that still confusing. Intuitively I'd expect a 1 for TRUE and not a 0 which normally means FALSE ...but that's how it works. If you have a match it's a zero).

 

The CHECK() method just returns a value telling you if there was a match over the keys - as defined in h1.defineKey(...). The FIND() method does the same but additionally also returns the values as defined in h1.defineData(...)

 

So now for below code bit

if h1.find()=0 then

If there is a match then the FIND() method return a zero and though the condition 0=0 becomes TRUE. And because the FIND() method had a match it also returns the values defined in h1.defineData(...) to the SAS data step - in our case the values for id_a and balance_a.

 

And then if the condition...

if h1.find()=0

....is true we also need to delete the matching record from the hash so it doesn't get used anymore. That's what code...

h1.removedup();

...does. It deletes the matching record from the hash. Because we've used in the hash declare statement...

multidata:'y'

...there could be multiple records with the same key and we need to use method removedup() to only delete the currently matching record. Method remove() would delete all records with the same key.

 

And last but not least:

  do balance_a= balance, Pre_CCF_Balance, EAD_Balance;

 

This statement populates data step variable balance_a as we need this variable to look-up values in the hash (which has a same named key variable). Because balance_a gets populated in the data step it will always have a value even if there hasn't been a match.

If you want your code to be "perfect" then you should add at the end some logic like...

  if missing(id_a) then call missing(balance_a);

...so balance_a remains only populated if there had been a match to table A.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 2738 views
  • 3 likes
  • 3 in conversation