Solved: Matching Data

Saslearning2020 · Posted 01-05-2020 06:52 PM

Hello,

I am struggling to merge two databases and wondering whether I can get some advice from this community.

For example

Dataset A has three variables:

ID Seqnum Pay

1 7 -10

1 12 -10

1 13 -5

2 16 -15

3 20 -20

Dataset B has three variable

ID Seqnum Pay

1 8 10

1 9 10

1 10 10

1 11 5

1 14 15

2 15 15

2 17 25

2 18 15

3 19 20

3 21 25

All I want is each negative payment in my source file Dataset A should look in Dataset B, match by ID and set off the first matched positive payment. Output should be:

ID Seqnum Pay

1 10 10

1 14 15

2 17 25

2 18 15

3 21 25

Thanks

novinosrin · Posted 01-06-2020 04:48 PM

Hi @Saslearning2020 Pardon me for the delay. Methinks this slight change

 if sign(pay)=-1 and h.check(key:abs(pay)) then continue;

should suffice.

Full version :

data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 and h.check(key:abs(pay)) then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

Please try and let me know. Thanks!

View solution in original post

novinosrin · Posted 01-05-2020 07:00 PM

HI @Saslearning2020 Welcome to SAS communities. I am afraid I am unable to comprehend

"All I want is each negative payment in my source file Dataset A should look in Dataset B, match by ID and set off the first matched positive payment"

Can you please explain and elaborate the logic with reference to your expect output?

Saslearning2020 · Posted 01-05-2020 07:14 PM

These are medical records. For example, patient 1 (ID=1) got charged for $5 (seqnum = 11), but then it was found that this charge was wrong and it got cancelled out （i.e. seqnum = 13, pay = -5). I want to get a cleaned data set without the positive records which got cancelled later. These two data sets are originally combined together and I separated them into 2 data sets with one only having negative payments and the other only having positive payments.

novinosrin · Posted 01-05-2020 08:51 PM

Hi @Saslearning2020 I don't think you need to split the data. You can keep as one and do bygroup processing.

Assuming your PAY values are only INTEGERS, it makes a sweet spot for KEY INDEXING paint brush method using ARRAYS. You could consider the following where I have combined A and B into one dataset and named that as HAVE

data a;
	input id seqnum pay;
	datalines;
1 12 -10
1 13 -5
2 16 -15
3 20 -20
;
run;

data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data have;
set a b;
by 	id seqnum;
run;

data want;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  array t(99999) _temporary_;
  if sign(pay)=-1 then t(abs(pay))=1;
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if t(pay) then do;
   call missing(t(pay));
   continue;
  end;
  output;
 end;
 call missing(of t(*));
run;

/*Or*/


data want ;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if h.check()=0 then do;
    h.remove();
    continue;
  end;
  output;
 end;
 h.clear();
run;

@unison You win my friend. I understood the question on seeing your solution. I admire your great analytical skills not just on this thread but on many others. I am posting this solution for my fun. Yours is original. Cheers! Tonight was Heineken 🙂

jhammouda · Posted 01-06-2020 01:56 AM

Hi Novinosrin,

I think there are few issues with both the array and hash map solutions provided. You can test the code with the following data set (a), which essentially states that every transaction has been reversed, i.e The want set that you define should have zero observations. However, both the array and the hash map code produce a set that has 3 elements. The code trips on reverse transactions of the same amount for the same ID. Your code assumes that an amount of reversal (negative amount) is unique per ID. Having said that I do not think it would be reasonable to assume that the payments are integers and that they could be used as indices. I did not dig deeper into your code, but you are grouping by ID and Seqnum. Though you have not done anything with seqnum, one should not have any logic depending on seqnum given the information provided. The only assumption that could be made about seqnum is that a transaction that reverses a prior one, has a higher sequence number than the one it reversed.

I think there is merit to splitting the data as it leads to a straight forward merge (simple code, less error-prone) which is fairly efficient and would be more efficient than using one array that has all the data. Even if you prefer to solve this problem using an array you would want to have 2 arrays to take advantage of the fact that the number of reversals is typically much smaller than the number of all the paid transactions.

Best regards,

data a ;
input id seqnum pay;
datalines;
1 12 -10
1 13 -10
1 16 -10
1 30 -5
1 22 -15
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;

novinosrin · Posted 01-06-2020 07:55 AM

Hi @jhammouda First off, Thank you for taking the time to write and explain at length what the requirement is. I really do like how you portrayed the paid and reversal of transactions. Second of all, your professional demeanor is very noteworthy and I would like to copy such traits in my posts while so far all my participation has been entertainment playing a video game called SAS.

Feeling obliged, I took your suggested sample A. Basically, the statement or grouping BY ID SEQNUM that you mentioned I am using in the logic isn't quite what it is. My apologies for not commenting nor explaining.

The original dataset before the split(which I still think is not needed) should be a combined A+B in the order of SEQNUM, that's all. Of course then +ve goes to B and -ve goes to A. Fair enough. Yes, the assumption of INTEGERS for ARRAY processing is illsuited though it's not hard to go little brute-force to achieve that for any number values of PAY. Nevertheless, HASH requires a very minor change i.e the inclusion of dcl hash H (multidata:'y') ; and h.removedup(); to address that.

So, Creating a new sample HAVE combining the suggested A, the full version is modified with the above 2 mentioned HASH changes


data a ;
input id seqnum pay;
datalines;
1 12 -10
1 13 -10
1 16 -10
1 30 -5
1 22 -15
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;
data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

proc sort data=a;
by id seqnum;
run;

/*Creating the original dataset COMBINE*/
data have;
set a b;
by 	id seqnum;
run;

/*HASH modified*/


data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

jhammouda · Posted 01-06-2020 09:31 AM

That would work.

Saslearning2020 · Posted 01-06-2020 12:28 PM

Hi @novinosrin Thank you so much for the codes. It does the trick. After running the code on my original data, there came another issue. Due to entry errors, there are some negative payments which cannot be matched with positive payments. In such case, the current codes will not report these unmatched negative payments in the output. Is there a way to still show these unmatched negative payments? For example, if a dataline " 1 230 -1500" is added to data a and a dataline "1 231 1400" is added to data b,the current output is

id seqnum pay

1 231 1400

Is there a way to show the output as

id seqnum pay

1 230 -1500

1 231 1400

Thanks!

Saslearning2020 · Posted 01-06-2020 12:45 PM

Hi @novinosrin Thank you very much for the codes. It works for my data. After running it on my original data, there came one issue. Due to entry errors by billing staff, some negative payments cannot be matched. In this case, the current codes will not report the unmatched negative payments. Is there a way to still show them?

For example, if a dataline "1 230 -1500 " is added to data a and a dataline " 1 231 1400" is added to data b, then the current output is:

ID seqnum pay

1 231 1400

Is it possible to show:

ID seqnum pay

1 230 -1500

1 231 1400

Thanks.

novinosrin · Posted 01-06-2020 12:49 PM

Hey @Saslearning2020 Can you plz hang in there if you need my help. My boss and a few others are sitting besides me. I can look into it End of day.

It irks me they turned up to office but they usually work from home. I will try asap

Saslearning2020 · Posted 01-06-2020 12:59 PM

Of course whenever it's convenient for you. Thanks so much!

novinosrin · Posted 01-06-2020 04:48 PM

Hi @Saslearning2020 Pardon me for the delay. Methinks this slight change

 if sign(pay)=-1 and h.check(key:abs(pay)) then continue;

should suffice.

Full version :

data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 and h.check(key:abs(pay)) then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

Please try and let me know. Thanks!

Saslearning2020 · Posted 01-07-2020 10:36 PM

@novinosrin Thanks. It works. Really appreciated.

jhammouda · Posted 01-06-2020 01:50 PM

Hi Saslearning2020,

Try this, it should do what you wanted. By the may my original posting had a missing by ID pay statement in thee merge data step.

Good luck

data a;
	input id seqnum pay;
	
	datalines;
1 12 -10
1 13 -10
1 16 -10	
1 30 -5
1 22 -15
1 230 -1500
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;


data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
1 231 1400
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data pos_a ;
	set a;
	pay=-pay;
run;

proc sort data=b;
	by ID pay;
run;

proc sort data=pos_a;
	by ID pay;
run;

data non_matched;
	merge b(in=C) pos_a(in=R);
	by id pay;
	if R=0 then output;* this is in b but not in pos_a;
	else if C=0 then do;*this is in pos_a but not in b;
	pay=-pay;*convert it back to to its original negtive form;
	output;
	end;
run;

unison · Posted 01-05-2020 07:32 PM

Looks like transaction data? You're trying to keep 'un-balanced' transactions?

Take a look at this...

data a;
	input id seqnum pay;
	datalines;
1 12 -10
1 13 -5
2 16 -15
3 20 -20
;
run;

data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data desired_output;
	input id seqnum pay;
	datalines;
1 9 10
1 10 10
1 14 15
2 17 25
2 18 15
3 21 25
;
run;

/* Edit a so that pay=-1*pay */
data a_edited(rename=(pay_neg=pay));
	set a;
	pay_neg=-1*pay;
	drop pay;
run;


/* Sort and merge */
proc sort data=b;
	by id pay;
run;

proc sort data=a_edited;
	by id pay;
run;

data want;
	merge b(in=B) a_edited(in=a rename=(seqnum=seqnum_a));
	by id pay;

	if b;

	if not first.pay or seqnum_a=.;
	drop seqnum_a;
run;

/* Checking that desired_output=want */
proc sort data=want;
	by id seqnum;
run;

proc compare base=desired_output compare=want;
run;

-unison

Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Re: Matching Data

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!