BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Saslearning2020
Calcite | Level 5

Hello,

I am struggling to merge two databases and wondering whether I can get some advice from this community.

 

For example

Dataset A has three variables:

ID Seqnum Pay

1  7   -10

1  12   -10

1  13   -5

2  16   -15

3  20   -20

 

Dataset B has three variable

ID  Seqnum  Pay

1    8   10

1    9   10

1    10   10

1    11   5

1    14   15

2    15   15

2    17   25

2    18   15

3    19  20

3    21  25

 

All I want is each negative payment in my source file Dataset A should look in Dataset B, match by ID and set off the first matched positive payment. Output should be:

ID  Seqnum  Pay

1    10   10

1    14   15

2    17   25

2    18   15

3    21  25

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Hi @Saslearning2020  Pardon me for the delay.  Methinks this slight change

 if sign(pay)=-1 and h.check(key:abs(pay)) then continue;

should suffice.

 

Full version :

 

data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 and h.check(key:abs(pay)) then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

Please try and let me know. Thanks!

View solution in original post

16 REPLIES 16
novinosrin
Tourmaline | Level 20

HI @Saslearning2020  Welcome to SAS communities. I am afraid I am unable to comprehend 

 

"All I want is each negative payment in my source file Dataset A should look in Dataset B, match by ID and set off the first matched positive payment"

 

Can you please explain and elaborate the logic with reference to your expect output?

Saslearning2020
Calcite | Level 5

These are medical records. For example,  patient 1 (ID=1) got charged for $5 (seqnum = 11), but then it was found that this charge was wrong and it got cancelled out (i.e. seqnum = 13, pay = -5). I want to get a cleaned data set without the positive records which got cancelled later. These two data sets are originally combined together and I separated them into 2 data sets with one only having negative payments and the other only having positive payments.

novinosrin
Tourmaline | Level 20

Hi @Saslearning2020   I don't think you need to split the data. You can keep as one and do bygroup processing.

Assuming your PAY values are only INTEGERS, it makes a sweet spot for KEY INDEXING paint brush method using ARRAYS. You could consider the following where I have combined A and B into one dataset and named that as HAVE

 

data a;
	input id seqnum pay;
	datalines;
1 12 -10
1 13 -5
2 16 -15
3 20 -20
;
run;

data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data have;
set a b;
by 	id seqnum;
run;

data want;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  array t(99999) _temporary_;
  if sign(pay)=-1 then t(abs(pay))=1;
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if t(pay) then do;
   call missing(t(pay));
   continue;
  end;
  output;
 end;
 call missing(of t(*));
run;

/*Or*/

 


data want ;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if h.check()=0 then do;
    h.remove();
    continue;
  end;
  output;
 end;
 h.clear();
run;

 

@unison   You win my friend. I understood the question on seeing your solution. I admire your great analytical skills not just on this thread but on many others. I am posting this solution for my fun. Yours is original. Cheers! Tonight was Heineken 🙂

 

jhammouda
Obsidian | Level 7

Hi Novinosrin,

 

I think there are few issues with both the array and hash map solutions provided. You can test the code with the following data set (a), which essentially states that every transaction has been reversed, i.e  The want set that you define should have zero observations. However, both the array and the hash map code produce a set that has 3 elements. The code trips on reverse transactions of the same amount for the same ID. Your code assumes that an amount of reversal (negative amount) is unique per ID. Having said that I do not think it would be reasonable to assume that the payments are integers and that they could be used as indices. I did not dig deeper into your code, but you are grouping by ID and Seqnum. Though you have not done anything with seqnum, one should not have any logic depending on seqnum given the information provided. The only assumption that could be made about seqnum is that a transaction that reverses a prior one, has a higher sequence number than the one it reversed. 

 

I think there is merit to splitting the data as it leads to a straight forward merge (simple code, less error-prone) which is fairly efficient and would be more efficient than using one array that has all the data. Even if you prefer to solve this problem using an array you would want to have 2 arrays to take advantage of the fact that the number of reversals is typically much smaller than the number of all the paid transactions.

 

Best regards,

 

data a ;
input id seqnum pay;
datalines;
1 12 -10
1 13 -10
1 16 -10
1 30 -5
1 22 -15
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;

novinosrin
Tourmaline | Level 20

Hi @jhammouda  First off, Thank you for taking the time to write and explain at length what the requirement is. I really do like how you portrayed the paid and reversal of transactions. Second of all, your professional demeanor is very noteworthy and I would like to copy such traits in my posts while so far all my participation has been entertainment playing a video game called SAS. 

 

Feeling obliged, I took your suggested sample A. Basically, the statement or grouping BY ID SEQNUM that you mentioned I am using in the logic isn't quite what it is. My apologies for not commenting nor explaining. 

 

The original dataset before the split(which I still think is not needed) should be a combined A+B in the order of SEQNUM, that's all.  Of course then +ve goes to B and -ve goes to A. Fair enough. Yes, the assumption of INTEGERS for ARRAY processing is illsuited though it's not hard to go little brute-force to achieve that for any number values of PAY. Nevertheless, HASH requires a very minor change i.e  the inclusion of dcl hash H (multidata:'y') ; and  h.removedup(); to address that.

 

So, Creating a new sample HAVE combining the suggested A, the full version is modified with the above 2 mentioned HASH changes 

 


data a ;
input id seqnum pay;
datalines;
1 12 -10
1 13 -10
1 16 -10
1 30 -5
1 22 -15
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;
data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

proc sort data=a;
by id seqnum;
run;

/*Creating the original dataset COMBINE*/
data have;
set a b;
by 	id seqnum;
run;

/*HASH modified*/


data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

 

 

 

 

jhammouda
Obsidian | Level 7

That would work.

Saslearning2020
Calcite | Level 5

Hi @novinosrin Thank you so much for the codes. It does the trick. After running the code on my original data, there came another issue. Due to entry errors, there are some negative payments which cannot be matched with positive payments. In such case, the current codes will not report these unmatched negative payments in the output. Is there a way to still show these unmatched negative payments? For example, if a dataline " 1 230 -1500" is added to data a  and a dataline "1 231 1400" is added to data b,the current output is 

id seqnum pay

1 231 1400

Is there a way to show the output as 

id seqnum pay

1 230 -1500

1 231 1400

 

Thanks!

Saslearning2020
Calcite | Level 5

Hi @novinosrin  Thank you very much for the codes. It works for my data. After running it on my original data, there came one issue. Due to entry errors by billing staff, some negative payments cannot be matched. In this case, the current codes will not report the unmatched negative payments. Is there a way to still show them?  

 

For example, if a dataline "1 230 -1500 " is added to data a and a dataline " 1 231 1400" is added to data b, then the current output is:

ID seqnum pay

1 231 1400

Is it possible to show:

ID seqnum pay

1 230 -1500

1 231 1400

Thanks.

 

novinosrin
Tourmaline | Level 20

Hey @Saslearning2020   Can you plz hang in there if you need my help. My boss and a few others are sitting besides me. I can look into it End of day. 

 

It irks me they turned up to office but they usually work from home. Smiley Sad I will try asap

Saslearning2020
Calcite | Level 5
Of course whenever it's convenient for you. Thanks so much!
novinosrin
Tourmaline | Level 20

Hi @Saslearning2020  Pardon me for the delay.  Methinks this slight change

 if sign(pay)=-1 and h.check(key:abs(pay)) then continue;

should suffice.

 

Full version :

 

data want ;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("pay") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.id);
  set have;
  by id;
  if sign(pay)=-1 then _iorc_=h.add(key:abs(pay),data:abs(pay));
 end;
 do _n_=1 to _n_;
  set have;
  if sign(pay)=-1 and h.check(key:abs(pay)) then continue;
  if h.check()=0 then do;
    h.removedup();
    continue;
  end;
  output;
 end;
 h.clear();
run;

Please try and let me know. Thanks!

Saslearning2020
Calcite | Level 5

@novinosrin  Thanks. It works. Really appreciated.

jhammouda
Obsidian | Level 7

Hi Saslearning2020,

 

Try this, it should do what you wanted. By the may my original posting had a missing by ID pay statement in thee merge data step.

 

Good luck

data a;
	input id seqnum pay;
	
	datalines;
1 12 -10
1 13 -10
1 16 -10	
1 30 -5
1 22 -15
1 230 -1500
2 24 -15
2 20 -25
2 23 -15
3 28 -20
3 27 -25
;
run;


data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
1 231 1400
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data pos_a ;
	set a;
	pay=-pay;
run;

proc sort data=b;
	by ID pay;
run;

proc sort data=pos_a;
	by ID pay;
run;

data non_matched;
	merge b(in=C) pos_a(in=R);
	by id pay;
	if R=0 then output;* this is in b but not in pos_a;
	else if C=0 then do;*this is in pos_a but not in b;
	pay=-pay;*convert it back to to its original negtive form;
	output;
	end;
run;

 

unison
Lapis Lazuli | Level 10

Looks like transaction data? You're trying to keep 'un-balanced' transactions?

Take a look at this...

 

data a;
	input id seqnum pay;
	datalines;
1 12 -10
1 13 -5
2 16 -15
3 20 -20
;
run;

data b;
	input id seqnum pay;
	datalines;
1 8 10
1 9 10
1 10 10
1 11 5
1 14 15
2 15 15
2 17 25
2 18 15
3 19 20
3 21 25
;
run;

data desired_output;
	input id seqnum pay;
	datalines;
1 9 10
1 10 10
1 14 15
2 17 25
2 18 15
3 21 25
;
run;

/* Edit a so that pay=-1*pay */
data a_edited(rename=(pay_neg=pay));
	set a;
	pay_neg=-1*pay;
	drop pay;
run;


/* Sort and merge */
proc sort data=b;
	by id pay;
run;

proc sort data=a_edited;
	by id pay;
run;

data want;
	merge b(in=B) a_edited(in=a rename=(seqnum=seqnum_a));
	by id pay;

	if b;

	if not first.pay or seqnum_a=.;
	drop seqnum_a;
run;

/* Checking that desired_output=want */
proc sort data=want;
	by id seqnum;
run;

proc compare base=desired_output compare=want;
run;
-unison

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 1558 views
  • 3 likes
  • 4 in conversation