BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
yliu1234
Obsidian | Level 7

I have a transaction table, for the same account and same amount, if there are one debit and one credit transactions, then delete both.  Each debit transaction should cancel out one equal amount credit transaction of the same account, the sequence doesn't matter.  

 

 

account_idamounttypetrans_seq
001100credit1
001100credit2
001100debit1
001100debit32
00150credit3
00230credit1
00350credit1
00320debit1
000123abc104.42credit2
000123abc104.42credit2
000123abc104.42debit2
00410credit1
00410credit2
00510debit1
00510debit2
00510debit3

 

data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1

000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2

004 10 credit 1

004 10 credit 2

005 10 debit 1

005 10 debit 2

005 10 debit 3;
run;

the table I want

 

account_idamounttypetrans_seq
00150credit3
00230credit1
00350credit1
00320debit1
000123abc104.42credit2
00410credit1
00410credit2
00510debit1
00510debit2
00510debit3
    
1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

A low-tech solution would be:

 

roc sort data=tran; 
by account_id amount type trans_seq; 
run;

data temp;
set tran; 
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;

data want;
merge temp(where=(type="credit") in=c) 
      temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;
PG

View solution in original post

15 REPLIES 15
novinosrin
Tourmaline | Level 20
data have;
input account_id	amount	type $	trans_seq;
cards;
001	100	credit	1
001	100	credit	2
001	100	debit	1
001	100	debit	32
001	50	credit	3
002	30	credit	1
003	50	credit	1
003	20	debit	1
;

data want;
 if _n_=1 then do;
  dcl hash H (multidata:'y') ;
  h.definekey  ("amount") ;
  h.definedone () ;
 end;
 do _n_=1 by 1 until(last.account_id);
  set have;
  by account_id;
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check()=0 then rc=h.removedup();
   else h.add();
  end;
 end;
 do _n_=1 to _n_;
  set have;
  if h.check()= 0 then output;
 end;
 h.clear();
 drop rc;
run;

proc print noobs;run;
account_id amount type trans_seq
1 50 credit 3
2 30 credit 1
3 50 credit 1
3 20 debit 1
yliu1234
Obsidian | Level 7
novinosrin, thank you very much. I am very new to SAS, could you add a little bit explanation of your code or your logic?

novinosrin
Tourmaline | Level 20

Hi @yliu1234 It's a simple look up exercise to begin with. The logic construct deals with pairwise look-up. A pair is defined as a combination of credit, debit for a particular amount. You can imagine this as one group. The parent group is your account_id within which each pair of equal credit and debit amount is constituted as child group.  In other words, you have something like sets of 

 

Group by Accountid, Amount, (credit & debit)  =>sets of two

 

You create a memory resident(Hash) look up table and retain the records that do not conform to the defined pairing logic i.e. by excluding the pairs. You write the retained records to the resulting dataset and clear the contents of Hash memory once the processing for each by group is complete. The same is repeated for each Accountid and we are done!

yliu1234
Obsidian | Level 7

my account_id is saved as character, I add three rows here, it will not work.  Should I add a number index for my account_id? I wonder how to do this?

novinosrin
Tourmaline | Level 20

Okay, It's not about character or numeric that caused the problem. It's about the duplicate credit of 104.42. No worries, I am a little busy now. I will respond to you by the evening.

 

Also, @yliu1234  I have understood that one scenario. However, Can you please post a comprehensive sample of all possible scenarios of transaction pattern with the expected output?

novinosrin
Tourmaline | Level 20

Hi @yliu1234  Please try the modified code below and let me know. I believe this should do it.

 

data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1
000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2
;
RUN;

data want;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("_amount") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id notsorted;
  _amount=ifn(type='credit',amount,-amount);
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
   else h.add();
  end;
 end;
 do _n_=1 to _n_;
  set tran;
  _amount=ifn(type='credit',amount,-amount);
  if h.check()=0 then do;
   output;
   rc=h.removedup();
  end;
 end;
 h.clear();
 drop rc _:;
run;
proc print noobs;run;
yliu1234
Obsidian | Level 7

@novinosrin you code works very well in my sample, however, my account_id is very long, it has 32 digit. like this abc12345678901234567890123456789. I applied your logic to my real data, it gives me a empty table as result.

novinosrin
Tourmaline | Level 20

Hi @yliu1234  I really am not sure if the long account_id has anything to do with it unless of course if your sample is not good representative of the real. The logic essentially is still the same with just minor tweaks. Hmm, it may be likely that your amount decimal value perhaps is the causing the problem and may need rounding?You could however round the value to the nearest 100th with round function like new_amount=round(amount,.01).

 

Also, try to debug using just one account_id and see what works and what isn't working. I would like to work on a simpler multistep logic but I am a little way of what's in your data at this point. Is it something slipping through the cracks in the communication? 

yliu1234
Obsidian | Level 7

Hi @novinosrin the code works good now! The error previously was due to failure of changing 'type' to the real column name .

 

I have studied some Dash document to help me understand your code.

 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id notsorted;
  _amount=ifn(type='credit',amount,-amount);
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
   else h.add();

Then I ran this part only to understand your  logic.

if you find a credit, you add it to hash table,

if you find a debit, you check whether there is a credit or debit already exist, if exist, delete all previous ones with removedup(), if not, add it here? However, the result I get is like this, it doesn't follow my logic.

account_idamounttypetrans_seq_amountrc
000123ab104.42debit2-104.420
001100debit32-1000
00230credit130 
00350credit150 
00410credit210 
00510debit3-100

 

 

do _n_=1 to _n_;
  set tran;
  _amount=ifn(type='credit',amount,-amount);
  if h.check()=0 then do;
   output;
   rc=h.removedup();

the second part of your code, you checked the tran table again, but without groupby account_id? if one _amount(credit positive, debit negative) exist already in the previous hash table, then output -- write it to want table? then remove previous record with the same amount? Since it is not grouped by account_id, which means the same amount of different user will works here too.

 

Could you correct me where did I read wrong? since your code works very well.

 

 by the way I add this before your code, to make sure that credit always shows first than debit.

PROC SORT data=tran;
    BY account_id amount type;
RUN; 

 

novinosrin
Tourmaline | Level 20

Hi @yliu1234  Thank you for the responses and I'm very glad you are being receptive. Basically your understanding of the idea behind the logic is correct.  

The initial processing below,

 

 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id;

 

keeps a count of the the number of records for the by group processed for an account_id and the records are read/saved in a memory area called page buffer. Therefore  the need for another by account_id isn't required when we re-read the observations from the same group. This is where we move to recollecting the concepts of general "computer science programming" so to speak where how an I/O(Input/output) occurs and measured. During the 1st read, of course the dataset have to be opened, By group(first and last obs) markers have to be set by compiler and then observations are read into the page buffer, and from there into the program data vector and eventually to output buffer and dataset. 

 

Once the 1st processing is complete, the count value of _n_ has already determined how many records have been read into page buffer(cache) and so all we need to do is re-read the same number of records that exist in the page buffer that is holding it. At the end of the datastep iteration, the contents of the page buffer memory is released and waits for the next set of data to be read.

 

Therefore, in a nutshell, we read the contents of the page buffer twice or in other words to say the equivalent we process a By group twice (smiles)  with increased level of efficiency in avoiding the compiler to do its work twice. 

Example:

do _n_=1 to _n_;
  set tran;

 

Logic:

Part A- When processing by group for the 1st time

1. Impute a temp variable as (-ve) for debit and (+ve) for positive. This makes it easy to have the values in the form of binary values(+ vs -). 

2.Check for pairwise look up when processing -ve subsequent to adding +ve first in the hash table.

3. When 2 is true, eliminate each pair one by one.

4. The remaining contents are the singles that will not have its partner pair. 

Part B- When processing by group for the 2nd time

1. Check for the remaining singles in the hash table and write that to the resulting table.

2. If 1 is true, make sure the matching single is cleared from the Hash table

3. Repeat the same for the entire by group processing

and we are done 🙂😊

 

I hope the above helps? Please let me know. Have fun!

 

yliu1234
Obsidian | Level 7
@novinosrin I have fully understood your code now! looks like I can't run the part1 only, it will give me a false result which doesn't interpret step 1 correctly.

What you did is a great help to me and my project! Thank you so much! Wish you have a great day!
novinosrin
Tourmaline | Level 20

@yliu1234  Yes Part A is preliminary step that we play with in-memory processing within the contents of Hash table. So the communication here nor the processing has anything to do with writing results to an output dataset. It's merely a step that creates interdependence for Part B, which writes to the result dataset. 

PGStats
Opal | Level 21

A low-tech solution would be:

 

roc sort data=tran; 
by account_id amount type trans_seq; 
run;

data temp;
set tran; 
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;

data want;
merge temp(where=(type="credit") in=c) 
      temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;
PG
yliu1234
Obsidian | Level 7
Hi @PGStats, thanks for providing your simple solution, I like your code very much and it works very good!

I am new to merge function. I assume, when you merge two tables, it use 'by account_id amount seq' these columns as merge key, all other common columns will use the second table's value if the key is available and both tables are contributors, so both c and d are true.
Did I understand merge function right?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1238 views
  • 6 likes
  • 3 in conversation