I have a transaction table, for the same account and same amount, if there are one debit and one credit transactions, then delete both. Each debit transaction should cancel out one equal amount credit transaction of the same account, the sequence doesn't matter.
account_id | amount | type | trans_seq |
001 | 100 | credit | 1 |
001 | 100 | credit | 2 |
001 | 100 | debit | 1 |
001 | 100 | debit | 32 |
001 | 50 | credit | 3 |
002 | 30 | credit | 1 |
003 | 50 | credit | 1 |
003 | 20 | debit | 1 |
000123abc | 104.42 | credit | 2 |
000123abc | 104.42 | credit | 2 |
000123abc | 104.42 | debit | 2 |
004 | 10 | credit | 1 |
004 | 10 | credit | 2 |
005 | 10 | debit | 1 |
005 | 10 | debit | 2 |
005 | 10 | debit | 3 |
data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1
000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2
004 10 credit 1
004 10 credit 2
005 10 debit 1
005 10 debit 2
005 10 debit 3;
run;
the table I want
account_id | amount | type | trans_seq |
001 | 50 | credit | 3 |
002 | 30 | credit | 1 |
003 | 50 | credit | 1 |
003 | 20 | debit | 1 |
000123abc | 104.42 | credit | 2 |
004 | 10 | credit | 1 |
004 | 10 | credit | 2 |
005 | 10 | debit | 1 |
005 | 10 | debit | 2 |
005 | 10 | debit | 3 |
A low-tech solution would be:
roc sort data=tran;
by account_id amount type trans_seq;
run;
data temp;
set tran;
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;
data want;
merge temp(where=(type="credit") in=c)
temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;
data have;
input account_id amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1
;
data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
h.definekey ("amount") ;
h.definedone () ;
end;
do _n_=1 by 1 until(last.account_id);
set have;
by account_id;
if type="credit" then h.add();
else if type="debit" then do;
if h.check()=0 then rc=h.removedup();
else h.add();
end;
end;
do _n_=1 to _n_;
set have;
if h.check()= 0 then output;
end;
h.clear();
drop rc;
run;
proc print noobs;run;
account_id | amount | type | trans_seq |
---|---|---|---|
1 | 50 | credit | 3 |
2 | 30 | credit | 1 |
3 | 50 | credit | 1 |
3 | 20 | debit | 1 |
Hi @yliu1234 It's a simple look up exercise to begin with. The logic construct deals with pairwise look-up. A pair is defined as a combination of credit, debit for a particular amount. You can imagine this as one group. The parent group is your account_id within which each pair of equal credit and debit amount is constituted as child group. In other words, you have something like sets of
Group by Accountid, Amount, (credit & debit) =>sets of two
You create a memory resident(Hash) look up table and retain the records that do not conform to the defined pairing logic i.e. by excluding the pairs. You write the retained records to the resulting dataset and clear the contents of Hash memory once the processing for each by group is complete. The same is repeated for each Accountid and we are done!
my account_id is saved as character, I add three rows here, it will not work. Should I add a number index for my account_id? I wonder how to do this?
Okay, It's not about character or numeric that caused the problem. It's about the duplicate credit of 104.42. No worries, I am a little busy now. I will respond to you by the evening.
Also, @yliu1234 I have understood that one scenario. However, Can you please post a comprehensive sample of all possible scenarios of transaction pattern with the expected output?
Hi @yliu1234 Please try the modified code below and let me know. I believe this should do it.
data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1
000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2
;
RUN;
data want;
if _n_=1 then do;
dcl hash H (multidata:'y') ;
h.definekey ("_amount") ;
h.definedone () ;
end;
do _n_=1 by 1 until(last.account_id);
set tran;
by account_id notsorted;
_amount=ifn(type='credit',amount,-amount);
if type="credit" then h.add();
else if type="debit" then do;
if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
else h.add();
end;
end;
do _n_=1 to _n_;
set tran;
_amount=ifn(type='credit',amount,-amount);
if h.check()=0 then do;
output;
rc=h.removedup();
end;
end;
h.clear();
drop rc _:;
run;
proc print noobs;run;
@novinosrin you code works very well in my sample, however, my account_id is very long, it has 32 digit. like this abc12345678901234567890123456789. I applied your logic to my real data, it gives me a empty table as result.
Hi @yliu1234 I really am not sure if the long account_id has anything to do with it unless of course if your sample is not good representative of the real. The logic essentially is still the same with just minor tweaks. Hmm, it may be likely that your amount decimal value perhaps is the causing the problem and may need rounding?You could however round the value to the nearest 100th with round function like new_amount=round(amount,.01).
Also, try to debug using just one account_id and see what works and what isn't working. I would like to work on a simpler multistep logic but I am a little way of what's in your data at this point. Is it something slipping through the cracks in the communication?
Hi @novinosrin the code works good now! The error previously was due to failure of changing 'type' to the real column name .
I have studied some Dash document to help me understand your code.
do _n_=1 by 1 until(last.account_id);
set tran;
by account_id notsorted;
_amount=ifn(type='credit',amount,-amount);
if type="credit" then h.add();
else if type="debit" then do;
if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
else h.add();
Then I ran this part only to understand your logic.
if you find a credit, you add it to hash table,
if you find a debit, you check whether there is a credit or debit already exist, if exist, delete all previous ones with removedup(), if not, add it here? However, the result I get is like this, it doesn't follow my logic.
account_id | amount | type | trans_seq | _amount | rc |
000123ab | 104.42 | debit | 2 | -104.42 | 0 |
001 | 100 | debit | 32 | -100 | 0 |
002 | 30 | credit | 1 | 30 | |
003 | 50 | credit | 1 | 50 | |
004 | 10 | credit | 2 | 10 | |
005 | 10 | debit | 3 | -10 | 0 |
do _n_=1 to _n_;
set tran;
_amount=ifn(type='credit',amount,-amount);
if h.check()=0 then do;
output;
rc=h.removedup();
the second part of your code, you checked the tran table again, but without groupby account_id? if one _amount(credit positive, debit negative) exist already in the previous hash table, then output -- write it to want table? then remove previous record with the same amount? Since it is not grouped by account_id, which means the same amount of different user will works here too.
Could you correct me where did I read wrong? since your code works very well.
by the way I add this before your code, to make sure that credit always shows first than debit.
PROC SORT data=tran;
BY account_id amount type;
RUN;
Hi @yliu1234 Thank you for the responses and I'm very glad you are being receptive. Basically your understanding of the idea behind the logic is correct.
The initial processing below,
do _n_=1 by 1 until(last.account_id);
set tran;
by account_id;
keeps a count of the the number of records for the by group processed for an account_id and the records are read/saved in a memory area called page buffer. Therefore the need for another by account_id isn't required when we re-read the observations from the same group. This is where we move to recollecting the concepts of general "computer science programming" so to speak where how an I/O(Input/output) occurs and measured. During the 1st read, of course the dataset have to be opened, By group(first and last obs) markers have to be set by compiler and then observations are read into the page buffer, and from there into the program data vector and eventually to output buffer and dataset.
Once the 1st processing is complete, the count value of _n_ has already determined how many records have been read into page buffer(cache) and so all we need to do is re-read the same number of records that exist in the page buffer that is holding it. At the end of the datastep iteration, the contents of the page buffer memory is released and waits for the next set of data to be read.
Therefore, in a nutshell, we read the contents of the page buffer twice or in other words to say the equivalent we process a By group twice (smiles) with increased level of efficiency in avoiding the compiler to do its work twice.
Example:
do _n_=1 to _n_;
set tran;
Logic:
Part A- When processing by group for the 1st time
1. Impute a temp variable as (-ve) for debit and (+ve) for positive. This makes it easy to have the values in the form of binary values(+ vs -).
2.Check for pairwise look up when processing -ve subsequent to adding +ve first in the hash table.
3. When 2 is true, eliminate each pair one by one.
4. The remaining contents are the singles that will not have its partner pair.
Part B- When processing by group for the 2nd time
1. Check for the remaining singles in the hash table and write that to the resulting table.
2. If 1 is true, make sure the matching single is cleared from the Hash table
3. Repeat the same for the entire by group processing
and we are done 🙂😊
I hope the above helps? Please let me know. Have fun!
@yliu1234 Yes Part A is preliminary step that we play with in-memory processing within the contents of Hash table. So the communication here nor the processing has anything to do with writing results to an output dataset. It's merely a step that creates interdependence for Part B, which writes to the result dataset.
A low-tech solution would be:
roc sort data=tran;
by account_id amount type trans_seq;
run;
data temp;
set tran;
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;
data want;
merge temp(where=(type="credit") in=c)
temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.