Solved: Re: delete rows with two rows condition after group by

yliu1234 · Posted 05-11-2020 10:09 AM

I have a transaction table, for the same account and same amount, if there are one debit and one credit transactions, then delete both. Each debit transaction should cancel out one equal amount credit transaction of the same account, the sequence doesn't matter.

account_id	amount	type	trans_seq
001	100	credit	1
001	100	credit	2
001	100	debit	1
001	100	debit	32
001	50	credit	3
002	30	credit	1
003	50	credit	1
003	20	debit	1
000123abc	104.42	credit	2
000123abc	104.42	credit	2
000123abc	104.42	debit	2
004	10	credit	1
004	10	credit	2
005	10	debit	1
005	10	debit	2
005	10	debit	3

data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1

000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2

004 10 credit 1

004 10 credit 2

005 10 debit 1

005 10 debit 2

005 10 debit 3;
run;

the table I want

account_id	amount	type	trans_seq
001	50	credit	3
002	30	credit	1
003	50	credit	1
003	20	debit	1
000123abc	104.42	credit	2
004	10	credit	1
004	10	credit	2
005	10	debit	1
005	10	debit	2
005	10	debit	3

PGStats · Posted 05-12-2020 01:09 PM

A low-tech solution would be:

roc sort data=tran; 
by account_id amount type trans_seq; 
run;

data temp;
set tran; 
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;

data want;
merge temp(where=(type="credit") in=c) 
      temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;

PG

View solution in original post

novinosrin · Posted 05-11-2020 10:54 AM

data have;
input account_id	amount	type $	trans_seq;
cards;
001	100	credit	1
001	100	credit	2
001	100	debit	1
001	100	debit	32
001	50	credit	3
002	30	credit	1
003	50	credit	1
003	20	debit	1
;

data want;
 if _n_=1 then do;
  dcl hash H (multidata:'y') ;
  h.definekey  ("amount") ;
  h.definedone () ;
 end;
 do _n_=1 by 1 until(last.account_id);
  set have;
  by account_id;
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check()=0 then rc=h.removedup();
   else h.add();
  end;
 end;
 do _n_=1 to _n_;
  set have;
  if h.check()= 0 then output;
 end;
 h.clear();
 drop rc;
run;

proc print noobs;run;

account_id	amount	type	trans_seq
1	50	credit	3
2	30	credit	1
3	50	credit	1
3	20	debit	1

yliu1234 · Posted 05-11-2020 11:17 AM

novinosrin, thank you very much. I am very new to SAS, could you add a little bit explanation of your code or your logic?

novinosrin · Posted 05-11-2020 11:29 AM

Hi @yliu1234 It's a simple look up exercise to begin with. The logic construct deals with pairwise look-up. A pair is defined as a combination of credit, debit for a particular amount. You can imagine this as one group. The parent group is your account_id within which each pair of equal credit and debit amount is constituted as child group. In other words, you have something like sets of

Group by Accountid, Amount, (credit & debit)  =>sets of two

You create a memory resident(Hash) look up table and retain the records that do not conform to the defined pairing logic i.e. by excluding the pairs. You write the retained records to the resulting dataset and clear the contents of Hash memory once the processing for each by group is complete. The same is repeated for each Accountid and we are done!

yliu1234 · Posted 05-11-2020 01:10 PM

my account_id is saved as character, I add three rows here, it will not work. Should I add a number index for my account_id? I wonder how to do this?

novinosrin · Posted 05-11-2020 01:32 PM

Okay, It's not about character or numeric that caused the problem. It's about the duplicate credit of 104.42. No worries, I am a little busy now. I will respond to you by the evening.

Also, @yliu1234 I have understood that one scenario. However, Can you please post a comprehensive sample of all possible scenarios of transaction pattern with the expected output?

novinosrin · Posted 05-11-2020 02:18 PM

Hi @yliu1234 Please try the modified code below and let me know. I believe this should do it.

data tran;
input account_id $ amount type $ trans_seq;
cards;
001 100 credit 1
001 100 credit 2
001 100 debit 1
001 100 debit 32
001 50 credit 3
002 30 credit 1
003 50 credit 1
003 20 debit 1
000123abc 104.42 credit 2
000123abc 104.42 credit 14
000123abc 104.42 debit 2
;
RUN;

data want;
 if _n_=1 then do;
   dcl hash H (multidata:'y') ;
   h.definekey  ("_amount") ;
   h.definedone () ;
 end;
 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id notsorted;
  _amount=ifn(type='credit',amount,-amount);
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
   else h.add();
  end;
 end;
 do _n_=1 to _n_;
  set tran;
  _amount=ifn(type='credit',amount,-amount);
  if h.check()=0 then do;
   output;
   rc=h.removedup();
  end;
 end;
 h.clear();
 drop rc _:;
run;
proc print noobs;run;

yliu1234 · Posted 05-11-2020 04:17 PM

@novinosrin you code works very well in my sample, however, my account_id is very long, it has 32 digit. like this abc12345678901234567890123456789. I applied your logic to my real data, it gives me a empty table as result.

novinosrin · Posted 05-11-2020 04:27 PM

Hi @yliu1234 I really am not sure if the long account_id has anything to do with it unless of course if your sample is not good representative of the real. The logic essentially is still the same with just minor tweaks. Hmm, it may be likely that your amount decimal value perhaps is the causing the problem and may need rounding?You could however round the value to the nearest 100th with round function like new_amount=round(amount,.01).

Also, try to debug using just one account_id and see what works and what isn't working. I would like to work on a simpler multistep logic but I am a little way of what's in your data at this point. Is it something slipping through the cracks in the communication?

yliu1234 · Posted 05-12-2020 11:25 AM

Hi @novinosrin the code works good now! The error previously was due to failure of changing 'type' to the real column name .

I have studied some Dash document to help me understand your code.

 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id notsorted;
  _amount=ifn(type='credit',amount,-amount);
  if type="credit" then h.add();
  else if type="debit" then do;
   if h.check(key:abs(_amount))=0 then rc=h.removedup(key:abs(_amount));
   else h.add();

Then I ran this part only to understand your logic.

if you find a credit, you add it to hash table,

if you find a debit, you check whether there is a credit or debit already exist, if exist, delete all previous ones with removedup(), if not, add it here? However, the result I get is like this, it doesn't follow my logic.

account_id	amount	type	trans_seq	_amount	rc
000123ab	104.42	debit	2	-104.42	0
001	100	debit	32	-100	0
002	30	credit	1	30
003	50	credit	1	50
004	10	credit	2	10
005	10	debit	3	-10	0

do _n_=1 to _n_;
  set tran;
  _amount=ifn(type='credit',amount,-amount);
  if h.check()=0 then do;
   output;
   rc=h.removedup();

the second part of your code, you checked the tran table again, but without groupby account_id? if one _amount(credit positive, debit negative) exist already in the previous hash table, then output -- write it to want table? then remove previous record with the same amount? Since it is not grouped by account_id, which means the same amount of different user will works here too.

Could you correct me where did I read wrong? since your code works very well.

by the way I add this before your code, to make sure that credit always shows first than debit.

PROC SORT data=tran;
    BY account_id amount type;
RUN;

novinosrin · Posted 05-12-2020 12:01 PM

Hi @yliu1234 Thank you for the responses and I'm very glad you are being receptive. Basically your understanding of the idea behind the logic is correct.

The initial processing below,

 do _n_=1 by 1 until(last.account_id);
  set tran;
  by account_id;

keeps a count of the the number of records for the by group processed for an account_id and the records are read/saved in a memory area called page buffer. Therefore the need for another by account_id isn't required when we re-read the observations from the same group. This is where we move to recollecting the concepts of general "computer science programming" so to speak where how an I/O(Input/output) occurs and measured. During the 1st read, of course the dataset have to be opened, By group(first and last obs) markers have to be set by compiler and then observations are read into the page buffer, and from there into the program data vector and eventually to output buffer and dataset.

Once the 1st processing is complete, the count value of _n_ has already determined how many records have been read into page buffer(cache) and so all we need to do is re-read the same number of records that exist in the page buffer that is holding it. At the end of the datastep iteration, the contents of the page buffer memory is released and waits for the next set of data to be read.

Therefore, in a nutshell, we read the contents of the page buffer twice or in other words to say the equivalent we process a By group twice (smiles) with increased level of efficiency in avoiding the compiler to do its work twice.

Example:

do _n_=1 to _n_;
  set tran;

Logic:

Part A- When processing by group for the 1st time

1. Impute a temp variable as (-ve) for debit and (+ve) for positive. This makes it easy to have the values in the form of binary values(+ vs -).

2.Check for pairwise look up when processing -ve subsequent to adding +ve first in the hash table.

3. When 2 is true, eliminate each pair one by one.

4. The remaining contents are the singles that will not have its partner pair.

Part B- When processing by group for the 2nd time

1. Check for the remaining singles in the hash table and write that to the resulting table.

2. If 1 is true, make sure the matching single is cleared from the Hash table

3. Repeat the same for the entire by group processing

and we are done 🙂😊

I hope the above helps? Please let me know. Have fun!

yliu1234 · Posted 05-12-2020 02:23 PM

@novinosrin I have fully understood your code now! looks like I can't run the part1 only, it will give me a false result which doesn't interpret step 1 correctly.

What you did is a great help to me and my project! Thank you so much! Wish you have a great day!

novinosrin · Posted 05-12-2020 02:36 PM

@yliu1234 Yes Part A is preliminary step that we play with in-memory processing within the contents of Hash table. So the communication here nor the processing has anything to do with writing results to an output dataset. It's merely a step that creates interdependence for Part B, which writes to the result dataset.

PGStats · Posted 05-12-2020 01:09 PM

A low-tech solution would be:

roc sort data=tran; 
by account_id amount type trans_seq; 
run;

data temp;
set tran; 
by account_id amount type;
if first.type then seq = 0;
seq + 1;
run;

data want;
merge temp(where=(type="credit") in=c) 
      temp(where=(type="debit") in=d);
by account_id amount seq;
if not (c and d);
drop seq;
run;

PG

yliu1234 · Posted 05-15-2020 05:18 PM

Hi @PGStats, thanks for providing your simple solution, I like your code very much and it works very good!

I am new to merge function. I assume, when you merge two tables, it use 'by account_id amount seq' these columns as merge key, all other common columns will use the second table's value if the key is available and both tables are contributors, so both c and d are true.
Did I understand merge function right?

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!