DATA Step, Macro, Functions and more

Flagging Duplicates across multiple Rows

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Flagging Duplicates across multiple Rows

Hi

 

I am trying to loop through a file and flag all accounts that meet a specific criteria.

I want to to set the flag value to 1 on records that have same customer numbers, account numbers, date and opposite amounts.

 

What I want to do is:

  1. Check the number of records that have the same customer number and date  and assign value k (I have this part working)
  2. For each cluster of customer number and date flag accounts with the opposite amount values
  3. Once it has completed a cluster it should move onto the next set of records
  4. There is no point in comparing across clusters

 

Data and result I am aiming for: (create new variable flag)

 

Customer NumberAccount NumberDateAmountkFlag 
C001Acc0012015/06/0355010 
C001Acc0012015/06/08-19971Clustering
C001Acc0022015/06/0819971
C001Acc0012015/06/08385070
C001Acc0032015/06/08-385070
C001Acc0012015/06/08-15776.7571
C001Acc0012015/06/08999.2370
C001Acc0032015/06/0815776.7571
C001Acc0012015/06/26-1000010 
C001Acc0012015/06/29440020Clustering
C001Acc0012015/06/29-850020
C001Acc0012015/06/30-150021
C001Acc0022015/06/30150021
C002Various Account NumberVarious Datesxxxxxx 
C003Various Account NumberVarious Datesxxxxxx 

 

I have built the following code logic:

 

/* Macro to SCAN through DATALOG */									
%MACRO SCANLOOP(SCANFILE,FIELD1,FIELD2, FIELD3, FIELD4, FIELD5);									
/* First obtain the number of records in DATALOG */									
									
DATA _NULL_;									
	IF 0 THEN SET &SCANFILE NOBS=X;								
	CALL SYMPUTX('RECCOUNT',X);								
	%put &RECCOUNT.;								
STOP;									
RUN;									
									
/* loop from one to number of records */									
/*%DO I=1 %TO &RECCOUNT;*/									
%DO I=1 %TO 100;									
									
DATA _NULL_;									
	/* Advance to the Ith record */								
	SET &SCANFILE (FIRSTOBS=&I);								
	/* store the variables of interest in */								
	/* macro variables */								
/*	Symput is for characters*/								
/*	symputx is for numeric to character conversion with removal*/								
/*	of leading and trailing spaces*/								
	CALL SYMPUT('VAR1',&FIELD1);								
	CALL SYMPUTX('VAR2',&FIELD2);								
	CALL SYMPUTX('VAR3',&FIELD3);								
	CALL SYMPUTX('VAR4',&FIELD4);								
	CALL SYMPUTX('VAR5',&FIELD5);								
	CALL SYMPUTX('VAR6',k);								
									
									
	%put &VAR1. &VAR2. &VAR3. &VAR4. &VAR5. &VAR6.;								
									
									
STOP;									
RUN;									
									
	/* now perform the tasks that */								
	/* wish repeated for each */								
	/* observation */								
									
/*SETUP a blank file */									
/*DATA RECON5;*/									
/*	if &i=0 then SET recon4 (OBS=0);*/								
/*run;*/									
									
									
DATA recon5;									
	SET recon5 recon4;								
									
	IF Narrative ^= 'DAILY BALANCE'								
		AND customer_number = symget('VAR2')							
			AND date = symget('VAR3')						
				AND account_number NOT = symget('VAR4')					
					AND amount = -(symget('VAR5')) THEN DO;				
						flag=1;			
						output recon5; 			
					END;				
						ELSE flag=0;			
									
									
									
/*input(original_variable, informat.)*/									
									
									
%END;									
%MEND SCANLOOP;									
/* Call SCANLOOP */									
/*%SCANLOOP(DATALOG,FILENM,DESC);*/									
%SCANLOOP(recon4,narrative,customer_number,date,account_number,amount);									
RUN;									
  1.  The current issues with the code is that it does not write the results to a single consolidated file
  2. It has no idea on how to use k
    • K is counts the number of records belonging to a customer with the same date
    • Ideally the logic should use this and help reduce time to process records, by jumping across records

 


Accepted Solutions
Solution
‎12-13-2016 07:02 AM
Super User
Posts: 10,028

Re: Flagging Duplicates across multiple Rows

That would be very complicated.

You also need to consider whether the positive number is greater than negative number .

 

data have;
infile cards expandtabs truncover;
input CustomerNumber $	AccountNumber $ Date : yymmdd10. Amount	k;
format date yymmdd10.;
cards;
C001	Acc001	2015/06/03	550	1	0	 	0
C001	Acc001	2015/06/08	-199	7	1	Payment to acc002	1
C001	Acc002	2015/06/08	199	7	0	McDonalds	1
C001	Acc002	2015/06/08	199	7	1	Receipt acc002	1
C001	Acc002	2015/06/08	50	7	0	Toll Charge	1
C001	Acc002	2015/06/08	50	7	0	Toll Charge	1
C001	Acc001	2015/06/08	3850	7	1	 	1
C001	Acc003	2015/06/08	-3850	7	1	 	1
C001	Acc001	2015/06/08	-15776.75	7	1	 	1
C001	Acc001	2015/06/08	999.23	7	0	 	0
C001	Acc003	2015/06/08	15776.75	7	1	 	1
C001	Acc001	2015/06/26	-10000	1	0	 	0
C001	Acc001	2015/06/29	4400	2	0	 	0
C001	Acc001	2015/06/29	-8500	2	0	 	0
C001	Acc001	2015/06/30	-1500	2	1	 	1
C001	Acc002	2015/06/30	1500	2	1
;
run;
data temp;
 set have;
 by k notsorted;
 group+first.k;
 abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data positive negative(rename=(CustomerNumber=_c
AccountNumber=_ac Date=_d Amount=_am k=_k));
 set temp;
 if amount gt 0 then output positive;
  else output negative;
run;
data x;
 ina=0;inb=0;
 merge positive(in=ina) negative(in=inb);
 by group abs;
 one=ina;two=inb;
 output;
 call missing(of _all_);
run;

data want;
 set x;
 if one and not two then do;flag=0;output;end;
  else if not one and two then do;
   CustomerNumber=_c;
   AccountNumber=_ac;
   Date=_d;
   Amount=_am; 
   k=_k;
   flag=0;output;
  end;
  else do;
   flag=1;output;
   CustomerNumber=_c;
   AccountNumber=_ac;
   Date=_d;
   Amount=_am; 
   k=_k;
   output;
  end;
drop  group abs one two _:;
run;

View solution in original post


All Replies
Super User
Posts: 19,815

Re: Flagging Duplicates across multiple Rows

[ Edited ]

Shouldn't records 4/5 be flagged as well? 

 

Edit: How big is the data you're working with, number of rows? 

 

Second, have you fully explored the variables and data available. Typically when a record os reversed or a return is processed there some ID that's unique to those records to allow for matching, besides the amount. Sometimes it's a transaction ID or just an indicator on the field status. 

 

You could consider reversing the sign on the returns but adding a field that identifies returns, a return flag. Then you could leverage BY group logic to identify you returned purchases since the amounts are the same, but they'll have different return flags. 

Occasional Contributor
Posts: 6

Re: Flagging Duplicates across multiple Rows

Hi Reeza

 

Thanks for the feedback. Highly appreciate it.

 

You are correct;  records 4/5 should be flagged as well; indeed an error on my end.

 

The data is large ~ 500 million records spanning over two years.

 

I have also looked at the possibility of using other variables however  I could not identify any. I also chatted to our IT teams around how the data is being generated in the core systems and I have not been able to identify any possible work around using the existing variables.

 

Thanks

Super User
Posts: 10,028

Re: Flagging Duplicates across multiple Rows

you need to watch out the decimal part, maybe consider to use function ROUND().

 

 

data have;
infile cards expandtabs truncover;
input CustomerNumber $	AccountNumber $ Date : yymmdd10. Amount	k;
format date yymmdd10.;
cards; 
C001	Acc001	2015/06/03	550	1	0	 
C001	Acc001	2015/06/08	-199	7	1	Clustering
C001	Acc002	2015/06/08	199	7	1
C001	Acc001	2015/06/08	3850	7	0
C001	Acc003	2015/06/08	-3850	7	0
C001	Acc001	2015/06/08	-15776.75	7	1
C001	Acc001	2015/06/08	999.23	7	0
C001	Acc003	2015/06/08	15776.75	7	1
C001	Acc001	2015/06/26	-10000	1	0	 
C001	Acc001	2015/06/29	4400	2	0	Clustering
C001	Acc001	2015/06/29	-8500	2	0
C001	Acc001	2015/06/30	-1500	2	1
C001	Acc002	2015/06/30	1500	2
;
run;
data temp;
 set have;
 by k notsorted;
 group+first.k;
 abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data want;
 set temp;
 by group abs notsorted;
 if first.abs and last.abs then flag=0;
  else flag=1;
drop group abs;
run;
Occasional Contributor
Posts: 6

Re: Flagging Duplicates across multiple Rows

Hi KSharp

 

Thanks for the nifty code. 

Works really great on the sample I provided Cat Very Happy

 

I have applied it against the full data set and noticed the following:

 

  1. On transcations that are occuring but not "transfers" they are being flagged as well

I have provided an updated record set to illustrate this specific issue. 

Do you have any suggestions?

 

Customer NumberAccount NumberDateAmountkFlag_I_WantNarrativeCurrent_Flag
C001Acc0012015/06/0355010 0
C001Acc0012015/06/08-19971Payment to acc0021
C001Acc0022015/06/0819970McDonalds1
C001Acc0022015/06/0819971Receipt acc0021
C001Acc0022015/06/085070Toll Charge1
C001Acc0022015/06/085070Toll Charge1
C001Acc0012015/06/08385071 1
C001Acc0032015/06/08-385071 1
C001Acc0012015/06/08-15776.7571 1
C001Acc0012015/06/08999.2370 0
C001Acc0032015/06/0815776.7571 1
C001Acc0012015/06/26-1000010 0
C001Acc0012015/06/29440020 0
C001Acc0012015/06/29-850020 0
C001Acc0012015/06/30-150021 1
C001Acc0022015/06/30150021 1
C002Various Account NumberVarious Datesxxxxxx  
C003Various Account NumberVarious Datesxxxxxx  

 

Thanks

Occasional Contributor
Posts: 6

Re: Flagging Duplicates across multiple Rows

[ Edited ]

Just to add another point the payment and receipt narratives are not generic; I have simplified it for illustration purposes.

Occasional Contributor
Posts: 6

Re: Flagging Duplicates across multiple Rows

I figured that if I just join the table to itself I can get the desired results as well.

Solution
‎12-13-2016 07:02 AM
Super User
Posts: 10,028

Re: Flagging Duplicates across multiple Rows

That would be very complicated.

You also need to consider whether the positive number is greater than negative number .

 

data have;
infile cards expandtabs truncover;
input CustomerNumber $	AccountNumber $ Date : yymmdd10. Amount	k;
format date yymmdd10.;
cards;
C001	Acc001	2015/06/03	550	1	0	 	0
C001	Acc001	2015/06/08	-199	7	1	Payment to acc002	1
C001	Acc002	2015/06/08	199	7	0	McDonalds	1
C001	Acc002	2015/06/08	199	7	1	Receipt acc002	1
C001	Acc002	2015/06/08	50	7	0	Toll Charge	1
C001	Acc002	2015/06/08	50	7	0	Toll Charge	1
C001	Acc001	2015/06/08	3850	7	1	 	1
C001	Acc003	2015/06/08	-3850	7	1	 	1
C001	Acc001	2015/06/08	-15776.75	7	1	 	1
C001	Acc001	2015/06/08	999.23	7	0	 	0
C001	Acc003	2015/06/08	15776.75	7	1	 	1
C001	Acc001	2015/06/26	-10000	1	0	 	0
C001	Acc001	2015/06/29	4400	2	0	 	0
C001	Acc001	2015/06/29	-8500	2	0	 	0
C001	Acc001	2015/06/30	-1500	2	1	 	1
C001	Acc002	2015/06/30	1500	2	1
;
run;
data temp;
 set have;
 by k notsorted;
 group+first.k;
 abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data positive negative(rename=(CustomerNumber=_c
AccountNumber=_ac Date=_d Amount=_am k=_k));
 set temp;
 if amount gt 0 then output positive;
  else output negative;
run;
data x;
 ina=0;inb=0;
 merge positive(in=ina) negative(in=inb);
 by group abs;
 one=ina;two=inb;
 output;
 call missing(of _all_);
run;

data want;
 set x;
 if one and not two then do;flag=0;output;end;
  else if not one and two then do;
   CustomerNumber=_c;
   AccountNumber=_ac;
   Date=_d;
   Amount=_am; 
   k=_k;
   flag=0;output;
  end;
  else do;
   flag=1;output;
   CustomerNumber=_c;
   AccountNumber=_ac;
   Date=_d;
   Amount=_am; 
   k=_k;
   output;
  end;
drop  group abs one two _:;
run;
Occasional Contributor
Posts: 6

Re: Flagging Duplicates across multiple Rows

Thank again

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 266 views
  • 2 likes
  • 3 in conversation