Hi
I am trying to loop through a file and flag all accounts that meet a specific criteria.
I want to to set the flag value to 1 on records that have same customer numbers, account numbers, date and opposite amounts.
What I want to do is:
Data and result I am aiming for: (create new variable flag)
Customer Number | Account Number | Date | Amount | k | Flag | |
C001 | Acc001 | 2015/06/03 | 550 | 1 | 0 | |
C001 | Acc001 | 2015/06/08 | -199 | 7 | 1 | Clustering |
C001 | Acc002 | 2015/06/08 | 199 | 7 | 1 | |
C001 | Acc001 | 2015/06/08 | 3850 | 7 | 0 | |
C001 | Acc003 | 2015/06/08 | -3850 | 7 | 0 | |
C001 | Acc001 | 2015/06/08 | -15776.75 | 7 | 1 | |
C001 | Acc001 | 2015/06/08 | 999.23 | 7 | 0 | |
C001 | Acc003 | 2015/06/08 | 15776.75 | 7 | 1 | |
C001 | Acc001 | 2015/06/26 | -10000 | 1 | 0 | |
C001 | Acc001 | 2015/06/29 | 4400 | 2 | 0 | Clustering |
C001 | Acc001 | 2015/06/29 | -8500 | 2 | 0 | |
C001 | Acc001 | 2015/06/30 | -1500 | 2 | 1 | |
C001 | Acc002 | 2015/06/30 | 1500 | 2 | 1 | |
C002 | Various Account Number | Various Dates | xxx | x | xx | |
C003 | Various Account Number | Various Dates | xxx | x | xx |
I have built the following code logic:
/* Macro to SCAN through DATALOG */ %MACRO SCANLOOP(SCANFILE,FIELD1,FIELD2, FIELD3, FIELD4, FIELD5); /* First obtain the number of records in DATALOG */ DATA _NULL_; IF 0 THEN SET &SCANFILE NOBS=X; CALL SYMPUTX('RECCOUNT',X); %put &RECCOUNT.; STOP; RUN; /* loop from one to number of records */ /*%DO I=1 %TO &RECCOUNT;*/ %DO I=1 %TO 100; DATA _NULL_; /* Advance to the Ith record */ SET &SCANFILE (FIRSTOBS=&I); /* store the variables of interest in */ /* macro variables */ /* Symput is for characters*/ /* symputx is for numeric to character conversion with removal*/ /* of leading and trailing spaces*/ CALL SYMPUT('VAR1',&FIELD1); CALL SYMPUTX('VAR2',&FIELD2); CALL SYMPUTX('VAR3',&FIELD3); CALL SYMPUTX('VAR4',&FIELD4); CALL SYMPUTX('VAR5',&FIELD5); CALL SYMPUTX('VAR6',k); %put &VAR1. &VAR2. &VAR3. &VAR4. &VAR5. &VAR6.; STOP; RUN; /* now perform the tasks that */ /* wish repeated for each */ /* observation */ /*SETUP a blank file */ /*DATA RECON5;*/ /* if &i=0 then SET recon4 (OBS=0);*/ /*run;*/ DATA recon5; SET recon5 recon4; IF Narrative ^= 'DAILY BALANCE' AND customer_number = symget('VAR2') AND date = symget('VAR3') AND account_number NOT = symget('VAR4') AND amount = -(symget('VAR5')) THEN DO; flag=1; output recon5; END; ELSE flag=0; /*input(original_variable, informat.)*/ %END; %MEND SCANLOOP; /* Call SCANLOOP */ /*%SCANLOOP(DATALOG,FILENM,DESC);*/ %SCANLOOP(recon4,narrative,customer_number,date,account_number,amount); RUN;
That would be very complicated.
You also need to consider whether the positive number is greater than negative number .
data have;
infile cards expandtabs truncover;
input CustomerNumber $ AccountNumber $ Date : yymmdd10. Amount k;
format date yymmdd10.;
cards;
C001 Acc001 2015/06/03 550 1 0 0
C001 Acc001 2015/06/08 -199 7 1 Payment to acc002 1
C001 Acc002 2015/06/08 199 7 0 McDonalds 1
C001 Acc002 2015/06/08 199 7 1 Receipt acc002 1
C001 Acc002 2015/06/08 50 7 0 Toll Charge 1
C001 Acc002 2015/06/08 50 7 0 Toll Charge 1
C001 Acc001 2015/06/08 3850 7 1 1
C001 Acc003 2015/06/08 -3850 7 1 1
C001 Acc001 2015/06/08 -15776.75 7 1 1
C001 Acc001 2015/06/08 999.23 7 0 0
C001 Acc003 2015/06/08 15776.75 7 1 1
C001 Acc001 2015/06/26 -10000 1 0 0
C001 Acc001 2015/06/29 4400 2 0 0
C001 Acc001 2015/06/29 -8500 2 0 0
C001 Acc001 2015/06/30 -1500 2 1 1
C001 Acc002 2015/06/30 1500 2 1
;
run;
data temp;
set have;
by k notsorted;
group+first.k;
abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data positive negative(rename=(CustomerNumber=_c
AccountNumber=_ac Date=_d Amount=_am k=_k));
set temp;
if amount gt 0 then output positive;
else output negative;
run;
data x;
ina=0;inb=0;
merge positive(in=ina) negative(in=inb);
by group abs;
one=ina;two=inb;
output;
call missing(of _all_);
run;
data want;
set x;
if one and not two then do;flag=0;output;end;
else if not one and two then do;
CustomerNumber=_c;
AccountNumber=_ac;
Date=_d;
Amount=_am;
k=_k;
flag=0;output;
end;
else do;
flag=1;output;
CustomerNumber=_c;
AccountNumber=_ac;
Date=_d;
Amount=_am;
k=_k;
output;
end;
drop group abs one two _:;
run;
Shouldn't records 4/5 be flagged as well?
Edit: How big is the data you're working with, number of rows?
Second, have you fully explored the variables and data available. Typically when a record os reversed or a return is processed there some ID that's unique to those records to allow for matching, besides the amount. Sometimes it's a transaction ID or just an indicator on the field status.
You could consider reversing the sign on the returns but adding a field that identifies returns, a return flag. Then you could leverage BY group logic to identify you returned purchases since the amounts are the same, but they'll have different return flags.
Hi Reeza
Thanks for the feedback. Highly appreciate it.
You are correct; records 4/5 should be flagged as well; indeed an error on my end.
The data is large ~ 500 million records spanning over two years.
I have also looked at the possibility of using other variables however I could not identify any. I also chatted to our IT teams around how the data is being generated in the core systems and I have not been able to identify any possible work around using the existing variables.
Thanks
you need to watch out the decimal part, maybe consider to use function ROUND().
data have;
infile cards expandtabs truncover;
input CustomerNumber $ AccountNumber $ Date : yymmdd10. Amount k;
format date yymmdd10.;
cards;
C001 Acc001 2015/06/03 550 1 0
C001 Acc001 2015/06/08 -199 7 1 Clustering
C001 Acc002 2015/06/08 199 7 1
C001 Acc001 2015/06/08 3850 7 0
C001 Acc003 2015/06/08 -3850 7 0
C001 Acc001 2015/06/08 -15776.75 7 1
C001 Acc001 2015/06/08 999.23 7 0
C001 Acc003 2015/06/08 15776.75 7 1
C001 Acc001 2015/06/26 -10000 1 0
C001 Acc001 2015/06/29 4400 2 0 Clustering
C001 Acc001 2015/06/29 -8500 2 0
C001 Acc001 2015/06/30 -1500 2 1
C001 Acc002 2015/06/30 1500 2
;
run;
data temp;
set have;
by k notsorted;
group+first.k;
abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data want;
set temp;
by group abs notsorted;
if first.abs and last.abs then flag=0;
else flag=1;
drop group abs;
run;
Hi KSharp
Thanks for the nifty code.
Works really great on the sample I provided
I have applied it against the full data set and noticed the following:
I have provided an updated record set to illustrate this specific issue.
Do you have any suggestions?
Customer Number | Account Number | Date | Amount | k | Flag_I_Want | Narrative | Current_Flag |
C001 | Acc001 | 2015/06/03 | 550 | 1 | 0 | 0 | |
C001 | Acc001 | 2015/06/08 | -199 | 7 | 1 | Payment to acc002 | 1 |
C001 | Acc002 | 2015/06/08 | 199 | 7 | 0 | McDonalds | 1 |
C001 | Acc002 | 2015/06/08 | 199 | 7 | 1 | Receipt acc002 | 1 |
C001 | Acc002 | 2015/06/08 | 50 | 7 | 0 | Toll Charge | 1 |
C001 | Acc002 | 2015/06/08 | 50 | 7 | 0 | Toll Charge | 1 |
C001 | Acc001 | 2015/06/08 | 3850 | 7 | 1 | 1 | |
C001 | Acc003 | 2015/06/08 | -3850 | 7 | 1 | 1 | |
C001 | Acc001 | 2015/06/08 | -15776.75 | 7 | 1 | 1 | |
C001 | Acc001 | 2015/06/08 | 999.23 | 7 | 0 | 0 | |
C001 | Acc003 | 2015/06/08 | 15776.75 | 7 | 1 | 1 | |
C001 | Acc001 | 2015/06/26 | -10000 | 1 | 0 | 0 | |
C001 | Acc001 | 2015/06/29 | 4400 | 2 | 0 | 0 | |
C001 | Acc001 | 2015/06/29 | -8500 | 2 | 0 | 0 | |
C001 | Acc001 | 2015/06/30 | -1500 | 2 | 1 | 1 | |
C001 | Acc002 | 2015/06/30 | 1500 | 2 | 1 | 1 | |
C002 | Various Account Number | Various Dates | xxx | x | xx | ||
C003 | Various Account Number | Various Dates | xxx | x | xx |
Thanks
Just to add another point the payment and receipt narratives are not generic; I have simplified it for illustration purposes.
I figured that if I just join the table to itself I can get the desired results as well.
That would be very complicated.
You also need to consider whether the positive number is greater than negative number .
data have;
infile cards expandtabs truncover;
input CustomerNumber $ AccountNumber $ Date : yymmdd10. Amount k;
format date yymmdd10.;
cards;
C001 Acc001 2015/06/03 550 1 0 0
C001 Acc001 2015/06/08 -199 7 1 Payment to acc002 1
C001 Acc002 2015/06/08 199 7 0 McDonalds 1
C001 Acc002 2015/06/08 199 7 1 Receipt acc002 1
C001 Acc002 2015/06/08 50 7 0 Toll Charge 1
C001 Acc002 2015/06/08 50 7 0 Toll Charge 1
C001 Acc001 2015/06/08 3850 7 1 1
C001 Acc003 2015/06/08 -3850 7 1 1
C001 Acc001 2015/06/08 -15776.75 7 1 1
C001 Acc001 2015/06/08 999.23 7 0 0
C001 Acc003 2015/06/08 15776.75 7 1 1
C001 Acc001 2015/06/26 -10000 1 0 0
C001 Acc001 2015/06/29 4400 2 0 0
C001 Acc001 2015/06/29 -8500 2 0 0
C001 Acc001 2015/06/30 -1500 2 1 1
C001 Acc002 2015/06/30 1500 2 1
;
run;
data temp;
set have;
by k notsorted;
group+first.k;
abs=abs(amount);
run;
proc sort data=temp;by group abs;run;
data positive negative(rename=(CustomerNumber=_c
AccountNumber=_ac Date=_d Amount=_am k=_k));
set temp;
if amount gt 0 then output positive;
else output negative;
run;
data x;
ina=0;inb=0;
merge positive(in=ina) negative(in=inb);
by group abs;
one=ina;two=inb;
output;
call missing(of _all_);
run;
data want;
set x;
if one and not two then do;flag=0;output;end;
else if not one and two then do;
CustomerNumber=_c;
AccountNumber=_ac;
Date=_d;
Amount=_am;
k=_k;
flag=0;output;
end;
else do;
flag=1;output;
CustomerNumber=_c;
AccountNumber=_ac;
Date=_d;
Amount=_am;
k=_k;
output;
end;
drop group abs one two _:;
run;
Thank again
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.