I'm trying to group by CID QTR OPENDATE, and flag records as duplicates if the balances are near the same (say within 15% of each other). So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low.
data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;
I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.
So if you play with lag using
by CID QTR OPENDATE;
with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it
data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;
With this
"So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."
I am wondering why have you left C? I.e
1100-1000/1000= 10%
?
Not sure what i am missing
Hi @novinosrin ,
C is the only loan with a different OPENDATE, so although the balance is near same, I don't want to consider it as a possible duplicate within the QTR.
So basically, the preliminary test here as follows
data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;
data test;
set data;
by cid;
k=lag(balance);
if not first.cid then diff_pct=round(abs((balance-k)/balance)*100);
run;
The dif_pct column has the values 9,10,150 considering 400 is too low as you mentioned
cid loanid servicer qtr balance opendate k diff_pct a A s1 14610 1000 14245 . . a B s2 14610 1100 14245 1000 9 a C s3 14610 1000 14610 1100 10 a D s4 14610 400 14245 1000 150
However, wouldn't A,B and C qualify?
I think I understand, but how is the lag function working in this case. Does it take only the first observation BALANCE, and then all the other balances are checked against that?
Would there be a way to check every balance against every other? Such that we would end up with diff_pct1 to n, where n is the number of loans being checked? And then ultimately come up with a set of LOANIDs that all had balance within set % of eachother?
Could you post a sample of your expected output to the input sample and leave some comments to why(as in the logic)? plz
@publicSynechism Lag basically checks for the value in the previous obs in a queue. For example lag1 is previous, lag2 is current looking above 2 and so on upto lagN
INPUT cid loanid servicer qtr balance opendate a A s1 01/01/2000 1000 01/01/1999 a B s2 01/01/2000 1100 01/01/1999 a C s3 01/01/2000 1000 01/01/2000 a D s4 01/01/2000 400 01/01/1999
OUTPUT cid loanid servicer qtr balance opendate dup_flag a A s1 01/01/2000 1000 01/01/1999 1 a B s2 01/01/2000 1100 01/01/1999 1 a C s3 01/01/2000 1000 01/01/2000 . a D s4 01/01/2000 400 01/01/1999 0
I'm looking only at same CID QTR and OPENDATE, so C got excluded from the check, but A,B had close enough balance to be considered duplicates.
I understand what you coded now, and it works well in this case, but I'm wondering in a more complex case with many LOANIDs in a single quarter if it could be done. Correct me if I'm wrong, but I'm thinking some sort of k-nearest neighbors grouping on Balance and then flagging within the group as duplicates. (Sorry, I'm still new to SAS and programming in general)
I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.
So if you play with lag using
by CID QTR OPENDATE;
with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it
data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.