BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
publicSynechism
Fluorite | Level 6

I'm trying to group by CID QTR OPENDATE, and flag records as duplicates if the balances are near the same (say within 15% of each other). So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low.

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.

 

So if you play with lag using 

 

by CID QTR OPENDATE;

 

with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it

 


data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;

 

 

 

 

View solution in original post

7 REPLIES 7
novinosrin
Tourmaline | Level 20

Hi @publicSynechism 

 

With this

"So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."

 

I am wondering why have you left C? I.e

1100-1000/1000=  10%

 

?

Not sure what i am missing

publicSynechism
Fluorite | Level 6

Hi @novinosrin ,

 

C is the only loan with a different OPENDATE, so although the balance is near same, I don't want to consider it as a possible duplicate within the QTR.

novinosrin
Tourmaline | Level 20

So basically, the preliminary test here as follows

 

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

data test;
set data;
by cid;
k=lag(balance);
if not first.cid then diff_pct=round(abs((balance-k)/balance)*100);
run;

The dif_pct column has the values 9,10,150 considering 400 is too low as you mentioned

cid loanid servicer qtr balance opendate k diff_pct 
a A s1 14610 1000 14245 . . 
a B s2 14610 1100 14245 1000 9 
a C s3 14610 1000 14610 1100 10 
a D s4 14610 400 14245 1000 150 

However, wouldn't A,B and C qualify?

publicSynechism
Fluorite | Level 6

I think I understand, but how is the lag function working in this case. Does it take only the first observation BALANCE, and then all the other balances are checked against that?

 

Would there be a way to check every balance against every other? Such that we would end up with diff_pct1 to n, where n is the number of loans being checked? And then ultimately come up with a set of LOANIDs that all had balance within set % of eachother?

novinosrin
Tourmaline | Level 20

Could you post a sample of your expected  output to the input sample and leave some comments to why(as in the logic)? plz

 

@publicSynechism   Lag basically checks for the value in the previous obs in a queue. For example lag1 is previous, lag2 is current looking above 2 and so on upto lagN

publicSynechism
Fluorite | Level 6
INPUT

cid loanid servicer qtr balance opendate
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
OUTPUT

cid loanid servicer qtr balance opendate dup_flag
a A s1 01/01/2000 1000 01/01/1999 1
a B s2 01/01/2000 1100 01/01/1999 1
a C s3 01/01/2000 1000 01/01/2000 .
a D s4 01/01/2000 400 01/01/1999 0

I'm looking only at same CID QTR and OPENDATE, so C got excluded from the check, but A,B had close enough balance to be considered duplicates. 

 

I understand what you coded now, and it works well in this case, but I'm wondering in a more complex case with many LOANIDs in a single quarter if it could be done. Correct me if I'm wrong, but I'm thinking some sort of k-nearest neighbors grouping on Balance and then flagging within the group as duplicates. (Sorry, I'm still new to SAS and programming in general) 

novinosrin
Tourmaline | Level 20

I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.

 

So if you play with lag using 

 

by CID QTR OPENDATE;

 

with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it

 


data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;

 

 

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1074 views
  • 1 like
  • 2 in conversation