Solved: Flagging Duplicates within Grouped Records Based off Near Matches

publicSynechism · Posted 03-04-2019 05:40 PM

I'm trying to group by CID QTR OPENDATE, and flag records as duplicates if the balances are near the same (say within 15% of each other). So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low.

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

novinosrin · Posted 03-04-2019 06:49 PM

I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.

So if you play with lag using

by CID QTR OPENDATE;

with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it

data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;

View solution in original post

novinosrin · Posted 03-04-2019 06:09 PM

Hi @publicSynechism

With this

"So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."

I am wondering why have you left C? I.e

1100-1000/1000= 10%

?

Not sure what i am missing

publicSynechism · Posted 03-04-2019 06:17 PM

Hi @novinosrin ,

C is the only loan with a different OPENDATE, so although the balance is near same, I don't want to consider it as a possible duplicate within the QTR.

novinosrin · Posted 03-04-2019 06:16 PM

So basically, the preliminary test here as follows

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

data test;
set data;
by cid;
k=lag(balance);
if not first.cid then diff_pct=round(abs((balance-k)/balance)*100);
run;

The dif_pct column has the values 9,10,150 considering 400 is too low as you mentioned

cid loanid servicer qtr balance opendate k diff_pct 
a A s1 14610 1000 14245 . . 
a B s2 14610 1100 14245 1000 9 
a C s3 14610 1000 14610 1100 10 
a D s4 14610 400 14245 1000 150

However, wouldn't A,B and C qualify?

publicSynechism · Posted 03-04-2019 06:21 PM

I think I understand, but how is the lag function working in this case. Does it take only the first observation BALANCE, and then all the other balances are checked against that?

Would there be a way to check every balance against every other? Such that we would end up with diff_pct1 to n, where n is the number of loans being checked? And then ultimately come up with a set of LOANIDs that all had balance within set % of eachother?

novinosrin · Posted 03-04-2019 06:23 PM

Could you post a sample of your expected output to the input sample and leave some comments to why(as in the logic)? plz

@publicSynechism Lag basically checks for the value in the previous obs in a queue. For example lag1 is previous, lag2 is current looking above 2 and so on upto lagN

publicSynechism · Posted 03-04-2019 06:33 PM

INPUT

cid loanid servicer qtr balance opendate
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999

OUTPUT

cid loanid servicer qtr balance opendate dup_flag
a A s1 01/01/2000 1000 01/01/1999 1
a B s2 01/01/2000 1100 01/01/1999 1
a C s3 01/01/2000 1000 01/01/2000 .
a D s4 01/01/2000 400 01/01/1999 0

I'm looking only at same CID QTR and OPENDATE, so C got excluded from the check, but A,B had close enough balance to be considered duplicates.

I understand what you coded now, and it works well in this case, but I'm wondering in a more complex case with many LOANIDs in a single quarter if it could be done. Correct me if I'm wrong, but I'm thinking some sort of k-nearest neighbors grouping on Balance and then flagging within the group as duplicates. (Sorry, I'm still new to SAS and programming in general)

novinosrin · Posted 03-04-2019 06:49 PM

I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.

So if you play with lag using

by CID QTR OPENDATE;

with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it

data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;

Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

Re: Flagging Duplicates within Grouped Records Based off Near Matches

SAS Innovate 2025: Call for Content

Classroom Training Available!