topic Re: Flagging Duplicates within Grouped Records Based off Near Matches in SAS Programming

Flagging Duplicates within Grouped Records Based off Near Matches

publicSynechism — Mon, 04 Mar 2019 22:40:52 GMT

I'm trying to group by CID QTR OPENDATE, and flag records as duplicates if the balances are near the same (say within 15% of each other). So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low.

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

Re: Flagging Duplicates within Grouped Records Based off Near Matches

novinosrin — Mon, 04 Mar 2019 23:09:45 GMT

Hi @publicSynechism

With this

"So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."

I am wondering why have you left C? I.e

1100-1000/1000= 10%

Not sure what i am missing

Re: Flagging Duplicates within Grouped Records Based off Near Matches

novinosrin — Mon, 04 Mar 2019 23:16:36 GMT

So basically, the preliminary test here as follows

data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

data test;
set data;
by cid;
k=lag(balance);
if not first.cid then diff_pct=round(abs((balance-k)/balance)*100);
run;

The dif_pct column has the values 9,10,150 considering 400 is too low as you mentioned

cid loanid servicer qtr balance opendate k diff_pct 
a A s1 14610 1000 14245 . . 
a B s2 14610 1100 14245 1000 9 
a C s3 14610 1000 14610 1100 10 
a D s4 14610 400 14245 1000 150

However, wouldn't A,B and C qualify?

Re: Flagging Duplicates within Grouped Records Based off Near Matches

publicSynechism — Mon, 04 Mar 2019 23:17:11 GMT

Hi @novinosrin ,

C is the only loan with a different OPENDATE, so although the balance is near same, I don't want to consider it as a possible duplicate within the QTR.

Re: Flagging Duplicates within Grouped Records Based off Near Matches

publicSynechism — Mon, 04 Mar 2019 23:21:39 GMT

I think I understand, but how is the lag function working in this case. Does it take only the first observation BALANCE, and then all the other balances are checked against that?

Would there be a way to check every balance against every other? Such that we would end up with diff_pct1 to n, where n is the number of loans being checked? And then ultimately come up with a set of LOANIDs that all had balance within set % of eachother?

Re: Flagging Duplicates within Grouped Records Based off Near Matches

novinosrin — Mon, 04 Mar 2019 23:29:40 GMT

Could you post a sample of your expected output to the input sample and leave some comments to why(as in the logic)? plz

@publicSynechism Lag basically checks for the value in the previous obs in a queue. For example lag1 is previous, lag2 is current looking above 2 and so on upto lagN

Re: Flagging Duplicates within Grouped Records Based off Near Matches

publicSynechism — Mon, 04 Mar 2019 23:33:00 GMT

INPUT

cid loanid servicer qtr balance opendate
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999

OUTPUT

cid loanid servicer qtr balance opendate dup_flag
a A s1 01/01/2000 1000 01/01/1999 1
a B s2 01/01/2000 1100 01/01/1999 1
a C s3 01/01/2000 1000 01/01/2000 .
a D s4 01/01/2000 400 01/01/1999 0

I'm looking only at same CID QTR and OPENDATE, so C got excluded from the check, but A,B had close enough balance to be considered duplicates.

I understand what you coded now, and it works well in this case, but I'm wondering in a more complex case with many LOANIDs in a single quarter if it could be done. Correct me if I'm wrong, but I'm thinking some sort of k-nearest neighbors grouping on Balance and then flagging within the group as duplicates. (Sorry, I'm still new to SAS and programming in general)

Re: Flagging Duplicates within Grouped Records Based off Near Matches

novinosrin — Mon, 04 Mar 2019 23:49:45 GMT

I'm afraid KNN non parametric method is not my area of expertise but if you want some programming help, I may be able to.

So if you play with lag using

by CID QTR OPENDATE;

with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it

data test1;
set data;
by CID QTR OPENDATE;
k=lag(balance);
if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);
run;