<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Flagging Duplicates within Grouped Records Based off Near Matches in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540284#M149014</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/216525"&gt;@publicSynechism&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With this&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I am wondering why have you left C? I.e&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;1100-1000/1000=&amp;nbsp; 10%&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Not sure what i am missing&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 04 Mar 2019 23:09:45 GMT</pubDate>
    <dc:creator>novinosrin</dc:creator>
    <dc:date>2019-03-04T23:09:45Z</dc:date>
    <item>
      <title>Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540274#M149010</link>
      <description>&lt;P&gt;I'm trying to group by CID QTR OPENDATE, and flag records as duplicates if the balances are near the same (say within 15% of each other). So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 04 Mar 2019 22:40:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540274#M149010</guid>
      <dc:creator>publicSynechism</dc:creator>
      <dc:date>2019-03-04T22:40:52Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540284#M149014</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/216525"&gt;@publicSynechism&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With this&lt;/P&gt;
&lt;P&gt;"&lt;SPAN&gt;So in this case, I would be looking at LOANIDs {A,B,D} and A,B would be flagged as duplicates, but not D since the balance was too low."&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I am wondering why have you left C? I.e&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;1100-1000/1000=&amp;nbsp; 10%&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Not sure what i am missing&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:09:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540284#M149014</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-03-04T23:09:45Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540285#M149015</link>
      <description>&lt;P&gt;So basically, the preliminary test here as follows&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data data;
input cid $ loanid $ servicer $ qtr mmddyy10. balance opendate mmddyy10.;
datalines;
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999
;
run;

data test;
set data;
by cid;
k=lag(balance);
if not first.cid then diff_pct=round(abs((balance-k)/balance)*100);
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The dif_pct column has the values 9,10,150 considering 400 is too low as you mentioned&lt;/P&gt;
&lt;PRE&gt;cid loanid servicer qtr balance opendate k diff_pct 
a A s1 14610 1000 14245 . . 
a B s2 14610 1100 14245 1000 9 
a C s3 14610 1000 14610 1100 10 
a D s4 14610 400 14245 1000 150 

&lt;/PRE&gt;
&lt;P&gt;However, wouldn't A,B and C qualify?&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:16:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540285#M149015</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-03-04T23:16:36Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540286#M149016</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/138205"&gt;@novinosrin&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;C is the only loan with a different OPENDATE, so although the balance is near same, I don't want to consider it as a possible duplicate within the QTR.&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:17:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540286#M149016</guid>
      <dc:creator>publicSynechism</dc:creator>
      <dc:date>2019-03-04T23:17:11Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540288#M149017</link>
      <description>&lt;P&gt;I think I understand, but how is the lag function working in this case. Does it take only the first observation BALANCE, and then all the other balances are checked against that?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Would there be a way to check every balance against every other? Such that we would end up with diff_pct1 to n, where n is the number of loans being checked? And then ultimately come up with a set of LOANIDs that all had balance within set % of eachother?&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:21:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540288#M149017</guid>
      <dc:creator>publicSynechism</dc:creator>
      <dc:date>2019-03-04T23:21:39Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540289#M149018</link>
      <description>&lt;P&gt;Could you post a sample of your expected&amp;nbsp; output to the input sample and leave some comments to why(as in the logic)? plz&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/216525"&gt;@publicSynechism&lt;/a&gt;&amp;nbsp; &amp;nbsp;Lag basically checks for the value in the previous obs in a queue. For example lag1 is previous, lag2 is current looking above 2 and so on upto lag&lt;STRONG&gt;N&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:29:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540289#M149018</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-03-04T23:29:40Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540291#M149020</link>
      <description>&lt;PRE&gt;INPUT

cid loanid servicer qtr balance opendate
a A s1 01/01/2000 1000 01/01/1999
a B s2 01/01/2000 1100 01/01/1999
a C s3 01/01/2000 1000 01/01/2000
a D s4 01/01/2000 400 01/01/1999&lt;/PRE&gt;&lt;PRE&gt;OUTPUT

cid loanid servicer qtr balance opendate dup_flag
a A s1 01/01/2000 1000 01/01/1999 1
a B s2 01/01/2000 1100 01/01/1999 1
a C s3 01/01/2000 1000 01/01/2000 .
a D s4 01/01/2000 400 01/01/1999 0&lt;/PRE&gt;&lt;P&gt;I'm looking only at same CID QTR and OPENDATE, so C got excluded from the check, but A,B had close enough balance to be considered duplicates.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I understand what you coded now, and it works well in this case, but I'm wondering in a more complex case with many LOANIDs in a single quarter if it could be done. Correct me if I'm wrong, but I'm thinking some sort of k-nearest neighbors grouping on Balance and then flagging within the group as duplicates. (Sorry, I'm still new to SAS and programming in general)&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:33:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540291#M149020</guid>
      <dc:creator>publicSynechism</dc:creator>
      <dc:date>2019-03-04T23:33:00Z</dc:date>
    </item>
    <item>
      <title>Re: Flagging Duplicates within Grouped Records Based off Near Matches</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540294#M149022</link>
      <description>&lt;P&gt;I'm afraid &lt;STRONG&gt;KNN&lt;/STRONG&gt; non parametric method is not my area of expertise but if you want some programming help, I may be able to.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So if you play with lag using&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;by CID QTR OPENDATE;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;with OPENDATE being the child group/inner most sub group, with first and last, you will get the hang of it&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;data test1;&lt;BR /&gt;set data;&lt;BR /&gt;&lt;EM&gt;&lt;STRONG&gt;by CID QTR OPENDATE;&lt;/STRONG&gt;&lt;/EM&gt;&lt;BR /&gt;k=lag(balance);&lt;BR /&gt;if not first.OPENDATE then diff_pct=round(abs((balance-k)/balance)*100);&lt;BR /&gt;run;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Mar 2019 23:49:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Flagging-Duplicates-within-Grouped-Records-Based-off-Near/m-p/540294#M149022</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-03-04T23:49:45Z</dc:date>
    </item>
  </channel>
</rss>

