<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Distance / Similarity with Events in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415710#M21812</link>
    <description>&lt;P&gt;Look at &lt;STRONG&gt;HPSPLIT&lt;/STRONG&gt; to build a simple decision-tree model for fraud classification. Then look at the non-fraud cases misclassified as fraud. If you are right (and somewhat lucky), some of those should be overlooked fraud cases.&lt;/P&gt;</description>
    <pubDate>Wed, 22 Nov 2017 22:54:51 GMT</pubDate>
    <dc:creator>PGStats</dc:creator>
    <dc:date>2017-11-22T22:54:51Z</dc:date>
    <item>
      <title>Distance / Similarity with Events</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415681#M21810</link>
      <description>&lt;DIV class="post-text"&gt;
&lt;P&gt;I am building a fraud model for an insurance company. I have close to 2,000 fraud claims and more than 1M non-fraud claims. Some of the "non-fraud" claims are actually fraud, because they were incorrectly captured and tagged as "non-fraud" in the data. I need to identify the claims that are likely to be fraud but are tagged as non-fraud. I was thinking of measuring the similarity (distance) between the "fraud" claims and the "non-fraud" claims: if a non-fraud claim has low similarity to the fraud claims, it is probably a genuine non-fraud claim. Can clustering (k-means) solve this problem? If I take k=2 and run k-means clustering, ideally all of my fraud plus "could-be fraud" claims should fall into one cluster and the genuine non-fraud claims into the other. However, I have mixed variables, so k-means won't work properly. Is there another algorithm that can solve this problem?&lt;/P&gt;
&lt;/DIV&gt;</description>
      <pubDate>Wed, 22 Nov 2017 21:50:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415681#M21810</guid>
      <dc:creator>Ujjawal</dc:creator>
      <dc:date>2017-11-22T21:50:53Z</dc:date>
    </item>
    <item>
      <title>Re: Distance / Similarity with Events</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415693#M21811</link>
      <description>&lt;P&gt;Try PROC DISCRIM and/or logistic regression (PROC LOGISTIC).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Your event rate is low, though (about 2,000 frauds against more than 1M non-fraud claims), so you will also need to account for that.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
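&lt;P&gt;A minimal sketch of the logistic regression route, under stated assumptions: the data set &lt;CODE&gt;work.claims&lt;/CODE&gt; and the variables &lt;CODE&gt;claim_id&lt;/CODE&gt;, &lt;CODE&gt;fraud&lt;/CODE&gt; (numeric 1/0), &lt;CODE&gt;claim_amount&lt;/CODE&gt; and &lt;CODE&gt;claim_type&lt;/CODE&gt; are hypothetical placeholders, not names from the original post. It fits a model on the existing labels, then lists the claims tagged non-fraud that the model scores as most fraud-like.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical names: work.claims, claim_id, fraud (1/0), claim_amount, claim_type */
proc logistic data=work.claims;
   class claim_type / param=ref;                    /* categorical predictor */
   model fraud(event='1') = claim_amount claim_type;
   output out=work.scored p=p_fraud;                /* predicted probability of fraud */
run;

/* Keep only claims currently tagged non-fraud, most fraud-like first */
proc sort data=work.scored(where=(fraud=0)) out=work.suspects;
   by descending p_fraud;
run;

/* Print the top 100 candidates for manual review */
proc print data=work.suspects(obs=100);
   var claim_id claim_amount claim_type p_fraud;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With an event rate of roughly 0.2%, most predicted probabilities will be small, so it probably makes more sense to rank the claims by &lt;CODE&gt;p_fraud&lt;/CODE&gt; and review the top of the list than to apply a 0.5 cutoff.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;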
&lt;P&gt;FYI: fraud analytics is essentially an unsupervised problem, because we don’t know exactly what the categories are. In many respects it is still an unsolved problem, and SAS has a dedicated Fraud Analytics tool for it.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Are you using Enterprise Miner (EM) or Base SAS?&lt;/P&gt;
</description>
      <pubDate>Wed, 22 Nov 2017 22:11:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415693#M21811</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-11-22T22:11:38Z</dc:date>
    </item>
    <item>
      <title>Re: Distance / Similarity with Events</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415710#M21812</link>
      <description>&lt;P&gt;Look at &lt;STRONG&gt;HPSPLIT&lt;/STRONG&gt; to build a simple decision-tree model for fraud classification. Then look at the non-fraud cases misclassified as fraud. If you are right (and somewhat lucky), some of those should be overlooked fraud cases.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Nov 2017 22:54:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Distance-Similarity-with-Events/m-p/415710#M21812</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2017-11-22T22:54:51Z</dc:date>
    </item>
  </channel>
</rss>

