I am building a fraud model for an insurance company. I have close to 2,000 fraud claims and more than 1M non-fraud claims. Some of the claims tagged as "non-fraud" are actually fraud, because they were captured or tagged incorrectly in the data. I need to identify the claims that are tagged as non-fraud but are likely to be fraud.

My idea is to measure the similarity (distance) between the known fraud claims and the non-fraud claims. A non-fraud claim with low similarity to the fraud claims is probably a genuine non-fraud, while one with high similarity is a candidate for being mislabeled fraud.

Can clustering (k-means) solve this problem? If I take k=2 and run k-means, ideally the fraud plus "could-be-fraud" claims would fall into one cluster and the genuine non-fraud claims into the other. However, my variables are mixed (numeric and categorical), so k-means won't work properly. Is there another algorithm better suited to this problem?
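To make the distance idea concrete, here is a rough sketch of what I have in mind, not a tested solution. It assumes a Gower-style distance (range-scaled absolute difference for numeric fields, simple match/mismatch for categorical fields), which is one common way to handle mixed variables. The field names (`amount`, `age`, `claim_type`, `channel`) and the toy claims are made up for illustration and are not from my real data.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style distance between two claims given as dicts.

    Fields listed in numeric_ranges contribute |x - y| / range.
    All other fields are treated as categorical: 0 on a match, 1 otherwise.
    The result is the mean over all fields, so it lies in [0, 1].
    """
    total = 0.0
    for field, value in a.items():
        if field in numeric_ranges:
            span = numeric_ranges[field] or 1.0  # guard against constant columns
            total += abs(value - b[field]) / span
        else:
            total += 0.0 if value == b[field] else 1.0
    return total / len(a)


def rank_by_closeness_to_fraud(nonfraud, fraud, numeric_fields):
    """Return (distance to nearest fraud claim, index) pairs, closest first."""
    everything = nonfraud + fraud
    # Range of each numeric field over all claims, used for scaling.
    ranges = {
        f: max(c[f] for c in everything) - min(c[f] for c in everything)
        for f in numeric_fields
    }
    scored = [
        (min(gower_distance(c, f, ranges) for f in fraud), i)
        for i, c in enumerate(nonfraud)
    ]
    return sorted(scored)


# Hypothetical toy claims (made-up values).
fraud = [
    {"amount": 9000.0, "age": 30, "claim_type": "theft", "channel": "online"},
    {"amount": 8500.0, "age": 45, "claim_type": "theft", "channel": "agent"},
]
nonfraud = [
    {"amount": 8800.0, "age": 31, "claim_type": "theft", "channel": "online"},
    {"amount": 500.0, "age": 60, "claim_type": "collision", "channel": "agent"},
    {"amount": 1200.0, "age": 52, "claim_type": "collision", "channel": "phone"},
]

ranked = rank_by_closeness_to_fraud(nonfraud, fraud, ["amount", "age"])
for distance, index in ranked:
    print(index, round(distance, 3))
```

The non-fraud claims at the top of `ranked` would be the ones to review first. Comparing every one of 1M claims against 2,000 fraud claims this way means about 2 billion pairs, so a real run would need a vectorized or batched version rather than this plain loop.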