So I am 100% new to predictive analytics and I just got back from SAS training last week. Forgive my novice questions. My co-worker and I are trying to figure out the appropriate format for our dataset. We are looking at institutions that receive federal funding and we're trying to predict which institutions will have significant problems and are therefore a higher risk. This prediction will be used to select auditees. The dependent variable we are using is the type of audit report. We have several questions regarding our datasets, specifically the sample dataset we use to create our model. First, our dataset includes the same institution multiple times because it was audited multiple times. This is an example of what I mean:
DUNS ID AUDITYEAR
000323667 2001
000323667 2002
000323667 2003
067211318 2005
067211318 2006
067211318 2007
We want our output to be an estimate of the probability that these institutions will have problems; however, we are concerned that we will get an estimate for each institution in each year. What we want is an overall estimate for each institution, regardless of year. Can we keep our data in this format and get an output like that?
Another concern we have regards the sample data that we will be using to estimate the model. Not everyone in our dataset has been audited, so not every institution will have an audit report (which is the dependent variable). Should we limit the sample dataset to only those institutions that have been audited for the purpose of estimating our model?
I don't think that's an appropriate set up. Perhaps:
DUNS_ID, Num_Audits_2005 or something like that?
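To make the one-row-per-institution idea concrete, here is a minimal Python sketch (not SAS or Enterprise Miner). The DUNS IDs and years are the ones from the question; the `concern_level` values, the 0 to 3 scale, and the summary column names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical audit-level rows: (duns_id, audit_year, concern_level).
# concern_level is an assumed 0-3 scale (0 = no issues, 3 = major problems).
audits = [
    ("000323667", 2001, 0),
    ("000323667", 2002, 2),
    ("000323667", 2003, 1),
    ("067211318", 2005, 3),
    ("067211318", 2006, 3),
    ("067211318", 2007, 0),
]


def one_row_per_institution(rows):
    """Collapse institution-year rows into one summary row per DUNS ID."""
    grouped = defaultdict(list)
    for duns_id, year, level in rows:
        grouped[duns_id].append((year, level))

    summary = {}
    for duns_id, history in grouped.items():
        history.sort()  # chronological order
        levels = [level for _, level in history]
        summary[duns_id] = {
            "num_audits": len(levels),
            "worst_level": max(levels),
            "latest_level": levels[-1],
            "first_year": history[0][0],
            "last_year": history[-1][0],
        }
    return summary


summary = one_row_per_institution(audits)
```

With one row per DUNS ID, a model would give one estimate per institution rather than one per institution-year. Which summary columns are worth keeping is a judgment call.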
I'm assuming the audit report will actually indicate whether there are significant issues, rather than whether or not you had an audit. Are there no institutions that had significant issues but were not audited?
You bring up an interesting point. The audit report will indicate if there are significant issues. If they have an audit report, they had an audit. Institutions could have significant issues even if they were not audited; however, we have no way of knowing about those issues without an audit having been completed - at least nothing in a dataset.
That is the right question to ask, as Reeza did.
What will your target (the probability you want to predict) be, and what are the predictors?
If your assumption is that the type of audit report in a given year is a predictor for the future, you could:
- use a calculation over, say, the preceding years to predict the outcome in the year you are observing.
- treat the years without an audit as observations as well. Those are the other cases.
Shifting the data in time can give a more recognizable pattern. You never know for sure how you should treat the data to get it.
That is the hard job of analytics.
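The "preceding years" idea could be sketched like this in Python. This is only an illustration: the concern levels and the two-year lookback are assumptions, and a year without an audit is recorded as `None` so it stays visible as an observation.

```python
def lagged_rows(history, lookback=2):
    """Build one modeling row per audited year for a single institution.

    history maps audit year -> concern level (hypothetical 0-3 scale).
    Each row holds the target for that year plus the levels from the
    preceding `lookback` years, most recent first (None = no audit).
    """
    rows = []
    for year in sorted(history):
        prior = [history.get(year - k) for k in range(1, lookback + 1)]
        rows.append({"year": year, "target": history[year], "prior_levels": prior})
    return rows


# Hypothetical history for one institution.
history = {2001: 0, 2002: 2, 2003: 1}
rows = lagged_rows(history)
```

Here the 2003 row carries the 2002 and 2001 results as predictors, while the 2001 row has no prior information at all.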
So you are saying that:
- aside from the information about being audited, you also have the result of the audit in some classes, e.g. no issues, some minor issues, several major, many major?
- results of issues, such as not being reliable enough to get more funding (failures)?
- could you have some social media information on their reliability to use as a negative result for them? If not, I agree with you: knowing nothing about them makes any kind of analysis difficult.
So you are saying that:
- aside from the information about being audited, you also have the result of the audit in some classes, e.g. no issues, some minor issues, several major, many major?
Correct. Each audit report is issued with a level of concern ranging from none to big problems.
- results of issues, such as not being reliable enough to get more funding (failures)?
We're using the results of the audit as our target to try and predict other institutions to audit. We also have a bunch of other inputs such as late project reports and not filing their tax returns.
Am I correct in assuming this is tax auditor data? Have you consulted with any of the SUG's in your area?
*SUG's = SAS Users Groups
We are actually auditors with an Office of Inspector General. There is a SAS Users Group in our area (DC); however, they seem to focus on Base SAS, and we are using SAS Enterprise Miner and don't know how to use Base SAS.
Hi bmoon,
Thanks for joining the community! A word on the SAS Users Group in DC - I encourage you to reach out to group contact Arthur Furnia (arthur.furnia@faa.gov) and offer SAS Enterprise Miner as a topic of interest for you. Perhaps it can be added to future meetings/discussions. Also, keep an eye out for topics at the SESUG (Southeast SAS Users Group) annual meeting. It's in Myrtle Beach this year and it's possible something there will be related to Enterprise Miner.
Anna
I think you just defined your target value: the level of concern from the audit report.
That whole bunch of other inputs are your predictors.
The interesting point is that it may not be an absolute year that is relevant, but one or more of the preceding years. A signal such as a late report in, e.g., year-4 could be very relevant when combined with a late tax return in year-2. That is quite a different way of thinking from your original question.
Your model will predict the type of errors that might be found if an audit is performed. It will not tell you which institutions to audit, as it's based on the assumption that an audit has occurred.
What's the ultimate data mining goal?
Our ultimate goal is to determine who to audit. We're trying to decide who to audit based on risk - we want to focus our efforts on the institutions that are most likely to have problems.
Some more thoughts.
Make sure you know which variables are known before the audit and which variables are as a result of the audit.
Consider a two stage model.
1. Model the chances of being audited in the first place.
2. Model the chances of having issues once audited. Enterprise Miner has the ability to implement a two stage model. Make sure in the second stage to NOT use any variables that come from the audit, otherwise you won't be able to predict future values.
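As a rough illustration of the two stages, here is a Python sketch that uses plain frequency tables instead of a fitted model. It is not how Enterprise Miner implements a two stage model. The records and the field names (`late_report`, `audited`, `issue`) are hypothetical; `late_report` stands in for any input known before the audit.

```python
from collections import defaultdict

# Hypothetical records. "issue" is only known when an audit took place.
records = [
    {"late_report": True, "audited": True, "issue": True},
    {"late_report": True, "audited": True, "issue": True},
    {"late_report": True, "audited": True, "issue": False},
    {"late_report": True, "audited": False, "issue": None},
    {"late_report": False, "audited": True, "issue": False},
    {"late_report": False, "audited": True, "issue": True},
    {"late_report": False, "audited": False, "issue": None},
    {"late_report": False, "audited": False, "issue": None},
]


def two_stage_rates(rows, predictor):
    """Estimate both stages per value of a single pre-audit predictor.

    Stage 1: share of institutions that were audited.
    Stage 2: share of audited institutions whose audit found issues.
    """
    by_value = defaultdict(list)
    for row in rows:
        by_value[row[predictor]].append(row)

    result = {}
    for value, group in by_value.items():
        audited = [row for row in group if row["audited"]]
        p_issue = None
        if audited:
            # Stage 2 uses audited rows only; no audit-derived inputs appear.
            p_issue = sum(1 for row in audited if row["issue"]) / len(audited)
        result[value] = {
            "p_audited": len(audited) / len(group),
            "p_issue_given_audit": p_issue,
        }
    return result


rates = two_stage_rates(records, "late_report")
```

The only input used is one that is known before the audit, which is the point of the warning about the second stage.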
A second option, which assumes that any non-audited institution is financially stable.
1. Use the levels of the audit as your target variable, possibly with decision tree or clustering models. Allow the institutions that are not audited to be the lowest level.
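A small Python sketch of how that target could be coded, under the assumption stated above that unaudited institutions are treated as the lowest level. The level names and the third DUNS ID are invented for illustration; the real report categories may differ.

```python
# Hypothetical ordered concern levels, lowest first.
CONCERN_LEVELS = ["none", "minor", "major"]


def target_level(report_level):
    """Map an audit report level to an ordinal target.

    Institutions with no audit (report_level is None) get the lowest
    level, which is the assumption behind this option.
    """
    if report_level is None:
        return 0
    return CONCERN_LEVELS.index(report_level)


# Hypothetical data: DUNS ID -> latest report level (None = never audited).
latest_reports = {"000323667": "minor", "067211318": "major", "999999999": None}
targets = {duns: target_level(level) for duns, level in latest_reports.items()}
```

Note that this coding cannot tell an unaudited institution apart from one that was audited with no concerns.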
Hope that helps to clarify things!
Regarding the first option - we were going to use the audit information we had as the indicator that the institution had issues (our target). Would that be okay since it would be the dependent variable, or would we need to find a different target variable?
Isn't the goal to reduce audits, i.e. audits that were not useful or found no mistakes?
If you use the fact that you had an audit as an indicator, you'll simply audit more of the institutions you audited before. Then why implement a model at all, rather than use your prior method of establishing which facilities to audit?