I am completely new to predictive analytics and just got back from SAS training last week, so please forgive my novice questions.

My co-worker and I are trying to figure out the appropriate format for our dataset. We are looking at institutions that receive federal funding, and we are trying to predict which institutions will have significant problems and are therefore a higher risk. This prediction will be used to select auditees. The dependent variable we are using is the type of audit report.

We have several questions about our datasets, specifically the sample dataset we use to build our model.

First, our dataset includes the same institution multiple times because it was audited multiple times. This is an example of what I mean:

DUNS ID     AUDITYEAR
000323667   2001
000323667   2002
000323667   2003
067211318   2005
067211318   2006
067211318   2007

We want our output to be an estimate of the probability that each institution will have problems. However, we are concerned that we will get an estimate for each institution in each year. What we want is an overall estimate for each institution, regardless of year. Can we keep our data in this format and get an output like that?

Our second concern is about the sample data we will use to estimate the model. Not every institution in our dataset has been audited, so some institutions have no audit report (which is the dependent variable). Should we limit the sample dataset to only those institutions that have been audited when estimating our model?
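To make the two questions concrete, here is a minimal sketch of the reshaping we have in mind. It is written in plain Python rather than SAS, only to illustrate the data layout. The field names (`duns_id`, `audit_year`, `report_type`), the report values (`"clean"`, `"problem"`), and the unaudited institution `999999999` are all hypothetical placeholders, not our real data. It shows one possible layout, not necessarily the right modeling approach.

```python
from collections import defaultdict

# One row per institution per audit year, as in the example above.
# report_type values are hypothetical; None marks an institution
# that has never been audited (no dependent variable available).
rows = [
    {"duns_id": "000323667", "audit_year": 2001, "report_type": "clean"},
    {"duns_id": "000323667", "audit_year": 2002, "report_type": "problem"},
    {"duns_id": "000323667", "audit_year": 2003, "report_type": "clean"},
    {"duns_id": "067211318", "audit_year": 2005, "report_type": "clean"},
    {"duns_id": "067211318", "audit_year": 2006, "report_type": "clean"},
    {"duns_id": "067211318", "audit_year": 2007, "report_type": "clean"},
    {"duns_id": "999999999", "audit_year": None, "report_type": None},  # hypothetical
]


def split_by_outcome(rows):
    """Separate rows that have an audit report from rows that do not."""
    audited = [r for r in rows if r["report_type"] is not None]
    unaudited = [r for r in rows if r["report_type"] is None]
    return audited, unaudited


def collapse_to_institution(audited_rows, problem_label="problem"):
    """Summarize audited rows into one record per institution."""
    grouped = defaultdict(list)
    for r in audited_rows:
        grouped[r["duns_id"]].append(r)

    summary = {}
    for duns_id, group in grouped.items():
        n_problem = sum(1 for r in group if r["report_type"] == problem_label)
        summary[duns_id] = {
            "n_audits": len(group),
            "n_problem": n_problem,
            "ever_problem": n_problem > 0,
            "first_year": min(r["audit_year"] for r in group),
            "last_year": max(r["audit_year"] for r in group),
        }
    return summary


audited, unaudited = split_by_outcome(rows)
summary = collapse_to_institution(audited)

for duns_id, record in summary.items():
    print(duns_id, record)
```

In this sketch the audited rows would be the ones available for estimating the model, and the unaudited rows would be held aside to be scored afterward. Whether collapsing to one row per institution is preferable to keeping the year-level rows is exactly what we are unsure about.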