Stream Mining

hagen85 · Posted 12-17-2011 12:24 PM

Hi all,

For a project at university, I would like to simulate a changing environment. Basically what I want to do is:

Build a prediction model (e.g. neural network) on a certain data set and simulate two things:

1.) How does the model performance change if new data instances arrive (Performance will go down sooner or later).

2.) How does the model performance change if the model is adapted after a new instance arrives. (Performance should remain constant)

Can I simulate both scenarios in SAS? How can I simulate a stream of new data instances arriving? How do I model the adaptation loop mentioned under 2.) in SAS?

Thank you very much in advance for your ideas.

Best Regards

Hagen

DougWielenga · Posted 08-17-2017 09:52 AM

Basically what I want to do is:

Build a prediction model (e.g. neural network) on a certain data set and simulate two things:

1.) How does the model performance change if new data instances arrive (Performance will go down sooner or later).

2.) How does the model performance change if the model is adapted after a new instance arrives. (Performance should remain constant)

For number 1 above, you can simulate data using Base SAS, but you first need to decide how the new observations should be generated. For example, you could draw random values from a chosen distribution for each of your variables and assemble them into 'new' observations to score. If you believe there is a trend in the values (e.g. children's average height at a given age seems to increase each decade), you can build that shift into the randomization. The RAND function in SAS supports many probability distributions, so start there and then, if desired, model how your inputs shift over time and build that estimated drift into the simulated data. Then you can score the data. Some loss of performance over time is expected because populations change, but by monitoring it you can see how far performance has dropped at any given point. When it gets too low, you can refit.
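To make the simulate, score, monitor, and refit idea concrete, here is a minimal sketch. It is written in Python rather than SAS, purely as an illustration; `random.gauss` stands in for a call like RAND('NORMAL', ...). Everything specific in it is an assumption made up for the example: the single input `x`, the drift of 0.5 per time step, the quadratic "true" relationship, the straight-line model, and the refit threshold of 2.0.

```python
import random

def make_batch(rng, t, n=200, drift=0.5):
    """Simulate one batch of (x, y) pairs at time step t.

    Assumption for illustration: the input mean shifts by `drift` per step
    and the true relationship is quadratic, so a straight line fitted on
    early data gets worse as x moves away from where it was trained.
    """
    xs = [rng.gauss(drift * t, 1.0) for _ in range(n)]
    return [(x, 0.5 * x * x + rng.gauss(0.0, 0.5)) for x in xs]

def fit_line(data):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    slope = sxy / sxx
    return slope, my - slope * mx

def mse(model, data):
    """Mean squared error of the fitted line on a batch."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in data) / len(data)

def run(refit_threshold=None, steps=10, seed=42):
    """Score each new batch; optionally refit when the error gets too high."""
    rng = random.Random(seed)
    model = fit_line(make_batch(rng, 0))      # initial training data
    errors = []
    for t in range(1, steps + 1):
        batch = make_batch(rng, t)            # 'new' observations arrive
        err = mse(model, batch)               # monitor performance
        errors.append(err)
        if refit_threshold is not None and err > refit_threshold:
            model = fit_line(batch)           # refit on the latest data
    return errors

frozen = run()                        # scenario 1: model is never refit
monitored = run(refit_threshold=2.0)  # refit whenever error exceeds 2.0
print("final error, frozen model:   ", round(frozen[-1], 2))
print("final error, monitored model:", round(monitored[-1], 2))
```

Both runs use the same seed, so they see the same simulated batches and the only difference is the refit rule. The threshold decides how often refitting happens, and choosing it is a judgment call that depends on how much loss of performance you can tolerate.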

For number 2 above, there is really no such thing as 'tweaking' a model. If you get new observations and run the model again, you are really just fitting a new model. The benefit of keeping an existing model is that it lets you score observations for which you don't yet know the answer. If you continually update your model, you might get slightly better performance, or the difference might be negligible, which is why it is likely better to monitor model performance and refit as needed. An alternative is a nearest-neighbor strategy, where you predict each new observation's value from the observations closest to it. As your data grows, you have more information on which to base these assessments. There is no 'model' that is fit; you are just identifying the people most like your new observation based on the available data. Of course, once the new person's outcome is known, that person becomes part of the training data for the next new observation.
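The nearest-neighbor idea can be sketched as a "predict first, then store" loop. Again, this is an illustrative Python sketch and not SAS code, and the details are assumptions: a single input `x`, k = 5 neighbors, a slow upward drift in the input mean, and 100 labelled observations to start from. With `update=False` the stored data stays frozen, which corresponds to scenario 1; with `update=True` each observation joins the stored data once its outcome is known.

```python
import random

def knn_predict(memory, x, k=5):
    """Average the outcomes of the k stored points closest to x."""
    nearest = sorted(memory, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def stream(rng, n, drift=0.01):
    """Yield (x, y) pairs one at a time; the input mean drifts upward.

    The drift rate and the quadratic relationship are made up for
    this example.
    """
    for t in range(n):
        x = rng.gauss(drift * t, 1.0)
        yield x, 0.5 * x * x + rng.gauss(0.0, 0.5)

def run(update, n=1000, warmup=100, seed=1):
    """Return the mean squared error over the stream after the warm-up."""
    rng = random.Random(seed)
    memory, sq_errors = [], []
    for i, (x, y) in enumerate(stream(rng, n)):
        if i < warmup:
            memory.append((x, y))       # initial labelled data
            continue
        # Predict before the outcome is known.
        sq_errors.append((y - knn_predict(memory, x)) ** 2)
        if update:
            memory.append((x, y))       # outcome known: store it
    return sum(sq_errors) / len(sq_errors)

print("MSE with frozen data:  ", round(run(update=False), 2))
print("MSE with growing data: ", round(run(update=True), 2))
```

Sorting the whole memory for every prediction is fine at this size, but it would need a smarter neighbor search, or a limit on how much history is kept, for a long stream.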

I hope this helps!

Doug
