I'm fighting with "dirty" data that I'm trying to cleanse.
I prepared some example data below to illustrate what I have.
The issue is with values for variable sum_to_date.
The values should be ascending by date, but the reality looks very much like the test data.
What I need is "streamlined" data: whenever a value is obviously wrong (missing, obviously out of range, or not ascending at a later date) I should apply some interpolation algorithm and calculate an approximate value.
All my attempts so far have ended in cumbersome code that didn't cater for all cases.
It feels like I should use some statistical approach, but I'm not sure how to do this.
SAS/STAT is licensed (SAS/ETS is not).
In reality the data are odometer readings taken when people refill their tanks (so time intervals should be taken into account when calculating values).
I will have anywhere from a few up to around 200 records per ID.
Any suggestions on how to solve this are very welcome.
Suggest you look at DATA step and BY-group processing, along with possibly the LAGn function, combined with FIRST. and/or LAST. processing on your ID variable. All this would come after reading in your data and then sorting it by ID and DATE.
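A minimal sketch of that idea, assuming dataset and variable names HAVE, ID, DATE and SUM_TO_DATE taken from the question (the poster's actual names may differ):

```sas
/* Sort so BY-group logic and LAG work within each ID */
proc sort data=have;
   by id date;
run;

data flagged;
   set have;
   by id;
   prev = lag(sum_to_date);       /* reading from the previous row      */
   if first.id then prev = .;     /* LAG crosses ID boundaries - reset  */
   /* Flag missing or non-ascending readings as suspect */
   suspect = (sum_to_date = .) or (prev > . and sum_to_date < prev);
run;
```

Note that LAG is called unconditionally; calling it inside an IF would corrupt its queue across rows.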
Suggested Google advanced search arguments for this topic/post:
It sounds like you are willing to trust the dates but not the ODO readings themselves. In that case, you can use the sort-and-lag approach that Scott described to identify the errant ODO readings. Then you could set them to missing and apply PROC MI for a principled approach to imputation.
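A hedged sketch of that route, assuming a SUSPECT flag has already been built (flag name, dataset names and PROC MI options here are illustrative only; PROC MI requires SAS/STAT, which the poster has licensed):

```sas
/* Turn the suspect readings into missings */
data tomi;
   set have;
   if suspect then sum_to_date = .;
run;

proc sort data=tomi;
   by id;
run;

/* Impute the missings; including DATE lets MI respect the time trend */
proc mi data=tomi nimpute=5 seed=42 out=imputed;
   by id;
   var date sum_to_date;
run;
```

With only a few records for some IDs, per-ID imputation may be unstable; pooling IDs or falling back to simple linear interpolation is worth considering in those cases.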
Suggest you share your code for constructive feedback. Not sure how you would accomplish the look-ahead unless you sorted the data in descending DATE order within ascending ID; how would you otherwise know how many records to look ahead? Best to share what code you have for this type of post.
In BY-group processing, you could store each customer's readings in an array indexed on reading date:
array creading(&earliest:&latest) _temporary_ ;
You'll know how big that array might be, and a DATA step can easily store thousands of array elements.
Then you can manipulate the "customer experience" as you need:
- flag bills without readings
- assess and adjust for (even) daily consumption
- compute a standard deviation to assess unlikely readings
Lots of options.
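A hedged sketch of that temporary-array idea, assuming &earliest and &latest are macro variables holding the SAS date range, and HAVE, ID, DATE and SUM_TO_DATE are the names from the question:

```sas
data daily(keep=id d reading);
   array creading(&earliest:&latest) _temporary_;
   call missing(of creading(*));          /* reset array for each customer */

   /* Load one customer's readings, indexed by SAS date (DOW loop) */
   do until (last.id);
      set have;
      by id;
      if sum_to_date > . then creading(date) = sum_to_date;
   end;

   /* Example use: linear interpolation between known readings, day by day */
   lastd = .;
   do d = &earliest to &latest;
      if creading(d) > . then do;
         if lastd > . then do g = lastd + 1 to d - 1;
            creading(g) = creading(lastd)
               + (creading(d) - creading(lastd)) * (g - lastd) / (d - lastd);
         end;
         lastd = d;
      end;
   end;

   /* Write out the filled daily series */
   do d = &earliest to &latest;
      reading = creading(d);
      if reading > . then output;
   end;
   format d date9.;
run;
```

Once the daily series exists, flagging out-of-range readings (e.g. by standard deviation of daily consumption) becomes a simple pass over the same array.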