DATA Step, Macro, Functions and more

data cleansing problem

Respected Advisor
Posts: 4,736

data cleansing problem

Hi all

I'm fighting with "dirty" data which I try to cleanse.

I prepared below some example data to illustrate what I have.

The issue is with values for variable sum_to_date.

The values should be in ascending order (by date) but reality looks very much like the test date.

What I would need is "streamlined" data, meaning that whenever data is obviously wrong (missing, obviously out of range, not ascending at a later date) I should use some interpolation algorithm and calculate an approximat value.

All my attempts so far ended in some cumbersome code which didn't cather for all cases.

It feels like I should use some statistical approach but I'm not sure how to do this.

SAS/Stat is licensed (ETS is not).

The data in reality are odo metre readings at time when people re-fill their tanks (so time intervals should be taken in account when calculating values).
I will have only a few to around 200 records per ID.

Any suggestions of how to solve this very welcome.

data have;
infile datalines dsd dlm=' ' truncover;
input id date:date9. sum_to_date;
format date date9.;
1 01Oct10 5
1 05Oct10 10
1 07Oct10 20
1 10Oct10 300
1 11Oct10 28
1 14Oct10 35
1 17Oct10 37
1 22Oct10 45
1 29Oct10 55
1 31Oct10 60
2 01Oct10 5
2 05Oct10
2 07Oct10
2 10Oct10 300
2 11Oct10 28
2 14Oct10 35
2 17Oct10 37
2 22Oct10 45
2 29Oct10 55
2 31Oct10 60
3 01Oct10
3 05Oct10 10
3 07Oct10
3 10Oct10 300
3 11Oct10
3 14Oct10
3 17Oct10 37
3 22Oct10 45
3 29Oct10
3 31Oct10
4 01Oct10 5
4 05Oct10 10
4 07Oct10 20
4 10Oct10 30
4 11Oct10 25
4 14Oct10 28
4 17Oct10 37
4 22Oct10 45
4 29Oct10 50
4 31Oct10 60

Super Contributor
Super Contributor
Posts: 3,176

Re: data cleansing problem

Suggest you look at DATA step and BY GROUP processing, along with possibly the LAGn function -- and combined with FIRST. and/or LAST. with your ID variable. All this would be after reading up your data, then sorting the incoming data by ID and DATE.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post:

by group processing
Trusted Advisor
Posts: 2,125

Re: data cleansing problem

It sounds like you are willing to believe the dates but not the ODO readings themselves. In that case, you can the sort and lag approach that Scott described to identify the errant values for the ODO readings. Then you could change them to missings and apply PROC MI to get a principled approach to interpretation.
Respected Advisor
Posts: 4,736

Re: data cleansing problem


The dates are very likely correct as they are connected to billing (and therefore validated), the odo readings aren't.

I used a data step approach with reading ahead to get the boundaries of "good" readings - and then calculate the "missing" values.

But as I said: The code got just too complicated. May be everything will now go away after a good nights' sleep :-)

I'll have a look into PROC MI - may be this is the missing piece in the puzzle to end up with some "good looking" code.

Super Contributor
Super Contributor
Posts: 3,176

Re: data cleansing problem

Suggest you share your code, for constructive feedback. Not sure how you would accomplish the look-ahead, unless you sorted the data in descending DATE order within ascending ID. How would you otherwise know how many to look ahead? Best to share what code you have for this type of post.

Scott Barry
SBBWorks, Inc.
Valued Guide
Posts: 2,191

Re: data cleansing problem

in by-group processing, you could store customer readings in an array indexed on reading date
array creading(&earliest:&latest) _temporary_ ;
you'll know how big that array might be.
A data step could easily store thousands of array elements,
Then you can manipulate "customer experience" as you need - flag bills without readings - assess and adjust for (even) daily consumption - seek std dev to assess unlikely readings ....
lots of options
Ask a Question
Discussion stats
  • 5 replies
  • 4 in conversation