BookmarkSubscribeRSS Feed
Patrick
Opal | Level 21
Hi all

I'm fighting with "dirty" data which I try to cleanse.

I prepared below some example data to illustrate what I have.

The issue is with values for variable sum_to_date.

The values should be in ascending order (by date) but reality looks very much like the test date.

What I would need is "streamlined" data, meaning that whenever data is obviously wrong (missing, obviously out of range, not ascending at a later date) I should use some interpolation algorithm and calculate an approximat value.

All my attempts so far ended in some cumbersome code which didn't cather for all cases.

It feels like I should use some statistical approach but I'm not sure how to do this.

SAS/Stat is licensed (ETS is not).

The data in reality are odo metre readings at time when people re-fill their tanks (so time intervals should be taken in account when calculating values).
I will have only a few to around 200 records per ID.

Any suggestions of how to solve this very welcome.


data have;
infile datalines dsd dlm=' ' truncover;
input id date:date9. sum_to_date;
format date date9.;
datalines;
1 01Oct10 5
1 05Oct10 10
1 07Oct10 20
1 10Oct10 300
1 11Oct10 28
1 14Oct10 35
1 17Oct10 37
1 22Oct10 45
1 29Oct10 55
1 31Oct10 60
2 01Oct10 5
2 05Oct10
2 07Oct10
2 10Oct10 300
2 11Oct10 28
2 14Oct10 35
2 17Oct10 37
2 22Oct10 45
2 29Oct10 55
2 31Oct10 60
3 01Oct10
3 05Oct10 10
3 07Oct10
3 10Oct10 300
3 11Oct10
3 14Oct10
3 17Oct10 37
3 22Oct10 45
3 29Oct10
3 31Oct10
4 01Oct10 5
4 05Oct10 10
4 07Oct10 20
4 10Oct10 30
4 11Oct10 25
4 14Oct10 28
4 17Oct10 37
4 22Oct10 45
4 29Oct10 50
4 31Oct10 60
;
run;


Thanks
Patrick
5 REPLIES 5
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Suggest you look at DATA step and BY GROUP processing, along with possibly the LAGn function -- and combined with FIRST. and/or LAST. with your ID variable. All this would be after reading up your data, then sorting the incoming data by ID and DATE.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search arguments, this topic / post:

by group processing site:sas.com
Doc_Duke
Rhodochrosite | Level 12
It sounds like you are willing to believe the dates but not the ODO readings themselves. In that case, you can the sort and lag approach that Scott described to identify the errant values for the ODO readings. Then you could change them to missings and apply PROC MI to get a principled approach to interpretation.
Patrick
Opal | Level 21
Hi

The dates are very likely correct as they are connected to billing (and therefore validated), the odo readings aren't.

I used a data step approach with reading ahead to get the boundaries of "good" readings - and then calculate the "missing" values.

But as I said: The code got just too complicated. May be everything will now go away after a good nights' sleep 🙂

I'll have a look into PROC MI - may be this is the missing piece in the puzzle to end up with some "good looking" code.

Thanks
Patrick
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Suggest you share your code, for constructive feedback. Not sure how you would accomplish the look-ahead, unless you sorted the data in descending DATE order within ascending ID. How would you otherwise know how many to look ahead? Best to share what code you have for this type of post.

Scott Barry
SBBWorks, Inc.
Peter_C
Rhodochrosite | Level 12
in by-group processing, you could store customer readings in an array indexed on reading date
array creading(&earliest:&latest) _temporary_ ;
you'll know how big that array might be.
A data step could easily store thousands of array elements,
Then you can manipulate "customer experience" as you need - flag bills without readings - assess and adjust for (even) daily consumption - seek std dev to assess unlikely readings ....
lots of options

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 950 views
  • 0 likes
  • 4 in conversation