<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data cleansing problem in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18326#M2751</link>
    <description>Suggest you share your code for constructive feedback.  Not sure how you would accomplish the look-ahead unless you sorted the data in descending DATE order within ascending ID.  How would you otherwise know how many rows to look ahead?  Best to share what code you have for this type of post.&lt;BR /&gt;
&lt;BR /&gt;
Scott Barry&lt;BR /&gt;
SBBWorks, Inc.</description>
    <pubDate>Mon, 18 Oct 2010 21:33:30 GMT</pubDate>
    <dc:creator>sbb</dc:creator>
    <dc:date>2010-10-18T21:33:30Z</dc:date>
    <item>
      <title>data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18322#M2747</link>
      <description>Hi all&lt;BR /&gt;
&lt;BR /&gt;
I'm fighting with "dirty" data which I try to cleanse.&lt;BR /&gt;
&lt;BR /&gt;
I prepared below some example data to illustrate what I have.&lt;BR /&gt;
&lt;BR /&gt;
The issue is with values for variable sum_to_date.&lt;BR /&gt;
&lt;BR /&gt;
The values should be in ascending order (by date) but the real data look very much like the test data below.&lt;BR /&gt;
&lt;BR /&gt;
What I would need is "streamlined" data, meaning that whenever a value is obviously wrong (missing, obviously out of range, not ascending at a later date) I should use some interpolation algorithm and calculate an approximate value.&lt;BR /&gt;
&lt;BR /&gt;
All my attempts so far ended in some cumbersome code which didn't cater for all cases.&lt;BR /&gt;
&lt;BR /&gt;
It feels like I should use some statistical approach but I'm not sure how to do this.&lt;BR /&gt;
&lt;BR /&gt;
SAS/Stat is licensed (ETS is not).&lt;BR /&gt;
&lt;BR /&gt;
The data in reality are odometer readings taken when people refill their tanks (so time intervals should be taken into account when calculating values).&lt;BR /&gt;
I will have only a few to around 200 records per ID.&lt;BR /&gt;
&lt;BR /&gt;
Any suggestions of how to solve this very welcome.&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
data have;&lt;BR /&gt;
infile datalines dsd dlm=' ' truncover;&lt;BR /&gt;
input id date:date9. sum_to_date;&lt;BR /&gt;
format date date9.;&lt;BR /&gt;
datalines;&lt;BR /&gt;
1 01Oct10 5&lt;BR /&gt;
1 05Oct10 10&lt;BR /&gt;
1 07Oct10 20&lt;BR /&gt;
1 10Oct10 300&lt;BR /&gt;
1 11Oct10 28&lt;BR /&gt;
1 14Oct10 35&lt;BR /&gt;
1 17Oct10 37&lt;BR /&gt;
1 22Oct10 45&lt;BR /&gt;
1 29Oct10 55&lt;BR /&gt;
1 31Oct10 60&lt;BR /&gt;
2 01Oct10 5&lt;BR /&gt;
2 05Oct10  &lt;BR /&gt;
2 07Oct10  &lt;BR /&gt;
2 10Oct10 300&lt;BR /&gt;
2 11Oct10 28&lt;BR /&gt;
2 14Oct10 35&lt;BR /&gt;
2 17Oct10 37&lt;BR /&gt;
2 22Oct10 45&lt;BR /&gt;
2 29Oct10 55&lt;BR /&gt;
2 31Oct10 60&lt;BR /&gt;
3 01Oct10  &lt;BR /&gt;
3 05Oct10 10&lt;BR /&gt;
3 07Oct10  &lt;BR /&gt;
3 10Oct10 300&lt;BR /&gt;
3 11Oct10  &lt;BR /&gt;
3 14Oct10 &lt;BR /&gt;
3 17Oct10 37&lt;BR /&gt;
3 22Oct10 45&lt;BR /&gt;
3 29Oct10  &lt;BR /&gt;
3 31Oct10  &lt;BR /&gt;
4 01Oct10 5  &lt;BR /&gt;
4 05Oct10 10&lt;BR /&gt;
4 07Oct10 20 &lt;BR /&gt;
4 10Oct10 30&lt;BR /&gt;
4 11Oct10 25 &lt;BR /&gt;
4 14Oct10 28&lt;BR /&gt;
4 17Oct10 37&lt;BR /&gt;
4 22Oct10 45&lt;BR /&gt;
4 29Oct10 50 &lt;BR /&gt;
4 31Oct10 60 &lt;BR /&gt;
;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Thanks&lt;BR /&gt;
Patrick</description>
      <pubDate>Mon, 18 Oct 2010 10:17:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18322#M2747</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2010-10-18T10:17:33Z</dc:date>
    </item>
    <item>
      <title>Re: data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18323#M2748</link>
      <description>Suggest you look at DATA step and BY-group processing, along with possibly the LAGn function -- combined with FIRST.&lt;by-variable&gt; and/or LAST.&lt;by-variable&gt; for your ID variable.  All this would come after reading in your data and sorting it by ID and DATE.&lt;BR /&gt;
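&lt;BR /&gt;
A minimal sketch of that idea (untested; it assumes the HAVE data set and variable names from the original post):&lt;BR /&gt;
&lt;BR /&gt;
proc sort data=have; by id date; run;&lt;BR /&gt;
&lt;BR /&gt;
data flagged;&lt;BR /&gt;
set have;&lt;BR /&gt;
by id;&lt;BR /&gt;
prev = lag(sum_to_date);&lt;BR /&gt;
if first.id then prev = .;&lt;BR /&gt;
/* flag readings that are missing or not ascending within an ID */&lt;BR /&gt;
suspect = (missing(sum_to_date) or (not missing(prev) and sum_to_date lt prev));&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
This only compares each reading to its predecessor, so a single high outlier flags the row after it rather than itself -- which is where the look-ahead question below comes in.&lt;BR /&gt;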
&lt;BR /&gt;
Scott Barry&lt;BR /&gt;
SBBWorks, Inc.&lt;BR /&gt;
&lt;BR /&gt;
Suggested Google advanced search arguments, this topic / post:&lt;BR /&gt;
&lt;BR /&gt;
by group processing site:sas.com</description>
      <pubDate>Mon, 18 Oct 2010 14:54:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18323#M2748</guid>
      <dc:creator>sbb</dc:creator>
      <dc:date>2010-10-18T14:54:19Z</dc:date>
    </item>
    <item>
      <title>Re: data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18324#M2749</link>
      <description>It sounds like you are willing to believe the dates but not the odometer readings themselves.  In that case, you can use the sort-and-lag approach that Scott described to identify the errant odometer values.  Then you could change them to missing and apply PROC MI for a principled approach to imputation.</description>
      <pubDate>Mon, 18 Oct 2010 19:06:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18324#M2749</guid>
      <dc:creator>Doc_Duke</dc:creator>
      <dc:date>2010-10-18T19:06:17Z</dc:date>
    </item>
    <item>
      <title>Re: data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18325#M2750</link>
      <description>Hi&lt;BR /&gt;
&lt;BR /&gt;
The dates are very likely correct as they are connected to billing (and therefore validated), the odo readings aren't.&lt;BR /&gt;
&lt;BR /&gt;
I used a DATA step approach with a read-ahead to get the boundaries of "good" readings - and then calculated the "missing" values.&lt;BR /&gt;
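&lt;BR /&gt;
(For what it's worth, the one-row look-ahead itself is the easy part - a sketch, not my actual code:&lt;BR /&gt;
&lt;BR /&gt;
data ahead;&lt;BR /&gt;
merge have&lt;BR /&gt;
have(firstobs=2 keep=id sum_to_date rename=(id=next_id sum_to_date=next_sum));&lt;BR /&gt;
if id ne next_id then call missing(next_sum);&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
It's deciding how far ahead to scan for the next trustworthy reading that made my code cumbersome.)&lt;BR /&gt;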
&lt;BR /&gt;
But as I said: the code just got too complicated. Maybe everything will go away after a good night's sleep &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;
&lt;BR /&gt;
I'll have a look into PROC MI - maybe this is the missing piece of the puzzle that lets me end up with some "good looking" code.&lt;BR /&gt;
&lt;BR /&gt;
Thanks&lt;BR /&gt;
Patrick</description>
      <pubDate>Mon, 18 Oct 2010 20:24:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18325#M2750</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2010-10-18T20:24:38Z</dc:date>
    </item>
    <item>
      <title>Re: data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18326#M2751</link>
      <description>Suggest you share your code for constructive feedback.  Not sure how you would accomplish the look-ahead unless you sorted the data in descending DATE order within ascending ID.  How would you otherwise know how many rows to look ahead?  Best to share what code you have for this type of post.&lt;BR /&gt;
&lt;BR /&gt;
Scott Barry&lt;BR /&gt;
SBBWorks, Inc.</description>
      <pubDate>Mon, 18 Oct 2010 21:33:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18326#M2751</guid>
      <dc:creator>sbb</dc:creator>
      <dc:date>2010-10-18T21:33:30Z</dc:date>
    </item>
    <item>
      <title>Re: data cleansing problem</title>
      <link>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18327#M2752</link>
      <description>In BY-group processing, you could store each customer's readings in a temporary array indexed by reading date:&lt;BR /&gt;
  array creading(&amp;amp;earliest:&amp;amp;latest) _temporary_ ;&lt;BR /&gt;
You'll know how big that array might be; a DATA step can easily hold thousands of array elements.&lt;BR /&gt;
Then you can manipulate the "customer experience" as you need: flag bills without readings, assess and adjust for (even) daily consumption, use the standard deviation to spot unlikely readings ...&lt;BR /&gt;
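&lt;BR /&gt;
For example (a sketch only; &amp;amp;earliest and &amp;amp;latest would hold the date range as SAS date values, and HAVE is assumed sorted by ID and DATE):&lt;BR /&gt;
&lt;BR /&gt;
data _null_;&lt;BR /&gt;
array creading(&amp;amp;earliest:&amp;amp;latest) _temporary_;&lt;BR /&gt;
set have;&lt;BR /&gt;
by id;&lt;BR /&gt;
if first.id then do d = &amp;amp;earliest to &amp;amp;latest;&lt;BR /&gt;
creading(d) = .;&lt;BR /&gt;
end;&lt;BR /&gt;
creading(date) = sum_to_date;&lt;BR /&gt;
if last.id then do;&lt;BR /&gt;
/* scan the whole array here: interpolate gaps, flag unlikely jumps, etc. */&lt;BR /&gt;
end;&lt;BR /&gt;
run;&lt;BR /&gt;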
Lots of options.</description>
      <pubDate>Mon, 18 Oct 2010 22:10:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/data-cleansing-problem/m-p/18327#M2752</guid>
      <dc:creator>Peter_C</dc:creator>
      <dc:date>2010-10-18T22:10:59Z</dc:date>
    </item>
  </channel>
</rss>

