<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cleaning randomly messy data in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</link>
    <description>&lt;P&gt;So let's say I have the following fields:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class="  language-sas"&gt;DATA HAVE;&lt;BR /&gt;INFILE DATALINES ;&lt;BR /&gt; input ID NAME $ STREET $ CITY $ STATE $ POSTCODE $ RELATIONSHIP $ STATUS $ PURPOSE $ DONATION $;&lt;BR /&gt; DATALINES;&lt;BR /&gt;201 AAA Market Philadelphia PA 4109 Parent Open Counselling 10000&lt;BR /&gt;201 ABC Chestnut Arlington TX 1093 None Open General 1500&lt;BR /&gt;201 BCD Walnut Walnut Sidney NY 3201 None Open General &lt;BR /&gt;1999 201 EFG Cross Kansas TX 1091 Parent Close Sports &lt;BR /&gt;1491 202 EFG Cluedo Street Phoenix AZ 2012 Close General 1900&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/PRE&gt;&lt;P&gt;Which gives me the following output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="13.JPG" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/23470iA7CAC62B69EEF9E3/image-size/large?v=v2&amp;amp;px=999" role="button" title="13.JPG" alt="13.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can see there are three problems here:&lt;/P&gt;&lt;P&gt;1) The street "Walnut" has been imported twice, shifting the values in the later columns one position to the right.&lt;/P&gt;&lt;P&gt;2) The street "Cluedo Street" has been split across two lines instead of one, causing a similar shift.&lt;/P&gt;&lt;P&gt;3) The "Relationship" value is omitted in the final row. Where it should read "None", it is missing altogether, so even in the absence of the first two errors, the "Relationship" column here would read "Close" instead of "None".&lt;/P&gt;&lt;P&gt;Suppose there are thousands of issues similar, but not identical, to the ones above in a data set with millions of observations. They will be similar in that each usually involves a random or repeated omission of some fields, duplicated values, or values spanning more columns than they should.&lt;/P&gt;&lt;P&gt;Assuming the exact same column names as above, is it plausible to create criteria, or a program, that could reasonably correct most of these issues? Or do I simply need to request a cleaner data set?&lt;/P&gt;</description>
    <pubDate>Sun, 23 Sep 2018 09:20:18 GMT</pubDate>
    <dc:creator>UniversitySas</dc:creator>
    <dc:date>2018-09-23T09:20:18Z</dc:date>
    <item>
      <title>Cleaning randomly messy data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</link>
      <description>&lt;P&gt;So let's say I have the following fields:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE class="  language-sas"&gt;DATA HAVE;&lt;BR /&gt;INFILE DATALINES ;&lt;BR /&gt; input ID NAME $ STREET $ CITY $ STATE $ POSTCODE $ RELATIONSHIP $ STATUS $ PURPOSE $ DONATION $;&lt;BR /&gt; DATALINES;&lt;BR /&gt;201 AAA Market Philadelphia PA 4109 Parent Open Counselling 10000&lt;BR /&gt;201 ABC Chestnut Arlington TX 1093 None Open General 1500&lt;BR /&gt;201 BCD Walnut Walnut Sidney NY 3201 None Open General &lt;BR /&gt;1999 201 EFG Cross Kansas TX 1091 Parent Close Sports &lt;BR /&gt;1491 202 EFG Cluedo Street Phoenix AZ 2012 Close General 1900&lt;BR /&gt;;&lt;BR /&gt;RUN;&lt;/PRE&gt;&lt;P&gt;Which gives me the following output:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="13.JPG" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/23470iA7CAC62B69EEF9E3/image-size/large?v=v2&amp;amp;px=999" role="button" title="13.JPG" alt="13.JPG" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can see there are three problems here:&lt;/P&gt;&lt;P&gt;1) The street "Walnut" has been imported twice, shifting the values in the later columns one position to the right.&lt;/P&gt;&lt;P&gt;2) The street "Cluedo Street" has been split across two lines instead of one, causing a similar shift.&lt;/P&gt;&lt;P&gt;3) The "Relationship" value is omitted in the final row. Where it should read "None", it is missing altogether, so even in the absence of the first two errors, the "Relationship" column here would read "Close" instead of "None".&lt;/P&gt;&lt;P&gt;Suppose there are thousands of issues similar, but not identical, to the ones above in a data set with millions of observations. They will be similar in that each usually involves a random or repeated omission of some fields, duplicated values, or values spanning more columns than they should.&lt;/P&gt;&lt;P&gt;Assuming the exact same column names as above, is it plausible to create criteria, or a program, that could reasonably correct most of these issues? Or do I simply need to request a cleaner data set?&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2018 09:20:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498193#M132330</guid>
      <dc:creator>UniversitySas</dc:creator>
      <dc:date>2018-09-23T09:20:18Z</dc:date>
    </item>
    <item>
      <title>Re: Cleaning randomly messy data</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498198#M132334</link>
      <description>&lt;P&gt;If your data items can contain blanks, you need to use a delimiter other than blank, or have the data items enclosed in quotes, so that the DSD option can be used.&lt;/P&gt;</description>
      <pubDate>Sun, 23 Sep 2018 09:38:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Cleaning-randomly-messy-data/m-p/498198#M132334</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2018-09-23T09:38:09Z</dc:date>
    </item>
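A minimal sketch of the delimiter/DSD approach the reply describes, assuming the extract can be regenerated as a comma-delimited file with quoted fields; the LENGTH values and sample lines below are hypothetical. With DSD, two consecutive delimiters yield a missing value instead of shifting later columns, and a quoted value such as "Cluedo Street" stays in one field even though it contains a blank:

```sas
/* Sketch only: re-reading the extract as comma-delimited with DSD. */
data have;
  infile datalines dsd dlm=',' truncover;
  length id 8 name $8 street $20 city $20 state $2
         postcode $8 relationship $10 status $8 purpose $15 donation $8;
  input id name street city state postcode relationship status purpose donation;
  datalines;
201,BCD,"Walnut Walnut",Sidney,NY,3201,None,Open,General,1999
202,EFG,"Cluedo Street",Phoenix,AZ,2012,,Close,General,1900
;
run;
/* In the second line, the empty field between commas becomes a
   missing RELATIONSHIP rather than pulling STATUS into its place. */
```

This does not repair an already mis-parsed data set; it avoids the mis-parsing at read time, which is why re-requesting the extract in a quoted, delimited format is usually the more reliable fix.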
  </channel>
</rss>

