<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Data Cleaning in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706552#M216860</link>
    <description>&lt;P&gt;Its a huge dataset, but if I run a proc freq it should show me the different values and then I could input those that pop up right?&lt;/P&gt;</description>
    <pubDate>Thu, 17 Dec 2020 07:27:34 GMT</pubDate>
    <dc:creator>shortyofhb</dc:creator>
    <dc:date>2020-12-17T07:27:34Z</dc:date>
    <item>
      <title>Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706547#M216855</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I have a data cleaning question. I was given an external dataset that I imported into SAS. One of the tasks is to clean the dataset and rename the variables.&lt;/P&gt;&lt;P&gt;How would I go about cleaning a variable that has multiple different values. For example, my gender variable has Female, Femalee, Fame, emale, etc and Male, Malee, ale, Maale, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I initially tried to use an If, Then statement but didn't know how to do it since not all the values start with M or F.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:11:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706547#M216855</guid>
      <dc:creator>shortyofhb</dc:creator>
      <dc:date>2020-12-17T07:11:00Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706549#M216857</link>
      <description>&lt;P&gt;Let's start with some logic here. What if the value is "&lt;SPAN&gt;mal"? Do you want it to be Make or Female? Could be both?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:19:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706549#M216857</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2020-12-17T07:19:20Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706550#M216858</link>
      <description>&lt;P&gt;I think if it is just "mal" I would like to make it "Male"&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:20:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706550#M216858</guid>
      <dc:creator>shortyofhb</dc:creator>
      <dc:date>2020-12-17T07:20:40Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706551#M216859</link>
      <description>&lt;P&gt;Here is a brute force mehtod. See if you can use that as a template.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
input gender $;
datalines;
Female  
Femalee 
Fame    
emale   
Male    
Malee   
ale     
Maale   
;

data want;
   set have;
   if      gender in ("Female", "Femalee", "Fame", "emale") then newgender = "Female";
   else if gender in ("Male", "Malee", "ale", "Maale") then newgender = "Male";
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:24:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706551#M216859</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2020-12-17T07:24:38Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706552#M216860</link>
      <description>&lt;P&gt;Its a huge dataset, but if I run a proc freq it should show me the different values and then I could input those that pop up right?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:27:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706552#M216860</guid>
      <dc:creator>shortyofhb</dc:creator>
      <dc:date>2020-12-17T07:27:34Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706553#M216861</link>
      <description>&lt;P&gt;I have to do the same thing but with cities, do you know how I would incorporate propcase into it?&lt;/P&gt;&lt;P&gt;Would I do the same, and then just make a new function that propases the new variable name?&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 07:31:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706553#M216861</guid>
      <dc:creator>shortyofhb</dc:creator>
      <dc:date>2020-12-17T07:31:21Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706570#M216871</link>
      <description>&lt;P&gt;Few suggestions to think about:&lt;/P&gt;
&lt;P&gt;Step 1 - compress any pair of sequence letters (like 'aa' or 'ee' etc.) to a single one ('a' or 'e' etc.).&lt;/P&gt;
&lt;P&gt;Step2 -&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If 1st letter is capital 'F' then it is Female, or if it is 'M' then it is Male otherwise:&lt;/P&gt;
&lt;P&gt;if length(gender) &amp;gt; 4 then it is Female else it is unknown, because 'ale' can be both.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you check all unknown gender you may find some more rules to decide is it a Male or Female. (for example 'eminin' as part of 'Feminine' means 'Female' or 'Nale' is 'Male' as 'N' and 'M' are near letters in QWERT keboard);&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 09:01:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706570#M216871</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-12-17T09:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706574#M216874</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can use the complev function that computes a distance between strings (Levenshtein distance).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=lefunctionsref&amp;amp;docsetTarget=n0l41pdemybegln1oetsh4cctdap.htm&amp;amp;locale=en" target="_blank"&gt;https://documentation.sas.com/?cdcId=pgmsascdc&amp;amp;cdcVersion=9.4_3.5&amp;amp;docsetId=lefunctionsref&amp;amp;docsetTarget=n0l41pdemybegln1oetsh4cctdap.htm&amp;amp;locale=en&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Using &lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31304"&gt;@PeterClemmensen&lt;/a&gt; dataset :&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
input gender $;
datalines;
Female  
Femalee 
Fame    
emale   
Male    
Malee   
ale     
Maale   
;

data want;
   set have;
   gender2=ifc(complev(gender,"MALE","i")&amp;lt;complev(gender,"FEMALE","i"), "Male", "Female");
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note that "Fame" becomes "Male" so you can add other criteria that fit better the actual contents of your dataset.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 09:38:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706574#M216874</guid>
      <dc:creator>gamotte</dc:creator>
      <dc:date>2020-12-17T09:38:43Z</dc:date>
    </item>
    <item>
      <title>Re: Data Cleaning</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706676#M216905</link>
      <description>&lt;P&gt;One thing, what should the result of the "cleaning" be?&lt;/P&gt;
&lt;P&gt;Second, did anyone provide a document of what the acceptable answers should be?&lt;/P&gt;
&lt;P&gt;Third by "import" do you mean you used Proc Import to read the data?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With data that has expected values but questionable data entry&amp;nbsp; I quite often write custom informat to read the expected values and adjust it to correct bad values as they occur. I use the _error_ option to write notes to the log of unexpected values so I can add them to the informat.&lt;/P&gt;
&lt;P&gt;Then use a data step to read the data because Proc Import cannot do any such.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;A brief example of the process.&lt;/P&gt;
&lt;PRE&gt;proc format ;
invalue $gender (upcase)
'FEMALE' = 'F'
'MALE'   = 'M'
other    = _error_
;

data example;
   input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;
&lt;/PRE&gt;
&lt;P&gt;This will create values of F and M (you did not indicate what you wanted for the values) or missing when something not in the informat is encountered.&lt;/P&gt;
&lt;P&gt;The log for the data step looks like:&lt;/P&gt;
&lt;PRE&gt;28   data example;
29      input x :gender.;
30   datalines;

NOTE: Invalid data for x in line 37 1-7.
RULE:      ----+----1----+----2----+----3----+----4----+----5---
37         feemale
x=. _ERROR_=1 _N_=7
NOTE: Invalid data for x in line 38 1-3.
38         mal
x=. _ERROR_=1 _N_=8
NOTE: Invalid data for x in line 39 1-5.
39         maale
x=. _ERROR_=1 _N_=9
NOTE: The data set WORK.EXAMPLE has 9 observations and 1
      variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time
&lt;/PRE&gt;
&lt;P&gt;The UPCASE option means any value encountered is converted to all uppercase before comparison with values in the Informat. So I can use the LOG information to add the &lt;STRONG&gt;upper case&lt;/STRONG&gt; versions of the bad values to the Informat.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;proc format ;
invalue $gender (upcase)
'FEMALE' ,'FEEMALE' = 'F'
'MALE','MAL','MAALE'   = 'M'
other    = _error_
;

data example;
   input x :$gender.;
datalines;
Male
mAle
male
female
Female
FemALE
feemale
mal
maale
;

&lt;/PRE&gt;
&lt;P&gt;I use this also for things like location codes as my data sources will often forget to tell me when a new site is activated. So I have a "valid site code" format that throws an error like above when the new code appears in the data so I can ask about the related details such as "where is the site", "what is its name" and relevant items used in reporting.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The above approach will recode any acceptable text though embedded quotes get a very little bit tricky.&lt;/P&gt;
&lt;P&gt;If some of the values are expected to be blank you can use a&amp;nbsp; ' ' = some code to let you know they were blank or missing.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 17 Dec 2020 15:53:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Data-Cleaning/m-p/706676#M216905</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2020-12-17T15:53:23Z</dc:date>
    </item>
  </channel>
</rss>

