<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: dealing with the worst data set in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625025#M184193</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/107435"&gt;@harrylui&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not sure you can do this easily without some kind of dictionary that specifies the 'true' values (like 'Manchester United').&lt;/P&gt;
&lt;P&gt;Maybe a simple PROC FREQ followed by a PROC TRANSPOSE by Club_no could help identify these reference names. In the code below, the first column (name1) holds the most frequent spelling within each Club_no, which is probably, though not certainly, the correct one.&lt;/P&gt;
&lt;P&gt;But to answer your question, I am not sure you can avoid a manual step ...&lt;/P&gt;
&lt;P&gt;The first step is really to get to know your data better so that you can recode them.&lt;/P&gt;
&lt;P&gt;Best,&lt;/P&gt;
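&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Sort so that BY-group processing on Club_no works */
proc sort data=have;
	by Club_no;
run;

/* Count each distinct spelling of name within each Club_no */
proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

/* Put the most frequent spelling first within each Club_no */
proc sort data=have_freq;
	by Club_no descending count;
run;

/* One row per Club_no: name1 = most frequent spelling, name2 = next, ... */
proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;&lt;/CODE&gt;&lt;/PRE&gt;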
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sat, 15 Feb 2020 18:41:46 GMT</pubDate>
    <dc:creator>ed_sas_member</dc:creator>
    <dc:date>2020-02-15T18:41:46Z</dc:date>
    <item>
      <title>dealing with the worst data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625013#M184188</link>
      <description>&lt;P&gt;&lt;EM&gt;Hi all,&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Please see my data set below. It is the worst data set I have ever seen: many names are misspelled or have missing words. Can someone help?&lt;/EM&gt;&lt;/P&gt;
&lt;PRE&gt;     Name                Club no.
1    Manchester united   1234
2    Manchester uit      1234
3    Manchester unite    1234
4    arsenal             3214
5    arsen               3214
6    Tottenham Hotspur   3214
7    Tottenham Hotspu    3214
8    Manchester city     4321
9    Manchester cit      4321
10   laker               7890
11   lake                7890
12   liverpoo ncc        3333
13   liverpool           3333&lt;/PRE&gt;
&lt;P&gt;&lt;EM&gt;What I want to know is: can I write a program that applies the logic below in a DO loop?&lt;/EM&gt;&lt;/P&gt;
&lt;PRE&gt;if row1 Club No. = row2 Club No.
and dif=compged(Name1,Name2)
if dif &amp;gt;= 70 then do;
name1 = name2;
end;&lt;/PRE&gt;
&lt;P&gt;&lt;EM&gt;What I expect to get is:&lt;/EM&gt;&lt;/P&gt;
&lt;PRE&gt;     Name                Club no.
1    Manchester united   1234
2    Manchester united   1234
3    Manchester united   1234
4    arsenal             3214
5    arsenal             3214
6    Tottenham Hotspur   3214
7    Tottenham Hotspur   3214
8    Manchester city     4321
9    Manchester city     4321
10   laker               7890
11   laker               7890
12   liverpoo ncc        3333
13   liverpoo ncc        3333&lt;/PRE&gt;
&lt;P&gt;&lt;EM&gt;Then, by sorting the data with no duplicates, I can get a unique name list, which is my goal. I expect some spelling errors to remain.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;The hardest part is that some clubs share the same club no. There are millions of rows and possibly thousands of unique clubs, so I cannot correct the spelling by typing the right names by hand. I have to rely on the compged(Name1,Name2) score and the club no.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Thanks to all&lt;/P&gt;</description>
      <pubDate>Sat, 15 Feb 2020 14:46:41 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625013#M184188</guid>
      <dc:creator>harrylui</dc:creator>
      <dc:date>2020-02-15T14:46:41Z</dc:date>
    </item>
    <item>
      <title>Re: dealing with the worst data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625025#M184193</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/107435"&gt;@harrylui&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am not sure you can do this easily without some kind of dictionary that specifies the 'true' values (like 'Manchester United').&lt;/P&gt;
&lt;P&gt;Maybe a simple PROC FREQ followed by a PROC TRANSPOSE by Club_no could help identify these reference names. In the code below, the first column (name1) holds the most frequent spelling within each Club_no, which is probably, though not certainly, the correct one.&lt;/P&gt;
&lt;P&gt;But to answer your question, I am not sure you can avoid a manual step ...&lt;/P&gt;
&lt;P&gt;The first step is really to get to know your data better so that you can recode them.&lt;/P&gt;
&lt;P&gt;Best,&lt;/P&gt;
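&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Sort so that BY-group processing on Club_no works */
proc sort data=have;
	by Club_no;
run;

/* Count each distinct spelling of name within each Club_no */
proc freq data=have noprint;
	table name / out=have_freq (drop=percent);
	by Club_no;
run;

/* Put the most frequent spelling first within each Club_no */
proc sort data=have_freq;
	by Club_no descending count;
run;

/* One row per Club_no: name1 = most frequent spelling, name2 = next, ... */
proc transpose data=have_freq out=want (drop=_:) prefix=name;
	var name;
	by Club_no;
run;&lt;/CODE&gt;&lt;/PRE&gt;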
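&lt;P&gt;If you want to try an automatic first pass before the manual check, here is a minimal, untested sketch that starts from the have_freq data set created by the PROC FREQ step. It assumes that a name is a misspelling of a more frequent name in the same Club_no when the COMPGED cost between them is small. Note that COMPGED returns a cost, so lower values mean more similar strings and the test is "at most a cutoff", not "at least". The cutoff of 300 and the data set names pairs and recode are only illustrative; they would need tuning and checking on your data.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Pair every spelling with every spelling sharing its Club_no
   (including itself, cost 0) and keep the pairs within the cutoff.
   300 is a hypothetical cutoff, not a recommended value. */
proc sql;
	create table pairs as
	select a.Club_no,
	       a.name,
	       b.name  as candidate,
	       b.count as cand_count,
	       compged(strip(upcase(a.name)), strip(upcase(b.name))) as ged
	from have_freq as a, have_freq as b
	where a.Club_no = b.Club_no
	  and calculated ged le 300;
quit;

/* Most frequent candidate first; ties are broken by the lowest cost */
proc sort data=pairs;
	by Club_no name descending cand_count ged;
run;

/* Keep one proposed replacement per observed spelling */
data recode;
	set pairs;
	by Club_no name;
	if first.name;
	keep Club_no name candidate;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The recode data set could then be merged back onto have by Club_no and name to replace name with candidate. When every spelling occurs only once, as in your small example, each name would most likely just map to itself, so this only helps if the correct spelling is clearly more frequent than its variants. The self-join can also become large when a Club_no has many distinct spellings.&lt;/P&gt;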
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 15 Feb 2020 18:41:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625025#M184193</guid>
      <dc:creator>ed_sas_member</dc:creator>
      <dc:date>2020-02-15T18:41:46Z</dc:date>
    </item>
    <item>
      <title>Re: dealing with the worst data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625032#M184194</link>
      <description>&lt;P&gt;What if you considered the first 2 letters of every word? Would that simplify the problem?&lt;/P&gt;</description>
      <pubDate>Sat, 15 Feb 2020 19:32:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/dealing-with-the-worst-data-set/m-p/625032#M184194</guid>
      <dc:creator>PGStats</dc:creator>
      <dc:date>2020-02-15T19:32:43Z</dc:date>
    </item>
  </channel>
</rss>

