<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Removing duplicate names from a text field in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921619#M362933</link>
    <description>&lt;P&gt;Using 'proc sort nodupkey' on a single text field containing names is not removing duplicates.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is happening even after compressing the field to remove blanks, punctuation, diacritical marks, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In other words, printing and visually examining the text field does not reveal any obvious differences in the duplicates such as minor spelling differences, case sensitivity, etc..&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Confirmation of this was made by passing the resulting text field into Excel and then reading that Excel file back into SAS.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This extra step produces a text field from which all duplicate names can be stripped using 'proc sort nodupkey'.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;2952 data test;infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2952! missover;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2953 length DIRECTOR $50.;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2954 input director;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2955 run;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: The infile 'c:\data\analyses\data\directors.txt' is:&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Filename=c:\data\analyses\data\directors.txt,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;RECFM=V,LRECL=1500,File Size (bytes)=751665,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Last Modified=24Mar2024:12:53:50,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Create Time=24Mar2024:12:53:50&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: 46774 records were read from the infile 'c:\data\analyses\data\directors.txt'.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;The minimum record length was 1.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;The maximum record length was 40.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: The data set WORK.TEST has 46774 observations and 1 variables.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: DATA statement used (Total process time):&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;real time 0.04 seconds&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;cpu time 0.00 seconds&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;EM&gt;2956 proc sort nodupkey;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2957 by director;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2958 run;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: There were 46774 observations read from the data set WORK.TEST.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: 1541 observations with duplicate key values were deleted.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: The data set WORK.TEST has 45233 observations and 1 variables.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: PROCEDURE SORT used (Total process time):&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;real time 0.01 seconds&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;cpu time 0.00 seconds&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Given that, the problem must be in how the underlying information was stored, e.g., hex vs ASCII vs EBCDIC, issues which are not a spike for me.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Obviously, I don't want to have to pass files back and forth between SAS and Excel.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My question is,&amp;nbsp;How do I fix this text field in SAS?&lt;/P&gt;</description>
    <pubDate>Sun, 24 Mar 2024 06:22:27 GMT</pubDate>
    <dc:creator>xtc283x</dc:creator>
    <dc:date>2024-03-24T06:22:27Z</dc:date>
    <item>
      <title>Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921619#M362933</link>
      <description>&lt;P&gt;Using 'proc sort nodupkey' on a single text field containing names is not removing duplicates.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is happening even after compressing the field to remove blanks, punctuation, diacritical marks, etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In other words, printing and visually examining the text field does not reveal any obvious differences in the duplicates such as minor spelling differences, case sensitivity, etc..&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Confirmation of this was made by passing the resulting text field into Excel and then reading that Excel file back into SAS.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This extra step produces a text field from which all duplicate names can be stripped using 'proc sort nodupkey'.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;2952 data test;infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2952! missover;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2953 length DIRECTOR $50.;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2954 input director;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2955 run;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: The infile 'c:\data\analyses\data\directors.txt' is:&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Filename=c:\data\analyses\data\directors.txt,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;RECFM=V,LRECL=1500,File Size (bytes)=751665,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Last Modified=24Mar2024:12:53:50,&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;Create Time=24Mar2024:12:53:50&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: 46774 records were read from the infile 'c:\data\analyses\data\directors.txt'.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;The minimum record length was 1.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;The maximum record length was 40.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: The data set WORK.TEST has 46774 observations and 1 variables.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: DATA statement used (Total process time):&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;real time 0.04 seconds&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;cpu time 0.00 seconds&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;EM&gt;2956 proc sort nodupkey;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2957 by director;&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;2958 run;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;EM&gt;NOTE: There were 46774 observations read from the data set WORK.TEST.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: 1541 observations with duplicate key values were deleted.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: The data set WORK.TEST has 45233 observations and 1 variables.&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;NOTE: PROCEDURE SORT used (Total process time):&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;real time 0.01 seconds&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;cpu time 0.00 seconds&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Given that, the problem must be in how the underlying information was stored, e.g., hex vs ASCII vs EBCDIC, issues which are not a spike for me.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Obviously, I don't want to have to pass files back and forth between SAS and Excel.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;My question is,&amp;nbsp;How do I fix this text field in SAS?&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 06:22:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921619#M362933</guid>
      <dc:creator>xtc283x</dc:creator>
      <dc:date>2024-03-24T06:22:27Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921620#M362934</link>
      <description>&lt;P&gt;Add this to the initial DATA step which reads the data into SAS:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;director_h = put(director,$hex100.);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;This will allow you to find hidden bytes which cause the deduplication to fail.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 07:13:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921620#M362934</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-03-24T07:13:06Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921622#M362935</link>
      <description>&lt;P&gt;May-be try kcompress() to cover multibyte and remove any non-printable character. Or use the N modifier if you know it should only be&amp;nbsp;&lt;EM&gt;digits, the underscore character, and English letters&lt;/EM&gt;.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set have;
  if 0 then dedup_var=source_var;
  dedup_var=upcase(kcompress(source_var,' ','kw'));
/*  dedup_var=upcase(kcompress(source_var,' ','kn'));*/

run;

proc sort data=want nodupkey;
  by dedup_var;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;And if this still not helps then you need to look at the hex values as already proposed.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 09:18:07 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921622#M362935</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2024-03-24T09:18:07Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921625#M362937</link>
      <description>&lt;P&gt;Are you sure you removed leading blanks?&lt;/P&gt;
&lt;P&gt;Examining in printed output typically hides the leading blanks.&lt;/P&gt;
&lt;P&gt;Example:&lt;/P&gt;
&lt;PRE&gt;data have;
    name= '   John';output;
    name='  John';output;
    name=' John';output;
    name='John';output;
run;&lt;/PRE&gt;
&lt;P&gt;If you use proc print ( tabulate or report) to display that then ODS results will suppress the leading blanks and appear as if the same.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 11:22:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921625#M362937</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2024-03-24T11:22:00Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921628#M362939</link>
      <description>Ballardw-&lt;BR /&gt;Thank you for your response.&lt;BR /&gt;&lt;BR /&gt;The strip() function was the answer.&lt;BR /&gt;Best regards,xtc283x&lt;BR /&gt;</description>
      <pubDate>Sun, 24 Mar 2024 15:09:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921628#M362939</guid>
      <dc:creator>xtc283x</dc:creator>
      <dc:date>2024-03-24T15:09:06Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921631#M362942</link>
      <description>&lt;P&gt;Why did you read in the director variable in that strange way?&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data test;
  infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd missover;
  length DIRECTOR $50.;
  input director;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Are there other values on the lines of the file directors.txt after the first TAB?&amp;nbsp; Why didn't you read in those other values into other variables?&amp;nbsp; Why use the mainly supplanted MISSOVER option instead of the TRUNCOVER option? Are their short values at the end of the lines you want INPUT to ignore?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If each line has one directory name then just use normal INPUT with $ informat will automatically remove leading spaces on the lines.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data test;
  infile 'c:\data\analyses\data\directors.txt' truncover firstobs=2; 
  input DIRECTOR $50.;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;And if there are tab characters in the line and you want everything before the first TAB then add this after the INPUT statement.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;  if index(director,'09'x) then director=scan(director,1,'09'x);&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Or perhaps there are just random tab characters on the line?&amp;nbsp; Perhaps someone created the text file in a editor that replaces multiple spaces with tabs in an old fashioned attempt to save a few bytes of storage?&amp;nbsp; In that case you could try adding the EXPANDTABS option to the INFILE statement and SAS will replace the tabs with the number of spaces needed to move to the text tab stop.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 15:49:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921631#M362942</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-03-24T15:49:52Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921632#M362943</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/36940"&gt;@xtc283x&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Ballardw-&lt;BR /&gt;Thank you for your response.&lt;BR /&gt;&lt;BR /&gt;The strip() function was the answer.&lt;BR /&gt;Best regards,xtc283x&lt;BR /&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Note that you don't need to remove the trailing spaces.&amp;nbsp; &amp;nbsp;SAS string comparisons will ignore those.&amp;nbsp; And they will just come back again anyway when you assign the value to a variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The LEFT() function will remove the leading spaces (it moves them to the end where they do no harm).&lt;/P&gt;</description>
      <pubDate>Sun, 24 Mar 2024 15:56:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921632#M362943</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-03-24T15:56:03Z</dc:date>
    </item>
    <item>
      <title>Re: Removing duplicate names from a text field</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921648#M362949</link>
      <description>&lt;P&gt;That input code was copied and pasted from the full programming code and is an approach developed over years of coding. It works for me.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your comments.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Mar 2024 00:07:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Removing-duplicate-names-from-a-text-field/m-p/921648#M362949</guid>
      <dc:creator>xtc283x</dc:creator>
      <dc:date>2024-03-25T00:07:02Z</dc:date>
    </item>
  </channel>
</rss>

