topic Removing duplicate names from a text field in SAS Programming

Removing duplicate names from a text field

xtc283x — Sun, 24 Mar 2024 06:22:27 GMT

Using 'proc sort nodupkey' on a single text field containing names is not removing duplicates.

This is happening even after compressing the field to remove blanks, punctuation, diacritical marks, etc.

In other words, printing and visually examining the text field does not reveal any obvious differences in the duplicates such as minor spelling differences, case sensitivity, etc..

Confirmation of this was made by passing the resulting text field into Excel and then reading that Excel file back into SAS.

This extra step produces a text field from which all duplicate names can be stripped using 'proc sort nodupkey'.

2952 data test;infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd
2952! missover;
2953 length DIRECTOR $50.;
2954 input director;
2955 run;

NOTE: The infile 'c:\data\analyses\data\directors.txt' is:
Filename=c:\data\analyses\data\directors.txt,
RECFM=V,LRECL=1500,File Size (bytes)=751665,
Last Modified=24Mar2024:12:53:50,
Create Time=24Mar2024:12:53:50

NOTE: 46774 records were read from the infile 'c:\data\analyses\data\directors.txt'.
The minimum record length was 1.
The maximum record length was 40.
NOTE: The data set WORK.TEST has 46774 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds

2956 proc sort nodupkey;
2957 by director;
2958 run;

NOTE: There were 46774 observations read from the data set WORK.TEST.
NOTE: 1541 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 45233 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

Given that, the problem must be in how the underlying information was stored, e.g., hex vs ASCII vs EBCDIC, issues which are not a spike for me.

Obviously, I don't want to have to pass files back and forth between SAS and Excel.

My question is, How do I fix this text field in SAS?

Re: Removing duplicate names from a text field

Kurt_Bremser — Sun, 24 Mar 2024 07:13:06 GMT

Add this to the initial DATA step which reads the data into SAS:

director_h = put(director,$hex100.);

This will allow you to find hidden bytes which cause the deduplication to fail.

Re: Removing duplicate names from a text field

Patrick — Sun, 24 Mar 2024 09:18:07 GMT

May-be try kcompress() to cover multibyte and remove any non-printable character. Or use the N modifier if you know it should only be digits, the underscore character, and English letters.

data want;
  set have;
  if 0 then dedup_var=source_var;
  dedup_var=upcase(kcompress(source_var,' ','kw'));
/*  dedup_var=upcase(kcompress(source_var,' ','kn'));*/

run;

proc sort data=want nodupkey;
  by dedup_var;
run;

And if this still not helps then you need to look at the hex values as already proposed.

Re: Removing duplicate names from a text field

ballardw — Sun, 24 Mar 2024 11:22:00 GMT

Are you sure you removed leading blanks?

Examining in printed output typically hides the leading blanks.

Example:

data have;
    name= '   John';output;
    name='  John';output;
    name=' John';output;
    name='John';output;
run;

If you use proc print ( tabulate or report) to display that then ODS results will suppress the leading blanks and appear as if the same.

Re: Removing duplicate names from a text field

xtc283x — Sun, 24 Mar 2024 15:09:06 GMT

Ballardw-
Thank you for your response.

The strip() function was the answer.
Best regards,xtc283x

Re: Removing duplicate names from a text field

Tom — Sun, 24 Mar 2024 15:49:52 GMT

Why did you read in the director variable in that strange way?

data test;
  infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd missover;
  length DIRECTOR $50.;
  input director;
run;

Are there other values on the lines of the file directors.txt after the first TAB? Why didn't you read in those other values into other variables? Why use the mainly supplanted MISSOVER option instead of the TRUNCOVER option? Are their short values at the end of the lines you want INPUT to ignore?

If each line has one directory name then just use normal INPUT with $ informat will automatically remove leading spaces on the lines.

data test;
  infile 'c:\data\analyses\data\directors.txt' truncover firstobs=2; 
  input DIRECTOR $50.;
run;

And if there are tab characters in the line and you want everything before the first TAB then add this after the INPUT statement.

  if index(director,'09'x) then director=scan(director,1,'09'x);

Or perhaps there are just random tab characters on the line? Perhaps someone created the text file in a editor that replaces multiple spaces with tabs in an old fashioned attempt to save a few bytes of storage? In that case you could try adding the EXPANDTABS option to the INFILE statement and SAS will replace the tabs with the number of spaces needed to move to the text tab stop.

Re: Removing duplicate names from a text field

Tom — Sun, 24 Mar 2024 15:56:03 GMT

@xtc283x wrote:
Ballardw-
Thank you for your response.

The strip() function was the answer.
Best regards,xtc283x

Note that you don't need to remove the trailing spaces. SAS string comparisons will ignore those. And they will just come back again anyway when you assign the value to a variable.

The LEFT() function will remove the leading spaces (it moves them to the end where they do no harm).

Re: Removing duplicate names from a text field

xtc283x — Mon, 25 Mar 2024 00:07:02 GMT

That input code was copied and pasted from the full programming code and is an approach developed over years of coding. It works for me.

Thanks for your comments.