BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
xtc283x
Quartz | Level 8

Using 'proc sort nodupkey' on a single text field containing names is not removing duplicates.

 

This is happening even after compressing the field to remove blanks, punctuation, diacritical marks, etc.

 

In other words, printing and visually examining the text field does not reveal any obvious differences in the duplicates such as minor spelling differences, case sensitivity, etc..

 

Confirmation of this was made by passing the resulting text field into Excel and then reading that Excel file back into SAS.

 

This extra step produces a text field from which all duplicate names can be stripped using 'proc sort nodupkey'.

 

2952 data test;infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd
2952! missover;
2953 length DIRECTOR $50.;
2954 input director;
2955 run;

NOTE: The infile 'c:\data\analyses\data\directors.txt' is:
Filename=c:\data\analyses\data\directors.txt,
RECFM=V,LRECL=1500,File Size (bytes)=751665,
Last Modified=24Mar2024:12:53:50,
Create Time=24Mar2024:12:53:50

NOTE: 46774 records were read from the infile 'c:\data\analyses\data\directors.txt'.
The minimum record length was 1.
The maximum record length was 40.
NOTE: The data set WORK.TEST has 46774 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds


2956 proc sort nodupkey;
2957 by director;
2958 run;

NOTE: There were 46774 observations read from the data set WORK.TEST.
NOTE: 1541 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 45233 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds

 

Given that, the problem must be in how the underlying information was stored, e.g., hex vs ASCII vs EBCDIC, issues which are not a spike for me.

 

Obviously, I don't want to have to pass files back and forth between SAS and Excel.

 

My question is, How do I fix this text field in SAS?

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

Are you sure you removed leading blanks?

Examining in printed output typically hides the leading blanks.

Example:

data have;
    name= '   John';output;
    name='  John';output;
    name=' John';output;
    name='John';output;
run;

If you use proc print ( tabulate or report) to display that then ODS results will suppress the leading blanks and appear as if the same.

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

Add this to the initial DATA step which reads the data into SAS:

director_h = put(director,$hex100.);

This will allow you to find hidden bytes which cause the deduplication to fail.

Patrick
Opal | Level 21

May-be try kcompress() to cover multibyte and remove any non-printable character. Or use the N modifier if you know it should only be digits, the underscore character, and English letters.

data want;
  set have;
  if 0 then dedup_var=source_var;
  dedup_var=upcase(kcompress(source_var,' ','kw'));
/*  dedup_var=upcase(kcompress(source_var,' ','kn'));*/

run;

proc sort data=want nodupkey;
  by dedup_var;
run;

And if this still not helps then you need to look at the hex values as already proposed.

 

ballardw
Super User

Are you sure you removed leading blanks?

Examining in printed output typically hides the leading blanks.

Example:

data have;
    name= '   John';output;
    name='  John';output;
    name=' John';output;
    name='John';output;
run;

If you use proc print ( tabulate or report) to display that then ODS results will suppress the leading blanks and appear as if the same.

xtc283x
Quartz | Level 8
Ballardw-
Thank you for your response.

The strip() function was the answer.
Best regards,xtc283x
Tom
Super User Tom
Super User

@xtc283x wrote:
Ballardw-
Thank you for your response.

The strip() function was the answer.
Best regards,xtc283x

Note that you don't need to remove the trailing spaces.   SAS string comparisons will ignore those.  And they will just come back again anyway when you assign the value to a variable.

 

The LEFT() function will remove the leading spaces (it moves them to the end where they do no harm).

Tom
Super User Tom
Super User

Why did you read in the director variable in that strange way?

data test;
  infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd missover;
  length DIRECTOR $50.;
  input director;
run;

Are there other values on the lines of the file directors.txt after the first TAB?  Why didn't you read in those other values into other variables?  Why use the mainly supplanted MISSOVER option instead of the TRUNCOVER option? Are their short values at the end of the lines you want INPUT to ignore?

 

If each line has one directory name then just use normal INPUT with $ informat will automatically remove leading spaces on the lines.

data test;
  infile 'c:\data\analyses\data\directors.txt' truncover firstobs=2; 
  input DIRECTOR $50.;
run;

And if there are tab characters in the line and you want everything before the first TAB then add this after the INPUT statement.

  if index(director,'09'x) then director=scan(director,1,'09'x);

Or perhaps there are just random tab characters on the line?  Perhaps someone created the text file in a editor that replaces multiple spaces with tabs in an old fashioned attempt to save a few bytes of storage?  In that case you could try adding the EXPANDTABS option to the INFILE statement and SAS will replace the tabs with the number of spaces needed to move to the text tab stop.

xtc283x
Quartz | Level 8

That input code was copied and pasted from the full programming code and is an approach developed over years of coding. It works for me.

 

Thanks for your comments.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 520 views
  • 5 likes
  • 5 in conversation