Using 'proc sort nodupkey' on a single text field containing names is not removing all of the duplicates. This happens even after compressing the field to remove blanks, punctuation, diacritical marks, etc. In other words, printing and visually examining the text field does not reveal any obvious differences between the duplicates, such as minor spelling differences or differences in case. I confirmed this by passing the resulting text field into Excel and then reading that Excel file back into SAS. This extra step produces a text field from which all duplicate names can be stripped using 'proc sort nodupkey'.

2952  data test;infile 'c:\data\analyses\data\directors.txt' lrecl=1500 firstobs=2 dlm='09'x dsd
2952! missover;
2953  length DIRECTOR $50.;
2954  input director;
2955  run;

NOTE: The infile 'c:\data\analyses\data\directors.txt' is:
      Filename=c:\data\analyses\data\directors.txt,
      RECFM=V,LRECL=1500,File Size (bytes)=751665,
      Last Modified=24Mar2024:12:53:50,
      Create Time=24Mar2024:12:53:50

NOTE: 46774 records were read from the infile 'c:\data\analyses\data\directors.txt'.
      The minimum record length was 1.
      The maximum record length was 40.
NOTE: The data set WORK.TEST has 46774 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds

2956  proc sort nodupkey;
2957  by director;
2958  run;

NOTE: There were 46774 observations read from the data set WORK.TEST.
NOTE: 1541 observations with duplicate key values were deleted.
NOTE: The data set WORK.TEST has 45233 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

Given that, the problem is presumably in how the underlying information is stored, e.g., hex vs ASCII vs EBCDIC, issues which are not a strength of mine. Obviously, I don't want to have to pass files back and forth between SAS and Excel. My question is: how do I fix this text field in SAS?
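One way to test the "how it is stored" hypothesis is to print each name next to its hex representation, so that characters that look identical on screen can be compared byte by byte. The following is only a sketch, assuming the WORK.TEST data set and DIRECTOR variable from the log above; the output data set name CLEAN is hypothetical, and the cleanup step only helps if the hidden differences turn out to be non-printable characters (for example a tab '09'x or carriage return '0D'x) or stray leading or repeated blanks.

```sas
/* Diagnostic: write the first 20 names and their hex values to the log.  */
/* DIRECTOR is $50, so $HEX100. shows all 50 bytes (2 hex digits each).   */
data _null_;
  set test (obs=20);
  put director= / director= $hex100.;
run;

/* Possible cleanup (an assumption, not a confirmed fix):                 */
/*   compress(..., , 'kw') keeps only printable ("writable") characters,  */
/*   left() removes leading blanks, compbl() collapses repeated blanks.   */
data clean;   /* CLEAN is a hypothetical data set name */
  set test;
  director = compbl(left(compress(director, , 'kw')));
run;

proc sort data=clean nodupkey;
  by director;
run;
```

If the hex output shows that two visually identical names differ in some other way (for instance a byte such as 'A0'x that the session encoding treats as printable), the cleanup step would need to target that specific byte instead.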