topic Re: Program removes duplicates that are no duplicates in Statistical Procedures

Program removes duplicates that are no duplicates

Hendrik — Wed, 11 Jul 2018 15:07:01 GMT

Hi,

I have a program that reads data which has to be processed, the data consists of two columns (Jobs, length) with Jobs rows as follows = (Job1, Job2, ...., Job250).

SAS seems to falsely identifying duplicate rows (i.e. Job100 is a duplicate of Job 10). The following screenshot shows the line of code and duplicate message:

How do I specify the way duplicates are identified and where do I embed this in my code?

Thank you in advance!

Re: Program removes duplicates that are no duplicates

SuryaKiran — Wed, 11 Jul 2018 16:11:57 GMT

Are you sure your values in the dataset are not truncated to length of 5.

data test;
length var $5;
input var $;
datalines ;
job10
job1000
;
run;

In this example both the values will be "job10", since length is 5 SAS reads only first 5 characters.

Re: Program removes duplicates that are no duplicates

Hendrik — Wed, 11 Jul 2018 16:49:10 GMT

Thanks Suryakiran,

I first thought, that the variables are shortened to a default length of 5, but I changed the names and SAS still removes quite a lot. Here is the first part of my code until the error occurs:

data myContent;

set work.falk_u250test;

run;

proc print data=myContent(keep= Jobs length);

run;

proc optmodel;

/* read the product and size data */

set <str> PRODUCTS;

num length {PRODUCTS};

read data myContent into PRODUCTS=[Jobs] length;

[...]

quit;

Could it be some default setting of the 'optmodel'?

Re: Program removes duplicates that are no duplicates

SuryaKiran — Wed, 11 Jul 2018 18:22:05 GMT

How is your "work.falk_u250test" dataset created? Did you check the values in it.

proc print data=work.falk_u250test;

run;

If your dataset is very large that you can't print then pull only the records you want to validate.

proc print data=work.falk_u250test;

where job in ('job10','job1000');

run;

I'm not much aware of this procedure.

Re: Program removes duplicates that are no duplicates

Hendrik — Wed, 11 Jul 2018 21:55:04 GMT

The Jobs column of the dataset is created in Excel with an autofill (drag down) option. So far, the only way to avoid the duplicate deletion is to insert random names for the jobs and work with that.

As I thought that maybe SAS automatically reduces the number of used signs to '5', I also tried to manually set it to '6' during the data import:

data myContent;

length var $6;

set work.falk_u250test;

run;

However, it does not work either

Re: Program removes duplicates that are no duplicates

Tom — Wed, 11 Jul 2018 23:25:52 GMT

You can use PROC SORT to check if there are duplicates.

proc sort data=work.falk_u250test out=sorted nodupkey ;
  by var ;
run;

Also make sure that you have not accidentally assigned a short FORMAT to the variable. If you have a variable with a length of $10 but have assigned a format of only $5. to it then many procedures (like PROC FREQ) will use the FORMATTED value instead of the actual value. This can result in duplicates.

Re: Program removes duplicates that are no duplicates

SuryaKiran — Wed, 11 Jul 2018 23:50:10 GMT

Hm, It would be better if you share your whole code on how your importing the data and what your doing next. No one here knows your data more than you, we only can give suggestion on what we understand.

Since you mentioned you are importing the data from Excel, you first need to understand that PROC IMPORT works on guessing data. It scans first 20 rows and determine the data types before importing. Your Job variable might have length of 5 for the first 20 rows (ie: values Job1-job20 ) so SAS imports data with length 5 for Job variable. Values where length>5 will be truncated during the import process and assigning greater length later will not work because already data was truncated.

Solution:

1) Try EXCEL LIBNAME (libname xl EXCEL 'D:\SASUniversityEdition\myfolders\Import_Data.xlsx'; ) and read the excel file as dataset.

2) Save your Excel file as CSV and then import using DATA STEP infile, this way you have more control over data types.

3) You can also use GUESSINGROWS option in PROC IMPORT if your data is in the form of raw data (csv,txt..)

4) Make sure your excel file is sorted in the order to have max length values in the first 20 rows.

5) SAS 9.4, and in the command line – in the upper left corner of the screen – type regedit. Then navigate to Products-->BASE-->EFI-->GuessingRows and set the value to the number of rows in the CSV or Excel file you want SAS to scan to find the longest value.

Re: Program removes duplicates that are no duplicates

Hendrik — Thu, 12 Jul 2018 12:46:03 GMT

Thank you!

I did not know that SAS looks at the first 20 values to estimate the variables. In my case, the variable names were 'Job' + # which resulted in 6 instead of 5 characters for values from 100. If the variable length was set to $5 before, the last character was automatically cut of and hence, the duplicates were found and eliminated.