BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hendrik
Calcite | Level 5

Hi,

 

I have a program that reads data which has to be processed, the data consists of two columns (Jobs, length) with Jobs rows as follows = (Job1, Job2, ...., Job250).

 

SAS seems to falsely identifying duplicate rows (i.e. Job100 is a duplicate of Job 10). The following screenshot shows the line of code and duplicate message:

 

Capture.JPG

 

How do I specify the way duplicates are identified and where do I embed this in my code?

 

Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
SuryaKiran
Meteorite | Level 14

Hm, It would be better if you share your whole code on how your importing the data  and what your doing next. No one here knows your data more than you, we only can give suggestion on what we understand. 

 

Since you mentioned you are importing the data from Excel, you first need to understand that PROC IMPORT works on guessing data. It scans first 20 rows and determine the data types before importing. Your Job variable might have length of 5 for the first 20 rows (ie: values Job1-job20 ) so SAS imports data with length 5 for Job variable. Values where length>5 will be truncated during the import process and assigning greater length later will not work because already data was truncated.

 

Solution:

1) Try EXCEL LIBNAME (libname xl EXCEL 'D:\SASUniversityEdition\myfolders\Import_Data.xlsx'; ) and read the excel file as dataset.

2) Save your Excel file as CSV and then import using DATA STEP infile, this way you have more control over data types.

3) You can also use GUESSINGROWS option in PROC IMPORT if your data is in the form of raw data (csv,txt..)

4) Make sure your excel file is sorted in the order to have max length values in the first 20 rows.

5) SAS 9.4, and in the command line – in the upper left corner of the screen – type regedit. Then navigate to Products-->BASE-->EFI-->GuessingRows and set the value to the number of rows in the CSV or Excel file you want SAS to scan to find the longest value. 

image.png

Thanks,
Suryakiran

View solution in original post

7 REPLIES 7
SuryaKiran
Meteorite | Level 14

Are you sure your values in the dataset are not truncated to length of 5.

data test;
length var $5;
input var $;
datalines ;
job10
job1000
;
run;

In this example both the values will be "job10", since length is 5 SAS reads only first 5 characters.

Thanks,
Suryakiran
Hendrik
Calcite | Level 5

Thanks Suryakiran,

 

I first thought, that the variables are shortened to a default length of 5, but I changed the names and SAS still removes quite a lot. Here is the first part of my code until the error occurs:

 

data myContent;

set work.falk_u250test;

run;

 

proc print data=myContent(keep= Jobs length);

run;

proc optmodel;

/* read the product and size data */

set <str> PRODUCTS;

num length {PRODUCTS};

read data myContent into PRODUCTS=[Jobs] length;

 

[...]

 

quit;

 

 

Could it be some default setting of the 'optmodel'?

SuryaKiran
Meteorite | Level 14

How is your "work.falk_u250test" dataset created? Did you check the values in it.

 

proc print data=work.falk_u250test;

run;

If your dataset is very large that you can't print then pull only the records you want to validate.

 

proc print data=work.falk_u250test;

where job in ('job10','job1000');

run;

 

I'm not much aware of this procedure. 

 

Thanks,
Suryakiran
Hendrik
Calcite | Level 5

The Jobs column of the dataset is created in Excel with an autofill (drag down) option. So far, the only way to avoid the duplicate deletion is to insert random names for the jobs and work with that.

 

As I thought that maybe SAS automatically reduces the number of used signs to '5', I also tried to manually set it to '6' during the data import:

 

data myContent;

length var $6;

set work.falk_u250test;

run;

 

However, it does not work either

Tom
Super User Tom
Super User

You can use PROC SORT to check if there are duplicates.

proc sort data=work.falk_u250test out=sorted nodupkey ;
  by var ;
run;

Also make sure that you have not accidentally assigned a short FORMAT to the variable.  If you have a variable with a length of $10 but have assigned a format of only $5. to it then many procedures (like PROC FREQ) will use the FORMATTED value instead of the actual value.  This can result in duplicates.

SuryaKiran
Meteorite | Level 14

Hm, It would be better if you share your whole code on how your importing the data  and what your doing next. No one here knows your data more than you, we only can give suggestion on what we understand. 

 

Since you mentioned you are importing the data from Excel, you first need to understand that PROC IMPORT works on guessing data. It scans first 20 rows and determine the data types before importing. Your Job variable might have length of 5 for the first 20 rows (ie: values Job1-job20 ) so SAS imports data with length 5 for Job variable. Values where length>5 will be truncated during the import process and assigning greater length later will not work because already data was truncated.

 

Solution:

1) Try EXCEL LIBNAME (libname xl EXCEL 'D:\SASUniversityEdition\myfolders\Import_Data.xlsx'; ) and read the excel file as dataset.

2) Save your Excel file as CSV and then import using DATA STEP infile, this way you have more control over data types.

3) You can also use GUESSINGROWS option in PROC IMPORT if your data is in the form of raw data (csv,txt..)

4) Make sure your excel file is sorted in the order to have max length values in the first 20 rows.

5) SAS 9.4, and in the command line – in the upper left corner of the screen – type regedit. Then navigate to Products-->BASE-->EFI-->GuessingRows and set the value to the number of rows in the CSV or Excel file you want SAS to scan to find the longest value. 

image.png

Thanks,
Suryakiran
Hendrik
Calcite | Level 5

Thank you!

 

I did not know that SAS looks at the first 20 values to estimate the variables. In my case, the variable names were 'Job' + # which resulted in 6 instead of 5 characters for values from 100. If the variable length was set to $5 before, the last character was automatically cut of and hence, the duplicates were found and eliminated.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 2149 views
  • 0 likes
  • 3 in conversation