BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Phil_NZ
Barite | Level 11

Hi all SAS users, 

 

Today I read a document and see something still ambiguous to me:

1. In the paper, they said:

Problem: Data values were lost. Once the variable type is established as numeric, any values that are not strictly
numeric and that appear after the GUESSINGROWS limit are set to missing because they do not convert to valid
numeric values. This is a loss of information 

So, can you give me an example about "values that are not strictly numeric"  in this case, please?

 

2. This paragraph

Problem: Character variable should have been numeric. What should be character variables may be numeric
because character values appeared after the records covered by GUESSINGROWS

I am wondering if they mistyped, it should be "What should be numeric variables may be character because character values appeared after the records covered by GUESSINGROWS" from my point of view.

3. This paragraph

Problem: Character variable values were truncated. Character variables may not have the appropriate length and
result in truncated values since values of a greater length may appear after the records covered by
GUESSINGROWS.

When I read this paragraph, I feel strange as, why the authors only talk about the character variables. I think not only character but numeric variable also been truncated if one does not set GUESSINGROW=MAX in PROC IMPORT.

 

4. Problem: Character variable should have been numeric

 

The paragraph

If you have numeric variables in the file that happened to be blank in the file, they will become character variables.
Later, you could run into a problem for those variables (e.g., PROC MEANS analysis variables in the VAR statement
must be numeric variables). What you can do is run a check on variables that have to be numbers and convert them
into numbers.

I have an excel file, when using PROC IMPORT, even in my excel file, the blank appears in my file but the type of variable I input still show as numeric. I really concern about that part because it is what is inside my dataset. I did not fully understand what the authors really want to imply.

 

Many thanks, warm regards, and have a fruitful week,

Phil.

 

 

 

 

 

 

 

 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

"So, can you give me an example about "values that are not strictly numeric" in this case, please?"

Lots of these:

ABC

123-456

A2345

Depending on order of values in a data set currency, percentages or values with commas may be treated as text. If the first rows of the data set get SAS to set an informat like  32. (or best32. or F32. all the same) then $123,456.12 will be missing as it does not fit the informat SAS has already chosen for the column.

 

"I am wondering if they mistyped, it should be "What should be numeric variables may be character because character values appeared after the records covered by GUESSINGROWS" from my point of view." not quite. The informat selected will set the variable type. If the informat selected is numeric then the "character" appearing values will be missing in the data.

If the variable was supposed to be character then the Guessingrows was too small. Key point: once the variable type is set, it is set.

 

Another issue is if you are reading multiple files. One file that has a variable of one type and is expected to be character but the next of the same type of content may not have any values of strictly character appearance. So the guess made by proc import for that file is numeric when you think it should be character. Remember, Proc IMPORT makes different guesses separately for each an every file. Which is why Proc Import is pretty much limited to prototyping in production code.

 

"When I read this paragraph, I feel strange as, why the authors only talk about the character variables. I think not only character but numeric variable also been truncated if one does not set GUESSINGROW=MAX in PROC IMPORT."

If the values in the first 20 rows are 10 characters in length then the default behavior for Proc Import is to set the length to 10. So any character variable longer than 10 will be truncated to 10 characters. You are again confusing "length" with "number of digits" for numeric variables. Have you looked at any of the data step code generated by proc import for CSV of similar files? There is data step code where you can directly read which informat was set, typically BEST32. So the only truncation that may occur is if your numeric variable has more than 32 characters (including the decimal portion). I will say that in 30+ years of data reading that I have never been presented with a numeric value of 32 characters.

 

4. Problem: Character variable should have been numeric

Typically this is caused with sparse data. If the first n rows(up to guessingrow value) are missing the value then SAS will set the variable type as character of length1. Which means the numeric value that appears in row n+1 is restricted to single character.

 

Excel and behavior is a moving target. Microsoft changes the engines used to read the data.

File formats change. (XLSX is zipped XML, XLS is binary).

Sometimes people will set properties for an entire column such as numeric, sometimes it is just "general" when you look at cell properties. In the XML text there will be tags for numeric variables. Not so much for the "general" when not occupied.

 

Moral of the story: If you are going to read 2 or more files of specific structure you likely should not trust Proc Import.

In a typical week I convert 10 to 15 Excel files to CSV so I can read them with a data step. For projects that have repeated files for text, fixed column, comma separated, tab separated, pipe separated, I have a data step to read the values.

Quite often including some code to validate expected values, i.e. 6 expected codes for Insurance.

 

I do use Proc Import with the first CSV or delimited text once to get the generated data step. Then I modify the variable names (way to often names are garbage or for variables I know I am going to use often I am not going to type "client_date_of_birth_1" but replace with "DOB") check the informats, which Proc import uses to set variable lengths, and adjust to match file documentation (any provided) or expected topics (if you get a personal name field that comes in as $8 for example make sure it is much longer like $20 which is still to short for Cartwright-Chickering and start of one of the earliest books about computer sabotage. This is the time to verify which columns are numeric and may require a currency informat. Check any date, time or datetime related (from Excel datetimes can be a headache).

View solution in original post

2 REPLIES 2
ballardw
Super User

"So, can you give me an example about "values that are not strictly numeric" in this case, please?"

Lots of these:

ABC

123-456

A2345

Depending on order of values in a data set currency, percentages or values with commas may be treated as text. If the first rows of the data set get SAS to set an informat like  32. (or best32. or F32. all the same) then $123,456.12 will be missing as it does not fit the informat SAS has already chosen for the column.

 

"I am wondering if they mistyped, it should be "What should be numeric variables may be character because character values appeared after the records covered by GUESSINGROWS" from my point of view." not quite. The informat selected will set the variable type. If the informat selected is numeric then the "character" appearing values will be missing in the data.

If the variable was supposed to be character then the Guessingrows was too small. Key point: once the variable type is set, it is set.

 

Another issue is if you are reading multiple files. One file that has a variable of one type and is expected to be character but the next of the same type of content may not have any values of strictly character appearance. So the guess made by proc import for that file is numeric when you think it should be character. Remember, Proc IMPORT makes different guesses separately for each an every file. Which is why Proc Import is pretty much limited to prototyping in production code.

 

"When I read this paragraph, I feel strange as, why the authors only talk about the character variables. I think not only character but numeric variable also been truncated if one does not set GUESSINGROW=MAX in PROC IMPORT."

If the values in the first 20 rows are 10 characters in length then the default behavior for Proc Import is to set the length to 10. So any character variable longer than 10 will be truncated to 10 characters. You are again confusing "length" with "number of digits" for numeric variables. Have you looked at any of the data step code generated by proc import for CSV of similar files? There is data step code where you can directly read which informat was set, typically BEST32. So the only truncation that may occur is if your numeric variable has more than 32 characters (including the decimal portion). I will say that in 30+ years of data reading that I have never been presented with a numeric value of 32 characters.

 

4. Problem: Character variable should have been numeric

Typically this is caused with sparse data. If the first n rows(up to guessingrow value) are missing the value then SAS will set the variable type as character of length1. Which means the numeric value that appears in row n+1 is restricted to single character.

 

Excel and behavior is a moving target. Microsoft changes the engines used to read the data.

File formats change. (XLSX is zipped XML, XLS is binary).

Sometimes people will set properties for an entire column such as numeric, sometimes it is just "general" when you look at cell properties. In the XML text there will be tags for numeric variables. Not so much for the "general" when not occupied.

 

Moral of the story: If you are going to read 2 or more files of specific structure you likely should not trust Proc Import.

In a typical week I convert 10 to 15 Excel files to CSV so I can read them with a data step. For projects that have repeated files for text, fixed column, comma separated, tab separated, pipe separated, I have a data step to read the values.

Quite often including some code to validate expected values, i.e. 6 expected codes for Insurance.

 

I do use Proc Import with the first CSV or delimited text once to get the generated data step. Then I modify the variable names (way to often names are garbage or for variables I know I am going to use often I am not going to type "client_date_of_birth_1" but replace with "DOB") check the informats, which Proc import uses to set variable lengths, and adjust to match file documentation (any provided) or expected topics (if you get a personal name field that comes in as $8 for example make sure it is much longer like $20 which is still to short for Cartwright-Chickering and start of one of the earliest books about computer sabotage. This is the time to verify which columns are numeric and may require a currency informat. Check any date, time or datetime related (from Excel datetimes can be a headache).

Phil_NZ
Barite | Level 11

Thank you @ballardw 

A very dedicated and detailed explanation!

 

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 535 views
  • 1 like
  • 2 in conversation