BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
whymath
Lapis Lazuli | Level 10

I have a test csv file, coded in "utf-8", when I use "proc import" on this file, the length "TimePointDescription" would be 17 or 16, depends on SAS session encoding is "utf-8" or not. If I add statement "guessingrows=1024", the different would be gone. 

 

The test csv file and my usage of "proc import" is attached.

utf-8.jpg

wlatin1.jpg

euc-cn.jpg

 

My SAS Version is 9.4M8 and my questions is: What causes this difference? 

Thanks for any hint.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

PROC IMPORT is a GUESSING procedure. By changing the value of GUESSINGROWS option you change how much data it uses to make its guesses.  So if the longest value for TimePointDescription in the first 20 rows is only 16 bytes then by increasing it to 1024 rows you let it see an example where the length was greater then 16 bytes.  You have to check at least 97 rows to find one with a longer string in that column.

1689  data _null_;
1690    set want;
1691    if length(TimePointDescription) > 16 then do;
1692      put _n_= TimePointDescription= ;
1693      stop;
1694    end;
1695  run;

_N_=97 TimePointDescription=Week 17 - Day 113

 

UTF-8 is different than any single byte encoding in that it will sometimes use more than one byte to store a character.  So if any of the values of TimePointDescription included a non 7-bit character then in UTF-8 that character could take from 2 to 4 bytes to be represented instead of just one.  That could change the guess that PROC IMPORT makes on the length to use for the variable.

 

But the bigger problem with using PROC IMPORT to read that file is that it will select the wrong TYPE for two of the variables.  The variables  SiteID and RandomizedSubjectID should be defined as character variables.  You will never want to take the MEAN() of SITEID.

 

It also attaches the MMDDYY format to the date variables, which will confuse the 50% of your audience that is used to DDMMYY format.  Use YYMMDD or DATE to avoid confusion.

 

data want;
  infile 'c:\downloads\test_csv.csv' dsd truncover firstobs=2 ;
  length
    SiteID $5
    SiteName $100
    RandomizedSubjectID $9 
    RandomizedLaterality $1
    TimePointDescription $17
    ProcedureName $6
    StudyDate 8
    WFStep $9 
    LateralityReceived $1 
    UploadDate 8
    CompletedDate 8
    TotalQueriesPending 8
  ;
  informat StudyDate UploadDate CompletedDate mmddyy. ;
  format StudyDate UploadDate CompletedDate yymmdd10. ;
  input SiteID -- TotalQueriesPending ;
run;

 

View solution in original post

10 REPLIES 10
andreas_lds
Jade | Level 19

Contacting tech support is recommended, this could be a feature introduced with m8.

whymath
Lapis Lazuli | Level 10
I ask a friend to reproduce it in M7 and she made it. Could you please try it? I want to collect more info before contacting tech support.
whymath
Lapis Lazuli | Level 10
How to import a csv file if I don't plan to write variable names and lengths by hand? Could you please give me an example?
Kurt_Bremser
Super User

You can't expect to do work without work 😉

Either write the code yourself, or live with the guessing of PROC IMPORT and the many fixes you will have to make. Which, in the end, cost you more effort than the initial one-time coding would have taken.

In my more than two decades of professional SAS work, I've never used IMPORT at all. And I have written and maintained several hundred jobs reading external data.

BTW, with a little experience and practice, your code with just 12 variables is done in 10 minutes or less. The real time-eaters in programming are testing, deploying, and documenting.

 

whymath
Lapis Lazuli | Level 10
Thanks for your kindly advice.
ballardw
Super User

@whymath wrote:
How to import a csv file if I don't plan to write variable names and lengths by hand? Could you please give me an example?

The code generated by Proc Import can either be recalled to the editor (SAS Display Manager) or copied from the log, pasted into the editor and cleaned up. Then all you have to do is make sure the Informat actually matches data.

If you are going to read multiple files of the same structure you likely should pad any character variables that tend to have varying length text like names, description or any free-form text so that later files have a chance of reading with out truncating values. OR hope that the document provider will have documentation that says what the maximum length of any given value can be as well as things like date appearance and such.

The  change in the infile and data set names as needed with the code.

 

One of the truly obnoxious thing that many of the sub-optimal data sources that are common today are things like repeated column headings or multiple column headings that have common text for more than 32 characters. Proc import handles those poorly.

And just because someone provides a column heading like "this is an obnoxiously long column heading that will get truncated by proc import anyway" you don't have to use that as a variable name. Perhaps as a label but set a variable name that makes sense.

whymath
Lapis Lazuli | Level 10
Thank you @ballardw. I know how to write a data step to import csv file now. And your suggestion about importing multiple files of the same structure is very helpful.
Tom
Super User Tom
Super User

PROC IMPORT is a GUESSING procedure. By changing the value of GUESSINGROWS option you change how much data it uses to make its guesses.  So if the longest value for TimePointDescription in the first 20 rows is only 16 bytes then by increasing it to 1024 rows you let it see an example where the length was greater then 16 bytes.  You have to check at least 97 rows to find one with a longer string in that column.

1689  data _null_;
1690    set want;
1691    if length(TimePointDescription) > 16 then do;
1692      put _n_= TimePointDescription= ;
1693      stop;
1694    end;
1695  run;

_N_=97 TimePointDescription=Week 17 - Day 113

 

UTF-8 is different than any single byte encoding in that it will sometimes use more than one byte to store a character.  So if any of the values of TimePointDescription included a non 7-bit character then in UTF-8 that character could take from 2 to 4 bytes to be represented instead of just one.  That could change the guess that PROC IMPORT makes on the length to use for the variable.

 

But the bigger problem with using PROC IMPORT to read that file is that it will select the wrong TYPE for two of the variables.  The variables  SiteID and RandomizedSubjectID should be defined as character variables.  You will never want to take the MEAN() of SITEID.

 

It also attaches the MMDDYY format to the date variables, which will confuse the 50% of your audience that is used to DDMMYY format.  Use YYMMDD or DATE to avoid confusion.

 

data want;
  infile 'c:\downloads\test_csv.csv' dsd truncover firstobs=2 ;
  length
    SiteID $5
    SiteName $100
    RandomizedSubjectID $9 
    RandomizedLaterality $1
    TimePointDescription $17
    ProcedureName $6
    StudyDate 8
    WFStep $9 
    LateralityReceived $1 
    UploadDate 8
    CompletedDate 8
    TotalQueriesPending 8
  ;
  informat StudyDate UploadDate CompletedDate mmddyy. ;
  format StudyDate UploadDate CompletedDate yymmdd10. ;
  input SiteID -- TotalQueriesPending ;
run;

 

whymath
Lapis Lazuli | Level 10
@Tom, I think your explaination make sense. I'll check it with Tech Support. Your indication on proc import selecting wrong type just woke me up, I change to the data step way right now. Please accept my heartfelt thanks.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 1373 views
  • 3 likes
  • 5 in conversation