I have a test csv file, coded in "utf-8", when I use "proc import" on this file, the length "TimePointDescription" would be 17 or 16, depends on SAS session encoding is "utf-8" or not. If I add statement "guessingrows=1024", the different would be gone.
The test csv file and my usage of "proc import" is attached.
My SAS Version is 9.4M8 and my questions is: What causes this difference?
Thanks for any hint.
PROC IMPORT is a GUESSING procedure. By changing the value of GUESSINGROWS option you change how much data it uses to make its guesses. So if the longest value for TimePointDescription in the first 20 rows is only 16 bytes then by increasing it to 1024 rows you let it see an example where the length was greater then 16 bytes. You have to check at least 97 rows to find one with a longer string in that column.
1689 data _null_; 1690 set want; 1691 if length(TimePointDescription) > 16 then do; 1692 put _n_= TimePointDescription= ; 1693 stop; 1694 end; 1695 run; _N_=97 TimePointDescription=Week 17 - Day 113
UTF-8 is different than any single byte encoding in that it will sometimes use more than one byte to store a character. So if any of the values of TimePointDescription included a non 7-bit character then in UTF-8 that character could take from 2 to 4 bytes to be represented instead of just one. That could change the guess that PROC IMPORT makes on the length to use for the variable.
But the bigger problem with using PROC IMPORT to read that file is that it will select the wrong TYPE for two of the variables. The variables SiteID and RandomizedSubjectID should be defined as character variables. You will never want to take the MEAN() of SITEID.
It also attaches the MMDDYY format to the date variables, which will confuse the 50% of your audience that is used to DDMMYY format. Use YYMMDD or DATE to avoid confusion.
data want;
infile 'c:\downloads\test_csv.csv' dsd truncover firstobs=2 ;
length
SiteID $5
SiteName $100
RandomizedSubjectID $9
RandomizedLaterality $1
TimePointDescription $17
ProcedureName $6
StudyDate 8
WFStep $9
LateralityReceived $1
UploadDate 8
CompletedDate 8
TotalQueriesPending 8
;
informat StudyDate UploadDate CompletedDate mmddyy. ;
format StudyDate UploadDate CompletedDate yymmdd10. ;
input SiteID -- TotalQueriesPending ;
run;
Contacting tech support is recommended, this could be a feature introduced with m8.
NEVER use PROC IMPORT for a text file (at least not if you expect consistent, reliable results). Write the DATA step yourself and be done with it.
You can't expect to do work without work 😉
Either write the code yourself, or live with the guessing of PROC IMPORT and the many fixes you will have to make. Which, in the end, cost you more effort than the initial one-time coding would have taken.
In my more than two decades of professional SAS work, I've never used IMPORT at all. And I have written and maintained several hundred jobs reading external data.
BTW, with a little experience and practice, your code with just 12 variables is done in 10 minutes or less. The real time-eaters in programming are testing, deploying, and documenting.
@whymath wrote:
How to import a csv file if I don't plan to write variable names and lengths by hand? Could you please give me an example?
The code generated by Proc Import can either be recalled to the editor (SAS Display Manager) or copied from the log, pasted into the editor and cleaned up. Then all you have to do is make sure the Informat actually matches data.
If you are going to read multiple files of the same structure you likely should pad any character variables that tend to have varying length text like names, description or any free-form text so that later files have a chance of reading with out truncating values. OR hope that the document provider will have documentation that says what the maximum length of any given value can be as well as things like date appearance and such.
The change in the infile and data set names as needed with the code.
One of the truly obnoxious thing that many of the sub-optimal data sources that are common today are things like repeated column headings or multiple column headings that have common text for more than 32 characters. Proc import handles those poorly.
And just because someone provides a column heading like "this is an obnoxiously long column heading that will get truncated by proc import anyway" you don't have to use that as a variable name. Perhaps as a label but set a variable name that makes sense.
PROC IMPORT is a GUESSING procedure. By changing the value of GUESSINGROWS option you change how much data it uses to make its guesses. So if the longest value for TimePointDescription in the first 20 rows is only 16 bytes then by increasing it to 1024 rows you let it see an example where the length was greater then 16 bytes. You have to check at least 97 rows to find one with a longer string in that column.
1689 data _null_; 1690 set want; 1691 if length(TimePointDescription) > 16 then do; 1692 put _n_= TimePointDescription= ; 1693 stop; 1694 end; 1695 run; _N_=97 TimePointDescription=Week 17 - Day 113
UTF-8 is different than any single byte encoding in that it will sometimes use more than one byte to store a character. So if any of the values of TimePointDescription included a non 7-bit character then in UTF-8 that character could take from 2 to 4 bytes to be represented instead of just one. That could change the guess that PROC IMPORT makes on the length to use for the variable.
But the bigger problem with using PROC IMPORT to read that file is that it will select the wrong TYPE for two of the variables. The variables SiteID and RandomizedSubjectID should be defined as character variables. You will never want to take the MEAN() of SITEID.
It also attaches the MMDDYY format to the date variables, which will confuse the 50% of your audience that is used to DDMMYY format. Use YYMMDD or DATE to avoid confusion.
data want;
infile 'c:\downloads\test_csv.csv' dsd truncover firstobs=2 ;
length
SiteID $5
SiteName $100
RandomizedSubjectID $9
RandomizedLaterality $1
TimePointDescription $17
ProcedureName $6
StudyDate 8
WFStep $9
LateralityReceived $1
UploadDate 8
CompletedDate 8
TotalQueriesPending 8
;
informat StudyDate UploadDate CompletedDate mmddyy. ;
format StudyDate UploadDate CompletedDate yymmdd10. ;
input SiteID -- TotalQueriesPending ;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.