Hi,
I was trying to import a csv file. A sample data is attached. The code I used is
data want; infile "&path\sample.csv" delimiter = ',' missover firstobs = 2 DSD lrecl = 32767; format IP $15.; format date yymmdd10.; format time hhmm.; format zone $5.; format cik $7.; format accession $20.; format extension $30.; format code $3.; format size 10.; input IP $ date time zone $ cik $ accession $ extension $ code $ size ; run;
In the log, I got a lot of "NOTE: Invalid data for time in line 21 28-35."
From the imported dataset, there are two issues I can tell. The first is that, date and time variables are always missing. The second is, for variable cik, some takes values like "95521.0" or "873860.", while my intention was to keep it purely as a character.
A more general question is, how to perfectly import such csv files with data step? Thanks!
The main thing is that if you want to read character strings into date and time variables then you need to tell SAS what INFORMAT to use when reading it.
Also don't use the FORMAT statement as it it is intended to DEFINE the variable's type and length. SAS will use the format specification you attach to a variable to make a GUESS about how to the define the variable, if the first place you reference the variable name is in the FORMAT statement, but that is not the purpose of a FORMAT statement. The purpose of the FORMAT statement is tell SAS how to display the value of the variable. So for DATE and TIME (and DATETIME) variables this is very important as the raw number of days or number of seconds is not that easy for humans to understand. But for most variables there is no need to attach either a FORMAT or an INFORMAT. SAS already knows how to read and write numbers and strings.
data want;
infile "&path\sample.csv" dsd firstobs=2 truncover ;
length IP $15 date 8 time 8 zone $5 cik $7 accession $20 extension $30 code $3 size 8;
format date yymmdd10. time time.;
informat date mmddyy. time time.;
input IP -- size ;
run;
Note that SAS will normally use BEST12. to display numbers when no explicit format is specified so there is not much value in attaching the F10. format to the SIZE variable. But you might want to consider attaching the COMMA13. format to it if you have large numbers to make them easier for humans to read.
Your input statement requires bit of work.
You are trying to code a simple list input just by specifying datatypes of the variables and the type of delimiter. What you need is a modified list input with : (colon) followed by the informat to exactly read nonstandard data such as date and time. Alternatively, if modified list input is confusing , specify informats in an informat statement before input statement.
hmm I just noticed your format statement. Make all those as informats in an informat statement and retype the format statement as is after input statement. HTH
data want; infile "&path\sample.csv" delimiter = ',' missover firstobs = 2 DSD lrecl = 32767;
informat IP $15.
date yymmdd10.
time hhmm.
zone $5.
cik $7.
accession $20.
extension $30.
code $3.
size 10.; format IP $15.; format date yymmdd10.; format time hhmm.; format zone $5.; format cik $7.; format accession $20.; format extension $30.; format code $3.; format size 10.; input IP $ date time zone $ cik $ accession $ extension $ code $ size ; run;
just do a small and simple manipulation like:
cik-=ifc(index(cik,'.')>0, substr(cik,1,index(cik,'.')-1), cik);
If it’s being read in like that it’s very likely those characters are in your text file. Verify your source data.
In just testing your sample.csv file, I saw that the wrong informats were suggested for both date and time. I doubt if the following will correct your other problem, but it should correct date and time. As for your problem with the cik variable, can you post a new sample.csv file that contains at least one record that results in the problem you described?
data want; infile "&path./sample.csv" delimiter = ',' missover firstobs = 2 DSD lrecl = 32767; informat IP $15. date mmddyy10. time time8. zone $5. cik $7. accession $20. extension $30. code $3. size 10.; format IP $15.; format date yymmdd10.; format time time8.; format zone $5.; format cik $7.; format accession $20.; format extension $30.; format code $3.; format size 10.; input IP date time zone cik accession extension code size ; run;
Art, CEO, AnalystFinder.com
@xyxu wrote:
I checked the csv file and it is indeed that way. For example, there is an observation with "cik = 7789" in the csv file, while the corresponding observation in the imported dataset has "cik = 7789.0"
Did you look at your file with a spreadsheet such as Excel or a text editor such as NotePad, WordPad, even the SAS editor, or similar? Excel will by default not show the .0 because it is being "helpful" as in "you don't need to see no stinking decimals with 0".
NEVER trust a spreadsheet to display a file created as CSV to actually show the contents. There are lots of appearance reformatting things that can happen with values that just happen to look numeric but should be character.
@xyxu wrote:
A more general question is, how to perfectly import such csv files with data step? Thanks!
1. First import the file with PROC IMPORT and set GUESSINGROWS=MAX (except if your file is millions of rows, then may use 1000000).
2. Check the log. The code in the log will be pretty darn close to what you want.
3. Copy the code from the log to an editor, modify as necessary.
4. Confirm against a record layout file and/or verification of the data via the highly tested eyeball method.
Usually when someone sends or creates massive sets of files they’ll also create documentation. One of the common items included in the documentation is the record layout which indicates variable types and length so a user doesn’t have to guess at the types or where a variable starts or ends.
Then, as I stated earlier, set it to 1 million. If all the files have the same layout you only need to do this once and use the same data step code to read all.
@xyxu wrote:
Thanks for the reply. Unfortunately I do have millions of rows for each csv file. I tried GUESSINGROWS=MAX and it takes forever to run the code.
What does it mean by "Confirm against a record layout file"?
The main thing is that if you want to read character strings into date and time variables then you need to tell SAS what INFORMAT to use when reading it.
Also don't use the FORMAT statement as it it is intended to DEFINE the variable's type and length. SAS will use the format specification you attach to a variable to make a GUESS about how to the define the variable, if the first place you reference the variable name is in the FORMAT statement, but that is not the purpose of a FORMAT statement. The purpose of the FORMAT statement is tell SAS how to display the value of the variable. So for DATE and TIME (and DATETIME) variables this is very important as the raw number of days or number of seconds is not that easy for humans to understand. But for most variables there is no need to attach either a FORMAT or an INFORMAT. SAS already knows how to read and write numbers and strings.
data want;
infile "&path\sample.csv" dsd firstobs=2 truncover ;
length IP $15 date 8 time 8 zone $5 cik $7 accession $20 extension $30 code $3 size 8;
format date yymmdd10. time time.;
informat date mmddyy. time time.;
input IP -- size ;
run;
Note that SAS will normally use BEST12. to display numbers when no explicit format is specified so there is not much value in attaching the F10. format to the SIZE variable. But you might want to consider attaching the COMMA13. format to it if you have large numbers to make them easier for humans to read.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.