🔒 This topic is solved and locked.
xyxu
Quartz | Level 8

Hi,

 

I was trying to import a CSV file. A sample of the data is attached. The code I used is:

 

data want;
	infile "&path\sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
	format IP $15.;
	format date yymmdd10.;
	format time hhmm.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input 
		IP $
		date 
		time 
		zone $
		cik $
		accession $
		extension $
		code $
		size
	;
run;

In the log, I got a lot of "NOTE: Invalid data for time in line 21 28-35."


From the imported dataset, there are two issues I can tell. The first is that the date and time variables are always missing. The second is that, for the variable cik, some observations take values like "95521.0" or "873860.", while my intention was to keep it purely as a character variable.

 

A more general question is, how can I properly import such CSV files with a data step? Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User

The main thing is that if you want to read character strings into date and time variables, then you need to tell SAS what INFORMAT to use when reading them.

 

Also, don't use the FORMAT statement as if it were intended to DEFINE the variable's type and length.  SAS will use the format specification you attach to a variable to make a GUESS about how to define the variable if the first place you reference the variable name is in the FORMAT statement, but that is not the purpose of a FORMAT statement. The purpose of the FORMAT statement is to tell SAS how to display the value of the variable. For DATE and TIME (and DATETIME) variables this is very important, as the raw number of days or number of seconds is not easy for humans to understand.  But for most variables there is no need to attach either a FORMAT or an INFORMAT; SAS already knows how to read and write numbers and strings.

 

data want;
  infile "&path\sample.csv" dsd firstobs=2 truncover ;
  length IP $15 date 8 time 8 zone $5 cik $7 accession $20 extension $30 code $3 size 8;
  format date yymmdd10. time time.;
  informat date mmddyy. time time.;
  input IP -- size ;
run;

Note that SAS will normally use BEST12. to display numbers when no explicit format is specified, so there is not much value in attaching the F10. format to the SIZE variable.  But you might want to consider attaching the COMMA13. format to it if you have large numbers, to make them easier for humans to read.
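
For example, a quick way to compare the two displays in the log (the value here is made up):

data _null_;
  size = 12345678;
  put size= best12. size= comma13.;  /* writes the same value under both formats to the log */
run;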


13 REPLIES
novinosrin
Tourmaline | Level 20

Your input statement requires a bit of work.

 

You are trying to code a simple list input just by specifying the data types of the variables and the type of delimiter. What you need is modified list input, with a : (colon) followed by the informat, to correctly read nonstandard data such as dates and times. Alternatively, if modified list input is confusing, specify the informats in an INFORMAT statement before the INPUT statement.
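
For instance, a modified list input version of your step might look like this (just a sketch; the :yymmdd10. and :time8. informats are placeholders, so swap in whichever informats actually match the raw date and time text in your file):

data want;
	infile "&path\sample.csv" delimiter=',' missover firstobs=2 dsd lrecl=32767;
	/* the colon (:) tells SAS to apply the informat while still reading up to the next delimiter */
	input IP :$15. date :yymmdd10. time :time8. zone :$5. cik :$7.
	      accession :$20. extension :$30. code :$3. size;
	format date yymmdd10. time time8.;
run;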

 

Hmm, I just noticed your FORMAT statement. Make all of those informats in an INFORMAT statement and retype the FORMAT statement as is after the INPUT statement. HTH

 

 

data want;
	infile "&path\sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
informat IP $15.
date yymmdd10.
time hhmm.
zone $5.
cik $7.
accession $20.
extension $30.
code $3.
size 10.;
	format IP $15.;
	format date yymmdd10.;
	format time hhmm.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input IP $ date time zone $ cik $ accession $ extension $ code $ size;
run;
xyxu
Quartz | Level 8
Thanks! My remaining issue is the format of variable "cik". It is now character, which is correct, but for many observations, there are decimal points. How can I get rid of these weird results?
novinosrin
Tourmaline | Level 20
What do you mean by weird decimal points once you have read and stored them as character with a $7. informat? Do you want to manipulate the values and store only the digits that occur before the decimal point as characters?
xyxu
Quartz | Level 8
By weird decimal points I mean values like "95521.0" or "873860.". Yes, I want to keep these char type numbers as only integers, since they are identifiers for each firm. I thought that by restricting them to be chars, I would have only strings like "12345678" and "12345", instead of "12345.0".
novinosrin
Tourmaline | Level 20

just do a small and simple manipulation like:

 

cik = ifc(index(cik,'.') > 0, substr(cik, 1, index(cik,'.')-1), cik);
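
An equivalent one-liner, assuming a period only ever shows up as that trailing decimal part, is:

cik = scan(cik, 1, '.');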

Reeza
Super User

If it’s being read in like that it’s very likely those characters are in your text file. Verify your source data. 

xyxu
Quartz | Level 8
I checked the csv file and it is indeed that way. For example, there is an observation with "cik = 7789" in the csv file, while the corresponding observation in the imported dataset has "cik = 7789.0"
art297
Opal | Level 21

In just testing your sample.csv file, I saw that the wrong informats were suggested for both date and time. I doubt if the following will correct your other problem, but it should correct date and time. As for your problem with the cik variable, can you post a new sample.csv file that contains at least one record that results in the problem you described?

 

data want;
	infile "&path./sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
informat IP $15.
date mmddyy10.
time time8.
zone $5.
cik $7.
accession $20.
extension $30.
code $3.
size 10.;
	format IP $15.;
	format date yymmdd10.;
	format time time8.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input 
		IP
		date 
		time 
		zone
		cik
		accession
		extension
		code
		size
	;
run;

Art, CEO, AnalystFinder.com

 

ballardw
Super User

@xyxu wrote:
I checked the csv file and it is indeed that way. For example, there is an observation with "cik = 7789" in the csv file, while the corresponding observation in the imported dataset has "cik = 7789.0"

Did you look at your file with a spreadsheet such as Excel or a text editor such as NotePad, WordPad, even the SAS editor, or similar? Excel will by default not show the .0 because it is being "helpful" as in "you don't need to see no stinking decimals with 0".

 

NEVER trust a spreadsheet to faithfully display the contents of a file created as CSV. There is a lot of appearance reformatting that can happen to values that just happen to look numeric but should be character.
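
One way to see the raw records without any spreadsheet "help" is to dump a few of them straight to the SAS log (a minimal sketch, assuming the same &path as in your code):

data _null_;
	infile "&path\sample.csv" obs=10;	/* read only the first 10 records */
	input;
	put _infile_;	/* write each raw record to the log exactly as it appears in the file */
run;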

Reeza
Super User

@xyxu wrote:
A more general question is, how can I properly import such CSV files with a data step? Thanks!


1. First import the file with PROC IMPORT and set GUESSINGROWS=MAX (unless your file is millions of rows, in which case maybe use 1000000); see the sketch after this list. 

2. Check the log. The code in the log will be pretty darn close to what you want. 

3. Copy the code from the log to an editor, modify as necessary. 

4. Confirm against a record layout file and/or verification of the data via the highly tested eyeball method. 
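
A minimal sketch of step 1 (the output dataset name work.check is just a placeholder):

proc import datafile="&path\sample.csv"
	out=work.check
	dbms=csv
	replace;
	guessingrows=max;	/* scan every row before deciding variable types */
run;

The DATA step that PROC IMPORT generates is echoed in the log; that is the code to copy and adjust in steps 2 and 3.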

xyxu
Quartz | Level 8
Thanks for the reply. Unfortunately I do have millions of rows for each csv file. I tried GUESSINGROWS=MAX and it takes forever to run the code.

What is meant by "Confirm against a record layout file"?
Reeza
Super User

Usually when someone sends or creates massive sets of files, they'll also create documentation. One of the common items included in the documentation is the record layout, which indicates variable types and lengths so a user doesn't have to guess at the types or where a variable starts or ends. 

 

Then, as I stated earlier, set it to 1 million. If all the files have the same layout, you only need to do this once and can use the same data step code to read them all. 


@xyxu wrote:
Thanks for the reply. Unfortunately I do have millions of rows for each csv file. I tried GUESSINGROWS=MAX and it takes forever to run the code.

What is meant by "Confirm against a record layout file"?

 


Discussion stats
  • 13 replies
  • 27184 views
  • 2 likes
  • 6 in conversation