Solved: Re: Use data step to import a CSV file with date and time

xyxu · Posted 01-06-2018 06:39 PM

Hi,

I was trying to import a csv file. A sample data is attached. The code I used is

data want;
	infile "&path\sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
	format IP $15.;
	format date yymmdd10.;
	format time hhmm.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input 
		IP $
		date 
		time 
		zone $
		cik $
		accession $
		extension $
		code $
		size
	;
run;

In the log, I got a lot of "NOTE: Invalid data for time in line 21 28-35."

From the imported dataset, there are two issues I can tell. The first is that, date and time variables are always missing. The second is, for variable cik, some takes values like "95521.0" or "873860.", while my intention was to keep it purely as a character.

A more general question is, how to perfectly import such csv files with data step? Thanks!

Tom · Posted 01-07-2018 11:07 AM

The main thing is that if you want to read character strings into date and time variables then you need to tell SAS what INFORMAT to use when reading it.

Also don't use the FORMAT statement as it it is intended to DEFINE the variable's type and length. SAS will use the format specification you attach to a variable to make a GUESS about how to the define the variable, if the first place you reference the variable name is in the FORMAT statement, but that is not the purpose of a FORMAT statement. The purpose of the FORMAT statement is tell SAS how to display the value of the variable. So for DATE and TIME (and DATETIME) variables this is very important as the raw number of days or number of seconds is not that easy for humans to understand. But for most variables there is no need to attach either a FORMAT or an INFORMAT. SAS already knows how to read and write numbers and strings.

data want;
  infile "&path\sample.csv" dsd firstobs=2 truncover ;
  length IP $15 date 8 time 8 zone $5 cik $7 accession $20 extension $30 code $3 size 8;
  format date yymmdd10. time time.;
  informat date mmddyy. time time.;
  input IP -- size ;
run;

Note that SAS will normally use BEST12. to display numbers when no explicit format is specified so there is not much value in attaching the F10. format to the SIZE variable. But you might want to consider attaching the COMMA13. format to it if you have large numbers to make them easier for humans to read.

View solution in original post

novinosrin · Posted 01-06-2018 06:46 PM

Your input statement requires bit of work.

You are trying to code a simple list input just by specifying datatypes of the variables and the type of delimiter. What you need is a modified list input with : (colon) followed by the informat to exactly read nonstandard data such as date and time. Alternatively, if modified list input is confusing , specify informats in an informat statement before input statement.

hmm I just noticed your format statement. Make all those as informats in an informat statement and retype the format statement as is after input statement. HTH

data want;
	infile "&path\sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
informat IP $15.
date yymmdd10.
time hhmm.
zone $5.
cik $7.
accession $20.
extension $30.
code $3.
size 10.;
	format IP $15.;
	format date yymmdd10.;
	format time hhmm.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input 
		IP $
		date 
		time 
		zone $
		cik $
		accession $
		extension $
		code $
		size
	;
run;

xyxu · Posted 01-06-2018 08:56 PM

Thanks! My remaining issue is the format of variable "cik". It is now character, which is correct, but for many observations, there are decimal points. How can I get rid of these weird results?

novinosrin · Posted 01-06-2018 09:01 PM

What do you mean by weird decimal points once you have read and stored them as character with a $7. informat? do you wanna manipulate and store only the numbers as chars that occur before decimal points?

xyxu · Posted 01-06-2018 09:08 PM

By weird decimal points I mean values like "95521.0" or "873860.". Yes, I want to keep these char type numbers as only integers, since they are identifiers for each firm. I thought that by restricting them to be chars, I would have only strings like "12345678" and "12345", instead of "12345.0".

novinosrin · Posted 01-06-2018 09:15 PM

just do a small and simple manipulation like:

cik-=ifc(index(cik,'.')>0, substr(cik,1,index(cik,'.')-1), cik);

Reeza · Posted 01-06-2018 10:14 PM

If it’s being read in like that it’s very likely those characters are in your text file. Verify your source data.

xyxu · Posted 01-06-2018 10:25 PM

I checked the csv file and it is indeed that way. For example, there is an observation with "cik = 7789" in the csv file, while the corresponding observation in the imported dataset has "cik = 7789.0"

art297 · Posted 01-07-2018 12:16 AM

In just testing your sample.csv file, I saw that the wrong informats were suggested for both date and time. I doubt if the following will correct your other problem, but it should correct date and time. As for your problem with the cik variable, can you post a new sample.csv file that contains at least one record that results in the problem you described?

data want;
	infile "&path./sample.csv"
    	delimiter = ','
    	missover
    	firstobs = 2
    	DSD 
    	lrecl = 32767;
informat IP $15.
date mmddyy10.
time time8.
zone $5.
cik $7.
accession $20.
extension $30.
code $3.
size 10.;
	format IP $15.;
	format date yymmdd10.;
	format time time8.;
	format zone $5.;
	format cik $7.;
	format accession $20.;
	format extension $30.;
	format code $3.;
	format size 10.;
	input 
		IP
		date 
		time 
		zone
		cik
		accession
		extension
		code
		size
	;
run;

Art, CEO, AnalystFinder.com

ballardw · Posted 01-08-2018 11:26 AM

@xyxu wrote:
I checked the csv file and it is indeed that way. For example, there is an observation with "cik = 7789" in the csv file, while the corresponding observation in the imported dataset has "cik = 7789.0"

Did you look at your file with a spreadsheet such as Excel or a text editor such as NotePad, WordPad, even the SAS editor, or similar? Excel will by default not show the .0 because it is being "helpful" as in "you don't need to see no stinking decimals with 0".

NEVER trust a spreadsheet to display a file created as CSV to actually show the contents. There are lots of appearance reformatting things that can happen with values that just happen to look numeric but should be character.

Reeza · Posted 01-06-2018 07:39 PM

@xyxu wrote:

A more general question is, how to perfectly import such csv files with data step? Thanks!

1. First import the file with PROC IMPORT and set GUESSINGROWS=MAX (except if your file is millions of rows, then may use 1000000).

2. Check the log. The code in the log will be pretty darn close to what you want.

3. Copy the code from the log to an editor, modify as necessary.

4. Confirm against a record layout file and/or verification of the data via the highly tested eyeball method.

xyxu · Posted 01-06-2018 08:08 PM

Thanks for the reply. Unfortunately I do have millions of rows for each csv file. I tried GUESSINGROWS=MAX and it takes forever to run the code.

What does it mean by "Confirm against a record layout file"?

Reeza · Posted 01-06-2018 10:17 PM

Usually when someone sends or creates massive sets of files they’ll also create documentation. One of the common items included in the documentation is the record layout which indicates variable types and length so a user doesn’t have to guess at the types or where a variable starts or ends.

Then, as I stated earlier, set it to 1 million. If all the files have the same layout you only need to do this once and use the same data step code to read all.

@xyxu wrote:
Thanks for the reply. Unfortunately I do have millions of rows for each csv file. I tried GUESSINGROWS=MAX and it takes forever to run the code.

What does it mean by "Confirm against a record layout file"?

Tom · Posted 01-07-2018 11:07 AM

The main thing is that if you want to read character strings into date and time variables then you need to tell SAS what INFORMAT to use when reading it.

Also don't use the FORMAT statement as it it is intended to DEFINE the variable's type and length. SAS will use the format specification you attach to a variable to make a GUESS about how to the define the variable, if the first place you reference the variable name is in the FORMAT statement, but that is not the purpose of a FORMAT statement. The purpose of the FORMAT statement is tell SAS how to display the value of the variable. So for DATE and TIME (and DATETIME) variables this is very important as the raw number of days or number of seconds is not that easy for humans to understand. But for most variables there is no need to attach either a FORMAT or an INFORMAT. SAS already knows how to read and write numbers and strings.

data want;
  infile "&path\sample.csv" dsd firstobs=2 truncover ;
  length IP $15 date 8 time 8 zone $5 cik $7 accession $20 extension $30 code $3 size 8;
  format date yymmdd10. time time.;
  informat date mmddyy. time time.;
  input IP -- size ;
run;

Note that SAS will normally use BEST12. to display numbers when no explicit format is specified so there is not much value in attaching the F10. format to the SIZE variable. But you might want to consider attaching the COMMA13. format to it if you have large numbers to make them easier for humans to read.

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away