I have a bunch of large CSV files to import, each of which well exceeds the maximum number of rows an Excel spreadsheet can hold. The data are further complicated by a mixture of different formats of the same column. For instance, dates are sometimes recorded as "16/12/20" and sometimes "16 12 20". SAS fails to correctly import the data, seemingly missing those observations with atypical data formats along the course of import. Moreover, as the files are CSV's, handy options of PROC IMPORT like MIXED and SCANTEXT are not available.
My objectives of import are as follows:
(1) Correctly import the data into SAS;
(2) If not possible without the formidable task of finding the atypical observations in the original dataset and changing their formats, then set the columns housing the data with atypical formats as missing values instead of simply skipping them.
Any suggestions? Thank you!
Read the field into a temporary character variable and add code for detecting the format and doing an appropriate conversion.
If it's only slashes vs. blanks, that can be handled with a single TRANSLATE call.
@Season wrote:
Thank you! Could you please provide some hint on how to construct temporary variables? Hopefully this does not entail the specification of each of every column in the CSV file like I see in many SAS codes importing CSV's. I have too many columns to specify them one by one.
First of all: do NOT use PROC IMPORT in production for text files. NEVER. EVER. I mean it. It is much too prone to create inconsistent results, and will sometimes result in a seemingly successful code that created garbage in reality.
You can use PROC IMPORT to create a basic DATA step, which you grab from the log and then modify/expand as needed. You will find that this code is typical for machine-created code. Clumsy, hard to read, and hard to maintain. But it is a beginning.
Here a quick example for converting your strings:
data want;
infile datalines dsd;
input _date :$10.;
format date yymmdd10.;
date = input(translate(strip(_date),"/"," "),ddmmyy10.);
datalines;
16/12/20
16 12 20
;Note that this is not necessary for the examples you posted. DDMMYY10. will read a date with blanks the same as with slashes.
If you are using PROC IMPORT or the Import Wizard to import the CSV files, you can use the GUESSINGROWS= statement (Number of rows to guess in the Import Wizard Options box) to tell SAS to look beyond the first 20 rows (the default) to determine a variable's type. If you have different formats in a column, you need it to be imported as character and then you can make changes after the data is read in.
proc import datafile='c:\temp\test.csv'
 out=data_one dbms=csv replace;
guessingrows=1000;
run;
@Kathryn_SAS wrote:
...If you have different formats in a column, you need it to be imported as character and then you can make changes after the data is read in.
I've had fairly good luck with SAS using the ANYDTDTE , ANYDTDTM or ANYDTTME with Proc Import generated code when date, datetime or time values appear in different layouts for a single variable. But I don't remember any where space was the delimiter between the data elements.
However that is best when the years are 4 digits. With 2 digit years you have to pray that the order of values entered does match the national language settings.
Thanks for the input! Yes, you are right. Space serves as the delimiter in none of the three informats you mentioned in your post.
@ballardw wrote:
However that is best when the years are 4 digits. With 2 digit years you have to pray that the order of values entered does match the national language settings.
I am also rather apprehensive on this issue. With no further information in my dataset, it is likely that I end up designating the digits with the wrong meaning. However, as dates are not the primary concern for the time being, I have to focus on the goal of successful import right now and think about the issue of dates when conditions are permitted.
Thank you so much! That is a very handy option.
By the way, in addition to the GUESSINGROWS statement, are there any useful options and/or statements available in importing CSV's? As I have said in my original question, many of the options and/or statements of PROC IMPORT are not available for CSV's, which is frustrating.
When importing delimited files, the options and statements that are available are documented here:
Thanks for your response. But I beg to point out that the options listed in SAS Help are not very informative as many of those listed there are not available for CSV's.
The documentation link that I sent is specifically for delimited files - csv, tab, dlm - so it does not include statements/options that are not available for you to use for your CSV files.
Another thought would be that you can run PROC IMPORT because it uses DATA step code behind the scenes. Then you could grab that code either from the log or using Run -> Recall Last Submit and then you could make adjustments to that DATA step code as others have suggested. You just wouldn't have to start from scratch.
@Kathryn_SAS wrote:
The documentation link that I sent is specifically for delimited files - csv, tab, dlm - so it does not include statements/options that are not available for you to use for your CSV files.
Thank you for your reminder! I should had taken a second look.
@Kathryn_SAS wrote:Another thought would be that you can run PROC IMPORT because it uses DATA step code behind the scenes. Then you could grab that code either from the log or using Run -> Recall Last Submit and then you could make adjustments to that DATA step code as others have suggested. You just wouldn't have to start from scratch.
Thank you for your valuable piece of information! So that makes DATA step-based import of CSV files easier. Starting from scratch is so painful and boring for a data analyst who wish to concentrate on data analysis instead of such dull, monotonic work.
What do you mean by varying? Do you mean that it changes from file to file? Or do you mean it changes within a single file?
Your example is not actually a problem for SAS to read (no idea if it is problem for PROC IMPORT to guess how to handle).
data test;
  infile cards dsd truncover;
  input date :ddmmyy.;
  format date yymmdd10.;
cards;
"16/12/20"
"16 12 20"
;Obs date 1 2020-12-16 2 2020-12-16
@Tom wrote:
What do you mean by varying? Do you mean that it changes from file to file? Or do you mean it changes within a single file?
I mean "within a single file".
@Tom wrote:Your example is not actually a problem for SAS to read (no idea if it is problem for PROC IMPORT to guess how to handle).
data test; infile cards dsd truncover; input date :ddmmyy10.; format date yymmdd10.; cards; "16/12/20" "16 12 20" ;Obs date 1 2020-12-16 2 2020-12-16
Yes, I did try PROC IMPORT. Does importing the file with the DATA step necessitates specification of the name and informat of each of every variable in the CSV? That would be a very formidable job as I have possibly thousands of columns in all.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
