I have a plethora of CSV files to read, each with a different starting row due to some comments at the beginning. For example:
Comment 1
Comment 2
,X,Y,Z
1988,0.00,0.00,0.00
1989,0.00,0.00,0.00
1990,0.00,0.00,0.00
1991,0.00,0.00,0.00
In another file, for example,
Comment 3
Comment 4
Comment 5
Comment 6
Comment 7
Comment 8
,P,Q,U,V,W
2001,0.00,0.00,0.00,0.00,0.00
2002,0.00,0.00,0.00,0.00,0.00
2003,0.00,0.00,0.00,0.00,0.00
2004,0.00,0.00,0.00,0.00,0.00
2005,0.00,0.00,0.00,0.00,0.00
So a %MACRO with a fixed FIRSTOBS is impossible. One common rule I found is that the data part of every CSV file starts with a comma at the beginning (e.g., ",X" in the first and ",P" in the second). Would there be any way to trim those starting comments from the CSV files (probably with some INFILE, FILE, INPUT, and PUT combination) until SAS recognizes the line that starts with a comma? After that, PROC IMPORT would be able to read the rest properly. Thanks.
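To pin the rule down precisely: skip leading lines until the first line whose first character is a comma, and keep everything from there on. A minimal sketch of that logic (in Python, purely for illustration; the file contents are the example above):

```python
def trim_leading_comments(lines):
    """Drop leading comment lines; keep everything from the first
    line that starts with a comma (the header row) onward."""
    for i, line in enumerate(lines):
        if line.startswith(","):
            return lines[i:]
    return []  # no header row found

raw = ["Comment 1", "Comment 2", ",X,Y,Z", "1988,0.00,0.00,0.00"]
print(trim_leading_comments(raw))  # [',X,Y,Z', '1988,0.00,0.00,0.00']
```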
Do you have the list of the filenames? Or are they all in the same directory? Or follow some simple naming pattern?
Let's assume you have a list of the filenames. If not, you can easily make one.
data filenames;
  infile "ls /mydir/*.csv" pipe truncover;
  input fname $255.;
run;
Use the list to read the top of each file. Perhaps you want to save the comments?
data
  comments (keep=filename comment)
  firstrow (keep=filename row)
;
  set filenames;
  filename = fname;
  infile csv filevar=fname truncover end=eof;
  do row=1 by 1 while (not eof);
    input comment $200.;
    if char(_infile_,1)=',' then leave;
    output comments;
  end;
  row = row + 1;  * first data row is one past the header row ;
  output firstrow;
run;
Then you can use the list of first row numbers to generate data step (or PROC IMPORT) code.
filename code temp;
data _null_;
  set firstrow;
  file code;
  fileno+1;
  put 'proc import datafile=' filename :$quote.
    / ' out=csv' fileno 'replace'
    / ' dbms=csv'
    / ';'
    / ' datarow=' row ';'
    / 'run;'
  ;
run;
%include code / source2;
How automatic does it need to be? I typically do NOT use PROC IMPORT to read CSV files. It has to guess what is in the file. You can guess much better than PROC IMPORT can, especially if your file only has numeric variables, like in your examples.
For your examples I would just look at the file using any text editor (the SAS program editor, for example) or a simple DATA _NULL_ step and figure out the number of columns and the first data row.
data one;
  infile 'one.txt' firstobs=5 dsd truncover;
  input year x y z;
run;
If you have a bunch of files with the same 4 columns, but a varying number of comment lines, you could make the step smart enough to skip the header on its own. Perhaps by looking for the header row?
data one;
  infile 'one.txt' firstobs=1 dsd truncover;
  if _n_=1 then do while (1=1);
    input;
    if _infile_=',X,Y,Z' then leave;
  end;
  input year x y z;
run;
Primarily, there are more than 200 CSVs with different variables (and different numbers of columns) as well as different comments. I would specify all the FIRSTOBS and INPUT statements manually, as you said, if there were only about 30 CSVs.
Very much appreciated, but there is one issue: even with DATAROW, PROC IMPORT uses the very first row to recognize the variable names, so the variable name information doesn't survive this way. Is there still a way to have SAS physically access and remove the comment rows until it reaches the row starting with a comma?
@Junyong wrote:
Very much appreciated, but there is one issue: even with DATAROW, PROC IMPORT uses the very first row to recognize the variable names, so the variable name information doesn't survive this way. Is there still a way to have SAS physically access and remove the comment rows until it reaches the row starting with a comma?
Yet another reason not to use PROC IMPORT. It should be a trivial modification to the data step I posted to add another loop that reads the rest of the file and writes it to a new file.
data filenames;
  infile "ls /mydir/*.csv" pipe truncover;
  input fname $255.;
run;

data _null_;
  set filenames;
  outfile = cats(fname,'.fixed');
  infile in filevar=fname end=eof;
  file out filevar=outfile;
  do while (not eof);
    input;
    if char(_infile_,1)=',' then found=1;
    if found then put _infile_;
  end;
run;
If your values are all numbers, like in your example, then just generate data step code instead. Let's make your two example files:
%let path=c:\downloads\;
filename txt1 "&path.file1.csv";
filename txt2 "&path.file2.csv";
options parmcards=txt1;
parmcards4;
Comment 1
Comment 2
,X,Y,Z
1988,0.00,0.00,0.00
1989,0.00,0.00,0.00
1990,0.00,0.00,0.00
1991,0.00,0.00,0.00
;;;;
options parmcards=txt2;
parmcards4;
Comment 3
Comment 4
Comment 5
Comment 6
Comment 7
Comment 8
,P,Q,U,V,W
2001,0.00,0.00,0.00,0.00,0.00
2002,0.00,0.00,0.00,0.00,0.00
2003,0.00,0.00,0.00,0.00,0.00
2004,0.00,0.00,0.00,0.00,0.00
2005,0.00,0.00,0.00,0.00,0.00
;;;;
And then analyze them:
data comments
     files(rename=(row=firstobs comment=varlist))
;
  fileno+1;
  length memname $32;
  memname = cats('CSV',fileno);
  input fname $80.;
  filename = fname;
  infile csv filevar=fname end=eof truncover;
  do row=1 by 1 while (not eof);
    input comment $100.;
    if char(comment,1)=',' then leave;
    output comments;
  end;
  row = row + 1;
  comment = translate('year'||comment,' ',',');
  output files;
cards;
c:\downloads\file1.csv
c:\downloads\file2.csv
;
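The line `comment=translate('year'||comment,' ',',')` turns the header row into an INPUT-ready variable list: it prefixes a name for the unnamed first column and swaps commas for spaces. A rough equivalent of that transformation, sketched in Python purely for illustration:

```python
def header_to_varlist(header_row: str, first_col: str = "year") -> str:
    """Turn a CSV header like ',X,Y,Z' into a space-separated
    variable list like 'year X Y Z' (mirrors the SAS translate trick)."""
    return (first_col + header_row).replace(",", " ")

print(header_to_varlist(",X,Y,Z"))      # year X Y Z
print(header_to_varlist(",P,Q,U,V,W"))  # year P Q U V W
```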
And then use the generated data to write code to read them:
filename code temp;
data _null_;
  set files;
  file code;
  put 'data ' memname ';'
    / ' infile ' filename :$quote. 'dsd truncover ' firstobs= ';'
    / ' input ' varlist ';'
    / 'run;'
  ;
run;
%include code / source2;
2120  %include code / source2;
NOTE: %INCLUDE (level 1) file CODE is file ...\#LN00157.
2121 +data CSV1 ;
2122 + infile "c:\downloads\file1.csv" dsd truncover firstobs=5 ;
2123 + input year X Y Z ;
2124 +run;

NOTE: The infile "c:\downloads\file1.csv" is:
      Filename=c:\downloads\file1.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=116,
      Last Modified=16Jan2020:16:55:22,
      Create Time=16Jan2020:16:47:06

NOTE: 4 records were read from the infile "c:\downloads\file1.csv".
      The minimum record length was 19.
      The maximum record length was 19.
NOTE: The data set WORK.CSV1 has 4 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

2125 +data CSV2 ;
2126 + infile "c:\downloads\file2.csv" dsd truncover firstobs=9 ;
2127 + input year P Q U V W ;
2128 +run;

NOTE: The infile "c:\downloads\file2.csv" is:
      Filename=c:\downloads\file2.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=235,
      Last Modified=16Jan2020:16:55:22,
      Create Time=16Jan2020:16:47:06

NOTE: 5 records were read from the infile "c:\downloads\file2.csv".
      The minimum record length was 29.
      The maximum record length was 29.
NOTE: The data set WORK.CSV2 has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Thanks again, but I am still considering PROC IMPORT rather than a DATA step, notwithstanding the shortcomings, for another reason. Here I attach one example.
This file was created by CMPT_ME_RETS using the 201911 CRSP database.
It contains value- and equal-weighted returns for size portfolios.
Each record contains returns for:
    Negative (not used)
    30% 40% 30%
    5 Quintiles
    10 Deciles
The portfolios are constructed at the end of Jun.
The annual returns are from January to December.
Missing data are indicated by -99.99 or -999.

Value Weight Returns -- Monthly
,<= 0,Lo 30,Med 40,Hi 30,Lo 20,Qnt 2,Qnt 3,Qnt 4,Hi 20,Lo 10,Dec 2,Dec 3,Dec 4,Dec 5,Dec 6,Dec 7,Dec 8,Dec 9,Hi 10
192607, -99.99, 0.14, 1.59, 3.42, 0.37, 0.78, 1.68, 1.39, 3.67, -0.12, 0.52, -0.05, 1.31, 1.21, 2.04, 1.58, 1.29, 3.53, 3.71
192608, -99.99, 3.21, 2.77, 2.91, 2.26, 3.51, 3.75, 1.53, 3.07, 1.33, 2.55, 4.00, 3.20, 2.81, 4.45, 1.61, 1.49, 0.61, 3.79
192609, -99.99, -1.74, -0.88, 0.80, -1.39, -1.06, 0.05, -0.26, 0.81, 0.59, -2.00, -2.01, -0.46, -0.06, 0.14, -2.02, 0.74, -0.77, 1.25
The variable name row starts with a comma here too. However, the variable names are not clean enough for SAS to use directly, since they contain not only math symbols such as "<= 0" but also blanks such as "Lo 30". PROC IMPORT actually does a great job of dealing with those situations.
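For reference, the kind of cleanup PROC IMPORT performs on such headers amounts to replacing characters that are invalid in SAS names with underscores. A rough sketch of that idea in Python (illustrative only, not PROC IMPORT's exact algorithm):

```python
import re

def sanitize_name(raw: str) -> str:
    """Turn a messy CSV header cell into a valid SAS-style name:
    letters, digits, and underscores only, not starting with a digit.
    (Illustrative only -- not PROC IMPORT's exact algorithm.)"""
    name = re.sub(r"[^A-Za-z0-9_]", "_", raw.strip())
    if not name or name[0].isdigit():
        name = "_" + name
    return name[:32]  # SAS names are limited to 32 characters

header = ",<= 0,Lo 30,Med 40,Hi 30"
print([sanitize_name(cell) for cell in header.split(",")[1:]])
```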
You could easily write a program to read that file and generate a data step like this to read it.
data want ;
infile 'myfile.txt' dsd firstobs=14 truncover ;
label id ="Value Weight Returns -- Monthly"
var1='<= 0'
var2='Lo 30'
var3='Med 40'
var4='Hi 30'
var5='Lo 20'
var6='Qnt 2'
var7='Qnt 3'
var8='Qnt 4'
var9='Hi 20'
var10='Lo 10'
var11='Dec 2'
var12='Dec 3'
var13='Dec 4'
var14='Dec 5'
var15='Dec 6'
var16='Dec 7'
var17='Dec 8'
var18='Dec 9'
var19='Hi 10'
;
input id var1-var19;
run;