Hi all,
I have a fairly large data file in csv format. I am trying to figure out how to read this through the data step and I am having problems. This is because most variables have variable lengths and also some fields have long comment entries and have what I believe are return lines (carriage returns..?). After some unsuccessful attempts, I decided to try and read all the data in one column and manually count the length of each variable row by row and retain the longest length returned for each column. Then I use this information to build a data step input line in a macro to re-read the original csv file. Things almost work except for the following:
- I am still picking up the extra lines due to some variables having long comment entries spanning multiple lines. I played around with the "termstr = crlf" option but did not have any luck.
- As constructed, all variables are picked up as text. Unless someone has any tricks on how to automatically identify certain columns as numeric or date, will I have to do this manually?
The data file is publicly available at:
http://www.californiasolarstatistics.ca.gov/current_data_files/
(click on "Current Working Data Set").
Program:
filename csi_in "C:\Workdata\Data\CSI\WorkingDataSet_1-02-2013.csv" ;
data test_b1 (keep = junk test) test_b1b (keep = Id C_name) ;
attrib C_name length = $32 ;
infile csi_in obs = 1 length = recl lrecl = 5000 ;
input @ ; /* hold line and get size of line */
input junk $varying5000. recl ;
test = countw(junk,',') ; /* get # of columns */
call symput('varl',test) ; /* store column count in global macro varialbe */
output test_b1 ;
do Id = 1 to test ;
/* build list of column names - strip spaces & other characters not used for column naming */
C_name = compress(compress(scan(junk,Id,','),'()-/#%','p')) ;
output test_b1b ;
end ;
run ;
%put &varl ;
data test_b2 (keep = length_test: test) test_b2b (keep = Id C_length) ;
array length_test(&varl) ;
retain length_test ;
infile csi_in firstobs = 2 length = recl lrecl = 800 end = eof /* termstr = crlf */ ;
input @ ; /* hold line and get size of line */
input junk $varying800. recl ;
test = countw(junk,',','mq') ; /* get count of columns for current row */
/* delete rows having comment entries - these are from prior row */
if trim(scan(junk,2,',','mq')) not in ('PG&E','GRID Alternatives','CCSE','SCE') then delete ;
if eof then do ;
do Id = 1 to &varl ; /* store max length of column */
C_length = max(1,length_test(Id)) ;
output test_b2b ;
end ;
end ;
do i = 1 to test ;
temp_length = lengthc(scan(junk,i,',','mq')) ; /* get length of the value stored in variable */
if _n_ = 1 then do ; /* if the first row - initalize */
length_test(i) = temp_length ;
end ; /* if column was not picked up before - update */
else if ((length_test(i) = .) and (temp_length ne 0)) then do ;
length_test(i) = temp_length ;
end ; /*update if current length is bigger than prior lengths */
else if (length_test(i) lt temp_length) then do ;
length_test(i) = temp_length ;
end ;
end ;
output test_b2 ;
run ;
proc sql ;
update Test_b1b /* manual fix - column names cannot start with a number */
set C_name = 'TrdPartyOwner'
where C_name = '3rdPartyOwner' ;
create table Colm_input_att as /* create attribute of colmn: column name colon modifier character type length */
select a1.Id,cat(a1.C_name,' : $',a2.C_length,'.') as Inf_c
length = 45
from test_b1b as a1 natural inner join test_b2b as a2 ;
quit ;
%macro read_csi2(col_c,file_r_n) ;
proc sql ; /* store column attribute in macro variable */
select Inf_c
into :col_att separated by '/'
from Colm_input_att ;
quit ;
%let j = 1 ;
data test_csi ;
infile &file_r_n firstobs = 2 dsd missover termstr = crlf ;
input
%do %until(%scan(&col_att,&j,'/') eq ) ;
%let temp_col_att = %scan(&col_att,&j,'/') ;
&temp_col_att
%let j = %eval(&j+1) ;
%end ;
;
run ;
%mend read_csi2 ;
%read_csi2(&varl,csi_in)
For this file you can try to fix it by concatenating lines until the quotes balance. When I try that it reduces the number of lines by 1,168.
This program will replace the extra line breaks with vertical bar '|'. You could use another character or just not put in anything.
filename new temp;
data _null_ ;
if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;
infile 'c:\downloads\WorkingDataSet_1-09-2013.csv' dlm=',' dsd truncover
lrecl=10000 termstr=crlf
firstobs=1
end=eof
;
file new lrecl=10000;
nq=0;
do until (mod(nq,2)=0 or eof );
input;
newn+1;
nq = nq + countc(_infile_,'"');
put _infile_ @;
if mod(nq,2) then do;
missq+1;
put '|' @;
end;
end;
put;
run;
Can you identify a record number or text string that can be used to identify one or more of the problem records?
This is what I get for record count when I use TERMSTR=CLRF to read the file. What version of SAS are you using? I don't think TERMSTR work with SAS 9.1.
Hi,
I am using 9.3. The easiest way to identify a problem record is too look at the first two columns (application number and program administrator). An invalid record will show up as a "message". Some problem records are in rows (as displayed by notepad++):
10439
19204
19799
24719
24743
For this file you can try to fix it by concatenating lines until the quotes balance. When I try that it reduces the number of lines by 1,168.
This program will replace the extra line breaks with vertical bar '|'. You could use another character or just not put in anything.
filename new temp;
data _null_ ;
if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;
infile 'c:\downloads\WorkingDataSet_1-09-2013.csv' dlm=',' dsd truncover
lrecl=10000 termstr=crlf
firstobs=1
end=eof
;
file new lrecl=10000;
nq=0;
do until (mod(nq,2)=0 or eof );
input;
newn+1;
nq = nq + countc(_infile_,'"');
put _infile_ @;
if mod(nq,2) then do;
missq+1;
put '|' @;
end;
end;
put;
run;
Thank you so much! The text file created by your program worked just right! I was hoping if its not too much trouble, if you could walk me through what your code is doing so I may understand it better.
The IF EOF line is just to produce a report of some summary numbers it has calculated (see the END= option on the INFILE statement). I like to put these type of summary reports at the top of the data step so that it can handle empty files. Also so that the PUT statements for the report are before any FILE statements to insure the report is in the LOG instead of appended to the output file.
The INFILE statement points to the source file. You really only need the LRECL and END options for this program. The others were there from when I was trying to read the file without modifying it. The LRECL is because SAS's defaule linesize is very small (256). You should set the LRECL large enough for handle the incoming data. Go ahead and set it way larger than you actually think you need. The END option creates a variable that will be set to 1 (TRUE) when the end of the infile has been reached.
The FILE statement defines where to write the modified version of the file. Make sure to set the LRECL long enough for the concatenated lines. Check the NOTES that SAS puts in the log at the end of data step. You can set it larger than you need without any issue. (The FILENAME statement is just to make a temporary file for testing. You will probably want to just reference the physical filename in the FILE statement the same as you had in your INFILE statement before.)
NQ is a counter variable to see how many quotes are in the current output line.
The DO loop will run until the quotes are balanced. That is the number of them is even. That is the remainder when divided by 2 is zero.
The INPUT statement without any variables will read in the next line from the input file and store it in the automatic variable _INFILE_.
NEWN+1 will keep a running count of how many lines have been read. This syntax is called a sum statement. It is equivalent to the two statements 'RETAIN NEWN 0' and 'NEWN = SUM(NEWN,1)' .
COUNTC(_INFILE_,'"') will count how many times the double quote character appears in the file. This is added to the current value of NQ.
PUT _INFILE_ @ will write the current line to the output file. The trailing @ will hold the pointer at the end in case we need to append another line.
The IF/THEN block here will check if the quotes are currently unbalanced. If so it increments the MISSQ counter and puts out a vertical bar (|) to mark where the line break used to be.
The last PUT statement writes the end of the line character(s) to the output file to close the line that we left open with the trailing @ sign on the put statements inside of the main DO loop.
Tom,
I meant to reply earlier but got tied up. Thanks so much for your detailed answer.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.