- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I have a fairly large data file in csv format. I am trying to figure out how to read this through the data step and I am having problems. This is because most variables have variable lengths and also some fields have long comment entries and have what I believe are return lines (carriage returns..?). After some unsuccessful attempts, I decided to try and read all the data in one column and manually count the length of each variable row by row and retain the longest length returned for each column. Then I use this information to build a data step input line in a macro to re-read the original csv file. Things almost work except for the following:
- I am still picking up the extra lines due to some variables having long comment entries spanning multiple lines. I played around with the "termstr = crlf" option but did not have any luck.
- As constructed, all variables are picked up as text. Unless someone has any tricks on how to automatically identify certain columns as numeric or date, will I have to do this manually?
The data file is publicly available at:
http://www.californiasolarstatistics.ca.gov/current_data_files/
(click on "Current Working Data Set").
Program:
filename csi_in "C:\Workdata\Data\CSI\WorkingDataSet_1-02-2013.csv" ;
data test_b1 (keep = junk test) test_b1b (keep = Id C_name) ;
attrib C_name length = $32 ;
infile csi_in obs = 1 length = recl lrecl = 5000 ;
input @ ; /* hold line and get size of line */
input junk $varying5000. recl ;
test = countw(junk,',') ; /* get # of columns */
call symput('varl',test) ; /* store column count in global macro varialbe */
output test_b1 ;
do Id = 1 to test ;
/* build list of column names - strip spaces & other characters not used for column naming */
C_name = compress(compress(scan(junk,Id,','),'()-/#%','p')) ;
output test_b1b ;
end ;
run ;
%put &varl ;
data test_b2 (keep = length_test: test) test_b2b (keep = Id C_length) ;
array length_test(&varl) ;
retain length_test ;
infile csi_in firstobs = 2 length = recl lrecl = 800 end = eof /* termstr = crlf */ ;
input @ ; /* hold line and get size of line */
input junk $varying800. recl ;
test = countw(junk,',','mq') ; /* get count of columns for current row */
/* delete rows having comment entries - these are from prior row */
if trim(scan(junk,2,',','mq')) not in ('PG&E','GRID Alternatives','CCSE','SCE') then delete ;
if eof then do ;
do Id = 1 to &varl ; /* store max length of column */
C_length = max(1,length_test(Id)) ;
output test_b2b ;
end ;
end ;
do i = 1 to test ;
temp_length = lengthc(scan(junk,i,',','mq')) ; /* get length of the value stored in variable */
if _n_ = 1 then do ; /* if the first row - initalize */
length_test(i) = temp_length ;
end ; /* if column was not picked up before - update */
else if ((length_test(i) = .) and (temp_length ne 0)) then do ;
length_test(i) = temp_length ;
end ; /*update if current length is bigger than prior lengths */
else if (length_test(i) lt temp_length) then do ;
length_test(i) = temp_length ;
end ;
end ;
output test_b2 ;
run ;
proc sql ;
update Test_b1b /* manual fix - column names cannot start with a number */
set C_name = 'TrdPartyOwner'
where C_name = '3rdPartyOwner' ;
create table Colm_input_att as /* create attribute of colmn: column name colon modifier character type length */
select a1.Id,cat(a1.C_name,' : $',a2.C_length,'.') as Inf_c
length = 45
from test_b1b as a1 natural inner join test_b2b as a2 ;
quit ;
%macro read_csi2(col_c,file_r_n) ;
proc sql ; /* store column attribute in macro variable */
select Inf_c
into :col_att separated by '/'
from Colm_input_att ;
quit ;
%let j = 1 ;
data test_csi ;
infile &file_r_n firstobs = 2 dsd missover termstr = crlf ;
input
%do %until(%scan(&col_att,&j,'/') eq ) ;
%let temp_col_att = %scan(&col_att,&j,'/') ;
&temp_col_att
%let j = %eval(&j+1) ;
%end ;
;
run ;
%mend read_csi2 ;
%read_csi2(&varl,csi_in)
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For this file you can try to fix it by concatenating lines until the quotes balance. When I try that it reduces the number of lines by 1,168.
This program will replace the extra line breaks with vertical bar '|'. You could use another character or just not put in anything.
filename new temp;
data _null_ ;
if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;
infile 'c:\downloads\WorkingDataSet_1-09-2013.csv' dlm=',' dsd truncover
lrecl=10000 termstr=crlf
firstobs=1
end=eof
;
file new lrecl=10000;
nq=0;
do until (mod(nq,2)=0 or eof );
input;
newn+1;
nq = nq + countc(_infile_,'"');
put _infile_ @;
if mod(nq,2) then do;
missq+1;
put '|' @;
end;
end;
put;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you identify a record number or text string that can be used to identify one or more of the problem records?
This is what I get for record count when I use TERMSTR=CLRF to read the file. What version of SAS are you using? I don't think TERMSTR work with SAS 9.1.
19 data _null_;
20 infile ft59f001 termstr=crlf;
21 input;
22 run;
NOTE: The infile FT59F001 is:
(system-specific pathname),
(system-specific file attributes)
NOTE: 127353 records were read from the infile (system-specific pathname).
The minimum record length was 0.
The maximum record length was 2652.
NOTE: DATA statement used (Total process time):
real time 0.58 seconds
cpu time 0.38 seconds
19 data _null_;
20 infile ft59f001 termstr=crlf;
21 input;
22 if index(_infile_,'0a'x) then list;
23 run;
NOTE: The infile FT59F001 is:
(system-specific pathname),
(system-specific file attributes)
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
30845 PGE-MASHTR2-00212,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,258980.0,27602
101 3.3,,70.506,0.93466,65.899,Incentive Claim Request Review,,,Residential,,Residential,Salinas,Montere
201 y,CA,93907,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-09-14,,,,,,,,," DICK EMARD ELEC
301 TRIC INC.DBA EMARD ELECTRIC",794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,
ZONE 5544244404442444542444455442233333322244442444542444455442444044424445424444554422662222222222222222
NUMR 429309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCC
401 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,258980.0,, 467
31014 PGE-MASHTR2-00214,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,404273.0,77595
101 3.0,,101.574,0.97459,98.994,Incentive Claim Request Review,,,Residential,,Residential,Greenfield,Mon
201 terey,CA,93921,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,,,,,,,," DICK EMARD
301 ELECTRIC INC.DBA EMARD ELECTRIC",794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,
ZONE 4444554424440444244454244445544223333332224444244454244445544244404442444542444455442266222222222222
NUMR 5C53429309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCC
401 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,404273.0,, 471
31386 PGE-MASHTR2-00211,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,57677.0,97779.
101 89,,16.201,0.90833,14.716,Completed,,,Residential,,Residential,Monterey,Monterey,CA,93940,,,2010-12-
201 01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,2012-04-07,,2012-04-13,,,,," DICK EMARD ELECT
2 The SAS System 07:35 Thursday, January 10, 2013
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
301 RIC INC.DBA EMARD ELECTRIC",794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,,
ZONE 5442444044424445424444554422333333222444424445424444554424440444244454244445544226622222222222222222
NUMR 29309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCCC
401 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Completed/PBI-In Payment,,,,,57677.0,, 483
31387 PGE-MASHTR2-00213,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,41474.0,,,10.9
101 87,0.96049,10.553,Incentive Claim Request Review,,,Residential,,Residential,Monterey,Monterey,CA,939
201 40,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,,,,,,,," DICK EMARD ELECTRIC INC
301 .DBA EMARD ELECTRIC",794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,,,,,,,,,
ZONE 0444244454244445544223333332224444244454244445544244404442444542444455442266222222222222222222222222
NUMR A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCCCCCCCCCC
401 ,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,41474.0,, 458
NOTE: 127353 records were read from the infile (system-specific pathname).
The minimum record length was 0.
The maximum record length was 2652.
NOTE: DATA statement used (Total process time):
real time 0.59 seconds
cpu time 0.56 seconds
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I am using 9.3. The easiest way to identify a problem record is too look at the first two columns (application number and program administrator). An invalid record will show up as a "message". Some problem records are in rows (as displayed by notepad++):
10439
19204
19799
24719
24743
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For this file you can try to fix it by concatenating lines until the quotes balance. When I try that it reduces the number of lines by 1,168.
This program will replace the extra line breaks with vertical bar '|'. You could use another character or just not put in anything.
filename new temp;
data _null_ ;
if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;
infile 'c:\downloads\WorkingDataSet_1-09-2013.csv' dlm=',' dsd truncover
lrecl=10000 termstr=crlf
firstobs=1
end=eof
;
file new lrecl=10000;
nq=0;
do until (mod(nq,2)=0 or eof );
input;
newn+1;
nq = nq + countc(_infile_,'"');
put _infile_ @;
if mod(nq,2) then do;
missq+1;
put '|' @;
end;
end;
put;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you so much! The text file created by your program worked just right! I was hoping if its not too much trouble, if you could walk me through what your code is doing so I may understand it better.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The IF EOF line is just to produce a report of some summary numbers it has calculated (see the END= option on the INFILE statement). I like to put these type of summary reports at the top of the data step so that it can handle empty files. Also so that the PUT statements for the report are before any FILE statements to insure the report is in the LOG instead of appended to the output file.
The INFILE statement points to the source file. You really only need the LRECL and END options for this program. The others were there from when I was trying to read the file without modifying it. The LRECL is because SAS's defaule linesize is very small (256). You should set the LRECL large enough for handle the incoming data. Go ahead and set it way larger than you actually think you need. The END option creates a variable that will be set to 1 (TRUE) when the end of the infile has been reached.
The FILE statement defines where to write the modified version of the file. Make sure to set the LRECL long enough for the concatenated lines. Check the NOTES that SAS puts in the log at the end of data step. You can set it larger than you need without any issue. (The FILENAME statement is just to make a temporary file for testing. You will probably want to just reference the physical filename in the FILE statement the same as you had in your INFILE statement before.)
NQ is a counter variable to see how many quotes are in the current output line.
The DO loop will run until the quotes are balanced. That is the number of them is even. That is the remainder when divided by 2 is zero.
The INPUT statement without any variables will read in the next line from the input file and store it in the automatic variable _INFILE_.
NEWN+1 will keep a running count of how many lines have been read. This syntax is called a sum statement. It is equivalent to the two statements 'RETAIN NEWN 0' and 'NEWN = SUM(NEWN,1)' .
COUNTC(_INFILE_,'"') will count how many times the double quote character appears in the file. This is added to the current value of NQ.
PUT _INFILE_ @ will write the current line to the output file. The trailing @ will hold the pointer at the end in case we need to append another line.
The IF/THEN block here will check if the quotes are currently unbalanced. If so it increments the MISSQ counter and puts out a vertical bar (|) to mark where the line break used to be.
The last PUT statement writes the end of the line character(s) to the output file to close the line that we left open with the trailing @ sign on the put statements inside of the main DO loop.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Tom,
I meant to reply earlier but got tied up. Thanks so much for your detailed answer.