BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
asishgautam
Calcite | Level 5

Hi all,

I have a fairly large data file in csv format.  I am trying to figure out how to read this through the data step and I am having problems.  This is because most variables have variable lengths and also some fields have long comment entries and have what I believe are return lines (carriage returns..?).  After some unsuccessful attempts, I decided to try and read all the data in one column and manually count the length of each variable row by row and retain the longest length returned for each column.  Then I use this information to build a data step input line in a macro to re-read the original csv file.  Things almost work except for the following:

- I am still picking up the extra lines due to some variables having long comment entries spanning multiple lines.  I played around with the "termstr = crlf" option but did not have any luck.

- As constructed, all variables are picked up as text.  Unless someone has any tricks on how to automatically identify certain columns as numeric or date, will I have to do this manually?

  The data file is publicly available at:

http://www.californiasolarstatistics.ca.gov/current_data_files/

(click on "Current Working Data Set").

Program:

filename csi_in "C:\Workdata\Data\CSI\WorkingDataSet_1-02-2013.csv" ;

data test_b1 (keep = junk test) test_b1b (keep = Id C_name) ;

    attrib C_name length = $32 ;

    infile csi_in obs = 1 length = recl lrecl = 5000 ;

    input @ ; /* hold line and get size of line */

    input junk $varying5000. recl ;

    test = countw(junk,',') ; /* get # of columns */

    call symput('varl',test) ; /* store column count in global macro varialbe */

    output test_b1 ;

    do Id = 1 to test ;

        /* build list of column names - strip spaces & other characters not used for column naming */

        C_name = compress(compress(scan(junk,Id,','),'()-/#%','p')) ;

        output test_b1b ;

    end ;

run ;

%put &varl ;

data test_b2 (keep = length_test: test) test_b2b (keep = Id C_length) ;

    array length_test(&varl) ;

    retain length_test ;

    infile csi_in firstobs = 2 length = recl lrecl = 800 end = eof /* termstr = crlf */ ;

    input @ ; /* hold line and get size of line */

    input junk $varying800. recl ;

    test = countw(junk,',','mq') ; /* get count of columns for current row */

    /* delete rows having comment entries - these are from prior row */

    if trim(scan(junk,2,',','mq')) not in ('PG&E','GRID Alternatives','CCSE','SCE') then delete ;

    if eof then do ;

        do Id = 1 to &varl ; /* store max length of column */

            C_length = max(1,length_test(Id)) ;

            output test_b2b ;

        end ;

    end ;

    do i = 1 to test ;

        temp_length = lengthc(scan(junk,i,',','mq')) ; /* get length of the value stored in variable */

        if _n_ = 1 then do ; /* if the first row - initalize */

            length_test(i) = temp_length ;

        end ; /* if column was not picked up before - update */

        else if ((length_test(i) = .) and (temp_length ne 0)) then do ;

            length_test(i) = temp_length ;

        end ; /*update if current length is bigger than prior lengths */

        else if (length_test(i) lt temp_length) then do ;

            length_test(i) = temp_length ;

        end ;

    end ;

    output test_b2 ;

run ;

proc sql  ;

    update Test_b1b /* manual fix - column names cannot start with a number */

    set C_name = 'TrdPartyOwner'

    where C_name = '3rdPartyOwner' ;

    create table Colm_input_att as /* create attribute of colmn: column name colon modifier character type length */

    select a1.Id,cat(a1.C_name,' : $',a2.C_length,'.') as Inf_c

           length = 45

    from test_b1b as a1 natural inner join test_b2b as a2 ;

quit ;

%macro read_csi2(col_c,file_r_n) ;

    proc sql ; /* store column attribute in macro variable */

        select Inf_c

        into :col_att separated by '/'

        from Colm_input_att ;

    quit ;

    %let j = 1 ;

    data test_csi ;

        infile &file_r_n firstobs = 2 dsd missover termstr = crlf ;

        input

        %do %until(%scan(&col_att,&j,'/') eq ) ;

            %let temp_col_att = %scan(&col_att,&j,'/') ;

            &temp_col_att

            %let j = %eval(&j+1) ;

        %end ;

        ;

    run ;

%mend read_csi2 ;

%read_csi2(&varl,csi_in)

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

For this file you can try to fix it by concatenating lines until the quotes balance.  When I try that it reduces the number of lines by 1,168.

This program will replace the extra line breaks with vertical bar '|'.  You could use another character or just not put in anything.

filename new temp;

data _null_ ;

  if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;

  infile 'c:\downloads\WorkingDataSet_1-09-2013.csv'  dlm=',' dsd truncover

    lrecl=10000  termstr=crlf

    firstobs=1

    end=eof

  ;

  file new lrecl=10000;

  nq=0;

  do until (mod(nq,2)=0 or eof );

     input;

     newn+1;

     nq = nq + countc(_infile_,'"');

     put _infile_ @;

     if mod(nq,2) then do;

       missq+1;

       put '|' @;

     end;

  end;

  put;

run;

View solution in original post

6 REPLIES 6
data_null__
Jade | Level 19

Can you identify a record number or text string that can be used to identify one or more of the problem records?

This is what I get for record count when I use TERMSTR=CLRF to read the file.  What version of SAS are you using?  I don't think TERMSTR work with SAS 9.1.

18         filename FT59F001 "~/WorkingDataSet_1-09-2013.csv" lrecl=32767;
19         data _null_;
20            infile ft59f001 termstr=crlf;
21            input;
22            run;

NOTE:
The infile FT59F001 is:
      (system-specific pathname),
      (system-specific file attributes)

NOTE:
127353 records were read from the infile (system-specific pathname).
      The minimum record length was
0.
      The maximum record length was
2652.
NOTE: DATA statement used (Total process time):
      real time          
0.58 seconds
      cpu time           
0.38 seconds
And here are the offending records....
18         filename FT59F001 "~/WorkingDataSet_1-09-2013.csv" lrecl=32767;
19         data _null_;
20            infile ft59f001 termstr=crlf;
21            input;
22            if index(_infile_,'0a'x) then list;
23            run;

NOTE:
The infile FT59F001 is:
      (system-specific pathname),
      (system-specific file attributes)

RULE:     ----+----
1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                     
30845     PGE-MASHTR2-00212,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,258980.0,27602
    
101  3.3,,70.506,0.93466,65.899,Incentive Claim Request Review,,,Residential,,Residential,Salinas,Montere
    
201  y,CA,93907,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-09-14,,,,,,,,," DICK EMARD ELEC

     301  TRIC INC.DBA EMARD ELECTRIC"
,794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,
    ZONE 
5544244404442444542444455442233333322244442444542444455442444044424445424444554422662222222222222222
    NUMR  429309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCC
    
401  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,258980.0,, 467
31014     PGE-MASHTR2-00214,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,404273.0,77595
    
101  3.0,,101.574,0.97459,98.994,Incentive Claim Request Review,,,Residential,,Residential,Greenfield,Mon
    
201  terey,CA,93921,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,,,,,,,," DICK EMARD

     301  ELECTRIC INC.DBA EMARD ELECTRIC"
,794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,
    ZONE 
4444554424440444244454244445544223333332224444244454244445544244404442444542444455442266222222222222
    NUMR  5C53429309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCC
    
401  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,404273.0,, 471
31386     PGE-MASHTR2-00211,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,57677.0,97779.
    
101  89,,16.201,0.90833,14.716,Completed,,,Residential,,Residential,Monterey,Monterey,CA,93940,,,2010-12-
    
201  01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,2012-04-07,,2012-04-13,,,,," DICK EMARD ELECT
2                                                          The SAS System                           07:35 Thursday, January 10, 2013

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0                     

     301  RIC INC.DBA EMARD ELECTRIC"
,794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,,
    ZONE 
5442444044424445424444554422333333222444424445424444554424440444244454244445544226622222222222222222
    NUMR  29309E3A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCCC
    
401  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Completed/PBI-In Payment,,,,,57677.0,, 483
31387     PGE-MASHTR2-00213,PG&E,Multifamily Affordable Solar Housing,MASH Track 2,EPBB,Track 2,41474.0,,,10.9
    
101  87,0.96049,10.553,Incentive Claim Request Review,,,Residential,,Residential,Monterey,Monterey,CA,939
    
201  40,,,2010-12-01,,,,2010-12-01,,2010-12-01,,,2010-12-01,,2011-06-02,,,,,,,,," DICK EMARD ELECTRIC INC

     301  .DBA EMARD ELECTRIC"
,794007," DICK EMARD ELECTRIC INC.DBA EMARD ELECTRIC",no,,,,,,,,,,,,,,,,,,,,,,,,
    ZONE 
0444244454244445544223333332224444244454244445544244404442444542444455442266222222222222222222222222
    NUMR  A42105D12405C5342932C794007C20493B05D12405C53429309E3A42105D12405C5342932CEFCCCCCCCCCCCCCCCCCCCCCCCC
    
401  ,,,,,,,,,,,,,,,,,,,,,,,,,,,,Installed,Active,,,,,41474.0,, 458
NOTE:
127353 records were read from the infile (system-specific pathname).
      The minimum record length was
0.
      The maximum record length was
2652.
NOTE: DATA statement used (Total process time):
      real time          
0.59 seconds
      cpu time           
0.56 seconds
asishgautam
Calcite | Level 5

Hi,

I am using 9.3.  The easiest way to identify a problem record is too look at the first two columns (application number and program administrator).  An invalid record will show up as a "message".  Some problem records are in rows (as displayed by notepad++):

  10439

  19204

  19799

  24719

  24743

Tom
Super User Tom
Super User

For this file you can try to fix it by concatenating lines until the quotes balance.  When I try that it reduces the number of lines by 1,168.

This program will replace the extra line breaks with vertical bar '|'.  You could use another character or just not put in anything.

filename new temp;

data _null_ ;

  if eof then put 'NOTE: Records read=' newn 'Records with missing quotes=' missq ;

  infile 'c:\downloads\WorkingDataSet_1-09-2013.csv'  dlm=',' dsd truncover

    lrecl=10000  termstr=crlf

    firstobs=1

    end=eof

  ;

  file new lrecl=10000;

  nq=0;

  do until (mod(nq,2)=0 or eof );

     input;

     newn+1;

     nq = nq + countc(_infile_,'"');

     put _infile_ @;

     if mod(nq,2) then do;

       missq+1;

       put '|' @;

     end;

  end;

  put;

run;

asishgautam
Calcite | Level 5

Thank you so much!  The text file created by your program worked just right!  I was hoping if its not too much trouble, if you could walk me through what your code is doing so I may understand it better.

Tom
Super User Tom
Super User

The IF EOF  line is just to produce a report of some summary numbers it has calculated (see the END= option on the INFILE statement). I like to put these type of summary reports at the top of the data step so that it can handle empty files. Also so that the PUT statements for the report are before any FILE statements to insure the report is in the LOG instead of appended to the output file.

The INFILE statement points to the source file.  You really only need the LRECL and END options for this program. The others were there from when I was trying to read the file without modifying it.  The LRECL is because SAS's defaule linesize is very small (256).  You should set the LRECL large enough for handle the incoming data. Go ahead and set it way larger than you actually think you need.  The END option creates a variable that will be set to 1 (TRUE) when the end of the infile has been reached.

The FILE statement defines where to write the modified version of the file. Make sure to set the LRECL long enough for the concatenated lines. Check the NOTES that SAS puts in the log at the end of data step.  You can set it larger than you need without any issue.  (The FILENAME statement is just to make a temporary file for testing. You will probably want to just reference the physical filename in the FILE statement the same as you had in your INFILE statement before.)

NQ is a counter variable to see how many quotes are in the current output line. 

The DO loop will run until the quotes are balanced. That is the number of them is even.  That is the remainder when divided by 2 is zero.

The INPUT statement without any variables will read in the next line from the input file and store it in the automatic variable _INFILE_.

NEWN+1 will keep a running count of how many lines have been read.  This syntax is called a sum statement. It is equivalent to the two statements 'RETAIN NEWN 0' and 'NEWN = SUM(NEWN,1)' .

COUNTC(_INFILE_,'"') will count how many times the double quote character appears in the file.  This is added to the current value of NQ.

PUT _INFILE_ @ will write the current line to the output file. The trailing @ will hold the pointer at the end in case we need to append another line.

The IF/THEN block here will check if the quotes are currently unbalanced. If so it increments the MISSQ counter and puts out a vertical bar (|) to mark where the line break used to be. 

The last PUT statement writes the end of the line character(s) to the output file to close the line that we left open with the trailing @ sign on the put statements inside of the main DO loop.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 5638 views
  • 0 likes
  • 3 in conversation