devon59
Calcite | Level 5

Hi,

I have raw data in which some records wrap onto a second line (like below). The third line below is the continuation of the address for the record on line 2.

 

When I import the data using the import wizard in EG, it is read correctly.  However, when I use "Generalize Import Step to run outside of EG", the generated code doesn't work: the third line below is read as a third record, and the second record ends up with invalid information.

 

Is there anything I can do in the Base SAS import code to have this read correctly?

 

Thank you

 

288,10006S,Mrs.,J,Guildford,,I,No,No,XXXXXX Street,Bridgetown,NS,B0S 1C0,CA
289,10008S,Mr.,B,Len,,I,No,No,"200 Forties Rd
PO Box 128",New Ross,NS,B0J 2M0,CA

17 REPLIES
ballardw
Super User

The EG import wizard should have generated appropriate code that can be exported from EG or possibly found in the log. Did you import from an actual CSV file or from Excel? The symptom you describe is common when people use Alt-Enter to create a multi-line entry in an Excel cell.

 

You should show the code you have attempted.

 

devon59
Calcite | Level 5

Hi,

Thank you for your reply.  Below is the code generated by EG.  I changed the infile path and dataset name to valid values, and it still did not work.

 

DATA WORK.'AZ_CRC FILE AUDIT _ CONSTIT_0000'n;
    LENGTH
        consid             8
        cnbio_id         $ 9
        title            $ 27
        fname            $ 49
        lname            $ 50
        company          $ 60
        keyind           $ 1
        deceased         $ 3
        novalidadd       $ 3
        address          $ 150
        city             $ 49
        province         $ 3
        postal           $ 11
        country          $ 3 ;
    LABEL
        consid           = "Constituent_ConstituentID"
        cnbio_id         = "ConID"
        title            = "Title"
        fname            = "FirstName"
        lname            = "LastName"
        company          = "OrgName"
        keyind           = "KeyIndicator"
        deceased         = "Deceased"
        novalidadd       = "HasNoValidAddresses"
        address          = "AddressBlock"
        city             = "City"
        province         = "Province"
        postal           = "PostalCode"
        country          = "Country" ;
    FORMAT
        consid           BEST7.
        cnbio_id         $CHAR9.
        title            $CHAR50.
        fname            $CHAR50.
        lname            $CHAR79.
        company          $CHAR60.
        keyind           $CHAR1.
        deceased         $CHAR3.
        novalidadd       $CHAR3.
        address          $CHAR150.
        city             $CHAR49.
        province         $CHAR3.
        postal           $CHAR12.
        country          $CHAR5. ;
    INFORMAT
        consid           BEST6.
        cnbio_id         $CHAR9.
        title            $CHAR27.
        fname            $CHAR49.
        lname            $CHAR50.
        company          $CHAR60.
        keyind           $CHAR1.
        deceased         $CHAR3.
        novalidadd       $CHAR3.
        address          $CHAR150.
        city             $CHAR49.
        province         $CHAR3.
        postal           $CHAR11.
        country          $CHAR3. ;
    INFILE 'C:\Users\bsimmons\AppData\Local\Temp\SEG9128\AZ-CRC File Audit - Constituents-726a9540afd24db2a7e2ae0eb592284d.txt'
        LRECL=245
        ENCODING="WLATIN1"
        TERMSTR=CRLF
        DLM='7F'x
        MISSOVER
        DSD ;
    INPUT
        consid           : ?? BEST6.
        cnbio_id         : $CHAR9.
        title            : $CHAR27.
        fname            : $CHAR49.
        lname            : $CHAR50.
        company          : $CHAR60.
        keyind           : $CHAR1.
        deceased         : $CHAR3.
        novalidadd       : $CHAR3.
        address          : $CHAR150.
        city             : $CHAR49.
        province         : $CHAR3.
        postal           : $CHAR11.
        country          : $CHAR3. ;
RUN;
Ksharp
Super User
data have;
    length temp $ 8000;
    retain temp;
    input;
    /* Accumulate raw lines until the combined text contains 13 commas */
    /* (one complete 14-field record), then split it with SCAN.        */
    temp=cats(temp,_infile_);
    if countc(temp,',')=13 then do;
        var1=scan(temp,1,',');
        var2=scan(temp,2,',');
        /*
        var3=.......
        var4=.....
        */
        output;
        call missing(temp);
    end;
cards;
288,10006S,Mrs.,J,Guildford,,I,No,No,XXXXXX Street,Bridgetown,NS,B0S 1C0,CA
289,10008S,Mr.,B,Len,,I,No,No,"200 Forties Rd
PO Box 128",New Ross,NS,B0J 2M0,CA
;
run;
devon59
Calcite | Level 5

Hi,

Thank you so much!  This works.  However, I need a little clarification, since I am not that familiar with the CARDS statement.

 

If I actually read this data from a CSV file, where and how do I reference that CSV file in this code?

 

Thank you again

Ksharp
Super User

Sure.

data have;
    infile 'c:\temp\temp.csv';
    length temp $ 8000;
    retain temp;
    input;
    temp=cats(temp,_infile_);
    if countc(temp,',')=13 then do;
        var1=scan(temp,1,',');
        var2=scan(temp,2,',');
        /*
        var3=.......
        var4=.....
        */
        output;
        call missing(temp);
    end;
run;
devon59
Calcite | Level 5

Thank you so much again for the quick reply.  I am almost there.  However, for some reason, this code reads only 13 records from the CSV file.  Why is that?  The actual number of records in the file is 650k.

 

Also, how do I remove the double quotes ("") around each variable?

 

Thank you!

ballardw
Super User

@devon59 wrote:

Thank you so much again for the quick reply.  I am almost there.  However, for some reason, this code reads only 13 records from the CSV file.  Why is that?  The actual number of records in the file is 650k.

 

Also, how do I remove the double quotes ("") around each variable?

 

Thank you!


Copy your code and the notes from the log, and paste them into a code box opened on the forum using the {I} menu icon.

I suspect something other than just the INFILE, INFORMAT, FORMAT, and INPUT statements is involved; something with an IF, perhaps.

devon59
Calcite | Level 5
data have;
infile 'C:\temp.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
length temp $ 8000;
retain temp;
input ;
if _n_ <= 20 then list ;
temp=cats(temp,_infile_);
if COUNTW(temp,',','mq') >= 14 then do;
var1=scan(temp,1,',');
var2=scan(temp,2,',');
var3=scan(temp,3,',');
var4=scan(temp,4,',');
var5=scan(temp,5,',');
var6=scan(temp,6,',');
var7=scan(temp,7,',');
var8=scan(temp,8,',');
var9=scan(temp,9,',');
var10=scan(temp,10,',');
var11=scan(temp,11,',');
var12=scan(temp,12,',');
var13=scan(temp,13,',');
/*
var3=.......
var4=.....
*/

output;
call missing(temp);
end;
run;


363  data have;
364  infile 'C:\temp.csv' delimiter = ','
364! MISSOVER DSD lrecl=32767 firstobs=2 ;
365  length temp $ 8000;
366  retain temp;
367  input ;
368  if _n_ <= 20 then list ;
369  temp=cats(temp,_infile_);
370  if COUNTW(temp,',','mq') >= 14 then do;
371  var1=scan(temp,1,',');
372  var2=scan(temp,2,',');
373  var3=scan(temp,3,',');
374  var4=scan(temp,4,',');
375  var5=scan(temp,5,',');
376  var6=scan(temp,6,',');
377  var7=scan(temp,7,',');
378  var8=scan(temp,8,',');
379  var9=scan(temp,9,',');
380  var10=scan(temp,10,',');
381  var11=scan(temp,11,',');
382  var12=scan(temp,12,',');
383  var13=scan(temp,13,',');
384  /*
385  var3=.......
386  var4=.....
387  */
388
389  output;
390  call missing(temp);
391  end;
392  run;

NOTE: The infile 'C:\temp.csv' is:
      Filename=C:\temp.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=69042694,
      Last Modified=27Jul2018:10:15:14,
      Create Time=27Jul2018:10:15:01

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
2         "282","10000S","Mr.","Donald","Mundle","","I","No","No","PO Box 183","Pugwash","NS","
      86  B0K 1L0","CA" 98
3         "283","10001S","Mr.","Florence","Raney","","I","No","No","RR 1","Lower L'ardoise","NS
      86  ","B0E 1W0","CA" 101
4         "284","10002S","Miss","Kaye","Martin","","I","No","No","PO Box 166","Whycocomagh","NS
      86  ","B0E 3M0","CA" 101
5         "285","10003S","Mr.","Grant","Goodine","","I","No","No","PO Box 670","Bridgetown","NS
      86  ","B0S 1C0","CA" 101
6         "286","10004S","Mrs.","Myrtle","Macumber","","I","No","No","RR 1","Walton","NS","B0N
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
      86  2R0","CA" 94
7         "287","10005S","Ms.","Daisy","MacPhee","","I","No","No","PO Box 28","Shubenacadie","N
      86  S","B0N 2H0","CA" 102
8         "288","10006S","Mrs.","Joyce","Guild","","I","No","No","Jeffery Street","Bridgetown",
      86  "NS","B0S 1C0","CA" 104
9         "289","10008S","Mr.","Basil","Lenihan","","I","No","No","200 Forties Rd 71
10        PO Box 128","New Ross","NS","B0J 2M0","CA" 42
11        "290","10009S","Mr.","George","Melanson","","I","No","Yes","7093 Hwy 101","Plympton",
      86  "NS","B0W 2R0","CA" 104
12        "291","1000S","Mr.","Gregory","Dugas","","I","No","No","General Delivery","Alder Poin
      86  t","NS","B0C 1A0","CA" 107
13        "292","10010S","Mr.","Kenneth","d'Entremont","","I","No","No","PO Box 127","Lowere We
      86  st Pubnico","NS","B0W 2C0","CA" 116
14        "293","10011S","","","","Weymouth Drug Store Ltd.","O","No","No","PO Box 119","Weymou
      86  th","NS","B0W 3T0","CA" 108
15        "295","10012S","Mrs.","Lydia","Brewer","","I","No","No","PO Box 688","Bridgetown","NS
      86  ","B0S 1C0","CA" 101
16        "296","10013S","Mrs.","Gladys","McLaughlin","","I","No","No","Centennial Apt.21, Bo",
      86  "Bridgetown","NS","B0S 1C0","CA" 117
17        "297","10014S","Ms.","Jane","Wightman","","I","No","No","64 Granville St W","Bridgeto
      86  wn","NS","B0S 1C0","CA" 108
18        "298","10016S","Ms.","Anna","MacCarthy","","I","No","No","PO Box 1006","Windsor","NS"
      86  ,"B0N 2T0","CA" 100
19        "299","10017S","Ms.","Kathleen","Barr","","I","No","No","RR 2","Weymouth","NS","B0W 3
      86  T0","CA" 93
20        "300","10019S","Mrs.","Elizabeth","Gordon","","I","No","No","RR 1 Po","Milford Statio
      86  n","NS","B0N 1Y0","CA" 107
21        "301","1001S","Mrs.","Wong","Lee","","I","No","No","6264 North Street","Halifax","NS"
      86  ,"B3L 1P5","CA" 100
NOTE: 654273 records were read from the infile 'C:\temp.csv'.
      The minimum record length was 0.
      The maximum record length was 223.
NOTE: The data set WORK.HAVE has 112514 observations and 14 variables.
NOTE: DATA statement used (Total process time):
      real time           1:39.06
      cpu time            22.93 seconds


Tom
Super User

Search on this forum as this question has been answered many times already.

Here is a synopsis of the best solutions, in order of preference.

1) Fix it at the source if you can. That is, have whoever gave you the file with embedded line breaks create a version that does not have them.

2) See if SAS can handle it anyway.  If you are lucky, the lines are terminated with CR+LF characters and the embedded line breaks use just LF (or just CR).  If so, adding the TERMSTR=CRLF option to your INFILE statement will let SAS read the file without any extra work.

3) Run a data step to copy the original file to a new, fixed file.  This is where you might use the COUNTW() trick to figure out when you have embedded line breaks and remove them. Another method is to count the quotes on each line: if a line break occurs while the number of quotes seen so far is odd, remove it (see the sketch just below). Once you have a fixed version of the file with only one line per observation, read it normally with a data step, or even let PROC IMPORT try to guess the variable names and types for you.
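
A minimal sketch of the quote-counting idea in option 3, assuming the broken file is 'C:\temp.csv' and the repaired copy goes to 'C:\temp_fixed.csv' (both paths are placeholders, not from the original post):

data _null_;
    infile 'C:\temp.csv' lrecl=32767;
    file 'C:\temp_fixed.csv' lrecl=32767;
    length buffer $32767;
    retain buffer ' ' nquotes 0;
    input;
    /* append the raw line and keep a running count of quote characters */
    buffer  = catx(' ', buffer, _infile_);
    nquotes = nquotes + countc(_infile_, '"');
    /* an even quote count means every quoted field is closed, so the  */
    /* accumulated text is one complete record: write it out and reset */
    if mod(nquotes, 2) = 0 then do;
        put buffer;
        buffer  = ' ';
        nquotes = 0;
    end;
run;

Once the fixed file exists, the EG-generated data step (or PROC IMPORT) can read it as an ordinary one-record-per-line CSV.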

 

 

Tom
Super User

Looks promising.  You should probably use the COUNTW() function instead of the COUNTC() function, since then you can add the Q and M modifiers to handle quoted delimiters and empty fields properly.
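
For illustration only, a small sketch with a made-up record (the comma inside the quoted address and the empty field are placeholders, there just to show the difference between the two functions):

data _null_;
    /* hypothetical record: one empty field and a comma inside a quoted field */
    line   = '288,10006S,Mrs.,J,Guildford,,I,No,No,"Forties Rd, Box 128",Bridgetown,NS,B0S 1C0,CA';
    commas = countc(line, ',');        /* 14 commas, although only 13 of them are field delimiters       */
    fields = countw(line, ',', 'mq');  /* 14 fields: q ignores the quoted comma, m counts the empty field */
    noq    = countw(line, ',', 'q');   /* 13: without the m modifier the empty field is skipped           */
    put commas= fields= noq=;
run;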

devon59
Calcite | Level 5

Hi,

I still cannot figure out why the code reads only 13 records.  Could you please help?  How can I read the whole CSV file of 650k records with this code?

 

Thank you

Tom
Super User

No idea.  Show us your SAS log.

 

Copy the relevant lines from your log and paste them into the forum using the window that pops up when you click the {i} icon on the menu bar.

 

devon59
Calcite | Level 5

There's no error in the log.  It just says 654273 records were read, but the output shows 13 records.

 

Thank you

 

334 data have;
335 infile 'C:\temp.csv' delimiter = ','
335! MISSOVER DSD lrecl=32767 firstobs=2 ;
336 length temp $ 8000;
337 retain temp;
338 input ;
339 temp=cats(temp,_infile_);
340 if countc(temp,',')=13 then do;
341 var1=scan(temp,1,',');
342 var2=scan(temp,2,',');
343 var3=scan(temp,3,',');
344 var4=scan(temp,4,',');
345 var5=scan(temp,5,',');
346 var6=scan(temp,6,',');
347 var7=scan(temp,7,',');
348 var8=scan(temp,8,',');
349 var9=scan(temp,9,',');
350 var10=scan(temp,10,',');
351 var11=scan(temp,11,',');
352 var12=scan(temp,12,',');
353 var13=scan(temp,13,',');
354 /*
355 var3=.......
356 var4=.....
357 */
358
359 output;
360 call missing(temp);
361 end;
362 run;

NOTE: The infile 'C:\temp.csv' is:
      Filename=C:\temp.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=69042694,
      Last Modified=27Jul2018:10:15:14,
      Create Time=27Jul2018:10:15:01

NOTE: 654273 records were read from the infile 'C:\temp.csv'.
      The minimum record length was 0.
      The maximum record length was 223.
NOTE: The data set WORK.HAVE has 13 observations and 14 variables.
NOTE: DATA statement used (Total process time):
      real time           11.60 seconds
      cpu time            11.54 seconds

Tom
Super User

Sounds like you rarely generated lines with 13 commas in them.  A line with 13 commas would have 14 fields on it.  Perhaps 13 was not the right number to look for?

 

As I said before, it is probably better to count the number of values on the line instead of just the number of delimiter characters. 

So try using COUNTW(temp,',','mq') >= 14 .

 

Since we don't know what your text file looks like, you will need to debug this.  One thing you can do is tell your existing data step to dump some information to the log.  If you add this line just after the INPUT statement, you should see the first 20 lines of the file in the log.

if _n_ <= 20 then list ;
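
For reference, the same idea as a standalone step (the path is a placeholder): it echoes the first 20 raw lines of the file to the log and then stops.

data _null_;
    infile 'C:\temp.csv' lrecl=32767;
    input;
    if _n_ <= 20 then list;   /* write the current raw input line to the log  */
    else stop;                /* no need to read the rest of the 650k lines   */
run;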
