Hi,
I have a raw data that some records move jump to second line (like below). The third line below is the continuation of the address for line 2.
When I import the data using import wizard from EG, I was able to read this correctly. However, when I do a "Generalize Import Step to run outside of EG", the code doesn't work and I still get the third line below read as third record and had invalid info for second record.
Is there anything I could do in the import sas base code to have this read correctly?
Thank you
288,10006S,Mrs.,J,Guildford,,I,No,No,XXXXXX Street,Bridgetown,NS,B0S 1C0,CA
289,10008S,Mr.,B,Len,,I,No,No,"200 Forties Rd
PO Box 128",New Ross,NS,B0J 2M0,CA
The EG import wizard should have generated appropriate code that could be exported from EG or possibly found in the Log. Did you import from actual CSV or Excel? The symptom you show is common for people using the Alt-Enter to create a multiline entry in an Excel cell.
You should show the code you have attempted.
Hi,
Thank you for your reply. Below is the code generated from EG. I changed to infile path and dataset name to a valid name and that did not work.
data have;
length temp $ 8000;
retain temp;
input ;
temp=cats(temp,_infile_);
if countc(temp,',')=13 then do;
var1=scan(temp,1,',');
var2=scan(temp,2,',');
/*
var3=.......
var4=.....
*/
output;
call missing(temp);
end;
cards;
288,10006S,Mrs.,J,Guildford,,I,No,No,XXXXXX Street,Bridgetown,NS,B0S 1C0,CA
289,10008S,Mr.,B,Len,,I,No,No,"200 Forties Rd
PO Box 128",New Ross,NS,B0J 2M0,CA
;
run;
Hi,
Thank you so much! This works. However, I need a little clarification since I am not that familiar with cards.
If I actually read this data from a csv file, where and how do I reference that csv file in this code?
Thank you again
Sure.
data have;
infile 'c:\temp\temp.csv';
length temp $ 8000;
retain temp;
input ;
temp=cats(temp,_infile_);
if countc(temp,',')=13 then do;
var1=scan(temp,1,',');
var2=scan(temp,2,',');
/*
var3=.......
var4=.....
*/
output;
call missing(temp);
end;
run;
Thank you so much again for a quick reply. I am almost there. However, for some reasons, this code only read 13 records from the csv file. Why is that? The actual # of records on the file is 650k.
Also, how do I remove "" around each var?
Thank you!
@devon59 wrote:
Thank you so much again for a quick reply. I am almost there. However, for some reasons, this code only read 13 records from the csv file. Why is that? The actual # of records on the file is 650k.
Also, how do I remove "" around each var?
Thank you!
Copy your code and notes from the log. Paste into a code box opened on the forum using the {I} menu icon.
I suspect something other than just infile, informat, format and input statements. Something involving an IF perhaps.
data have; infile 'C:\temp.csv' delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ; length temp $ 8000; retain temp; input ; if _n_ <= 20 then list ; temp=cats(temp,_infile_); if COUNTW(temp,',','mq') >= 14 then do; var1=scan(temp,1,','); var2=scan(temp,2,','); var3=scan(temp,3,','); var4=scan(temp,4,','); var5=scan(temp,5,','); var6=scan(temp,6,','); var7=scan(temp,7,','); var8=scan(temp,8,','); var9=scan(temp,9,','); var10=scan(temp,10,','); var11=scan(temp,11,','); var12=scan(temp,12,','); var13=scan(temp,13,','); /* var3=....... var4=..... */ output; call missing(temp); end; run; 363 data have; 364 infile 'C:\temp.csv' delimiter = ',' 364! MISSOVER DSD lrecl=32767 firstobs=2 ; 365 length temp $ 8000; 366 retain temp; 367 input ; 368 if _n_ <= 20 then list ; 369 temp=cats(temp,_infile_); 370 if COUNTW(temp,',','mq') >= 14 then do; 371 var1=scan(temp,1,','); 372 var2=scan(temp,2,','); 373 var3=scan(temp,3,','); 374 var4=scan(temp,4,','); 375 var5=scan(temp,5,','); 376 var6=scan(temp,6,','); 377 var7=scan(temp,7,','); 378 var8=scan(temp,8,','); 379 var9=scan(temp,9,','); 380 var10=scan(temp,10,','); 381 var11=scan(temp,11,','); 382 var12=scan(temp,12,','); 383 var13=scan(temp,13,','); 384 /* 385 var3=....... 386 var4=..... 387 */ 388 389 output; 390 call missing(temp); 391 end; 392 run; NOTE: The infile 'C:\temp.csv' is: Filename=C:\temp.csv, RECFM=V,LRECL=32767,File Size (bytes)=69042694, Last Modified=27Jul2018:10:15:14, Create Time=27Jul2018:10:15:01 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+ 2 "282","10000S","Mr.","Donald","Mundle","","I","No","No","PO Box 183","Pugwash","NS"," 86 B0K 1L0","CA" 98 3 "283","10001S","Mr.","Florence","Raney","","I","No","No","RR 1","Lower L'ardoise","NS 86 ","B0E 1W0","CA" 101 4 "284","10002S","Miss","Kaye","Martin","","I","No","No","PO Box 166","Whycocomagh","NS 86 ","B0E 3M0","CA" 101 5 "285","10003S","Mr.","Grant","Goodine","","I","No","No","PO Box 670","Bridgetown","NS 86 ","B0S 1C0","CA" 101 6 "286","10004S","Mrs.","Myrtle","Macumber","","I","No","No","RR 1","Walton","NS","B0N RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+ 86 2R0","CA" 94 7 "287","10005S","Ms.","Daisy","MacPhee","","I","No","No","PO Box 28","Shubenacadie","N 86 S","B0N 2H0","CA" 102 8 "288","10006S","Mrs.","Joyce","Guild","","I","No","No","Jeffery Street","Bridgetown", 86 "NS","B0S 1C0","CA" 104 9 "289","10008S","Mr.","Basil","Lenihan","","I","No","No","200 Forties Rd 71 10 PO Box 128","New Ross","NS","B0J 2M0","CA" 42 11 "290","10009S","Mr.","George","Melanson","","I","No","Yes","7093 Hwy 101","Plympton", 86 "NS","B0W 2R0","CA" 104 12 "291","1000S","Mr.","Gregory","Dugas","","I","No","No","General Delivery","Alder Poin 86 t","NS","B0C 1A0","CA" 107 13 "292","10010S","Mr.","Kenneth","d'Entremont","","I","No","No","PO Box 127","Lowere We 86 st Pubnico","NS","B0W 2C0","CA" 116 14 "293","10011S","","","","Weymouth Drug Store Ltd.","O","No","No","PO Box 119","Weymou 86 th","NS","B0W 3T0","CA" 108 15 "295","10012S","Mrs.","Lydia","Brewer","","I","No","No","PO Box 688","Bridgetown","NS 86 ","B0S 1C0","CA" 101 16 "296","10013S","Mrs.","Gladys","McLaughlin","","I","No","No","Centennial Apt.21, Bo", 86 "Bridgetown","NS","B0S 1C0","CA" 117 17 "297","10014S","Ms.","Jane","Wightman","","I","No","No","64 Granville St W","Bridgeto 86 wn","NS","B0S 1C0","CA" 108 18 "298","10016S","Ms.","Anna","MacCarthy","","I","No","No","PO Box 1006","Windsor","NS" 86 ,"B0N 2T0","CA" 100 19 "299","10017S","Ms.","Kathleen","Barr","","I","No","No","RR 2","Weymouth","NS","B0W 3 86 T0","CA" 93 20 "300","10019S","Mrs.","Elizabeth","Gordon","","I","No","No","RR 1 Po","Milford Statio 86 n","NS","B0N 1Y0","CA" 107 21 "301","1001S","Mrs.","Wong","Lee","","I","No","No","6264 North Street","Halifax","NS" 86 ,"B3L 1P5","CA" 100 NOTE: 654273 records were read from the infile 'C:\temp.csv'. The minimum record length was 0. The maximum record length was 223. NOTE: The data set WORK.HAVE has 112514 observations and 14 variables. NOTE: DATA statement used (Total process time): real time 1:39.06 cpu time 22.93 seconds
Search on this forum as this question has been answered many times already.
Here is a synopsis of best solutions in order or prefernce.
1) Fix it at the source if you can. That is have whoever gave your the file with embedded line breaks create version that does not have the embedded line breaks.
2) See if SAS can handle it anyway. If you are lucky the lines are terminated with CR+LF characters and the embedded line breaks are using just LF (or just CR). If so just adding the TERMSTR=CRLF option to your INFILE statement will let SAS read it without any extra work.
3) Run a data step to copy the original file to a new fixed file. This is where you might try to use the COUNTW() trick to figure out when you have embedded line breaks and remove them. Another method is to count the quotes on the line and if a line break occurs when the number of quotes is odd then remove the line break. Once you have a fixed version of the file with only one line per observation then just read it normally using data step. Or you could even let PROC IMPORT try to guess what the variable names and types are for you.
Looks promising. You should probably use the COUNTW() function instead of the COUNTC() function since then you could add the Q and M modifiers to handle quoted delimiters and empty fields properly.
Hi,
I still could not figure out why the code just read only the first 13 records. Could you please help? How could I read the whole data from csv file with 650k records with this code?
Thank you
No idea. Show us your SAS log.
Copy the relevant lines from your log and paste them into the forum using the window that pops-up when you click on the {i} icon on the menu bar.
there's no error in log. It just 654273 records were read but output showed 13 records.
Thank you
334 data have;
335 infile 'C:\temp.csv' delimiter = ','
335! MISSOVER DSD lrecl=32767 firstobs=2 ;
336 length temp $ 8000;
337 retain temp;
338 input ;
339 temp=cats(temp,_infile_);
340 if countc(temp,',')=13 then do;
341 var1=scan(temp,1,',');
342 var2=scan(temp,2,',');
343 var3=scan(temp,3,',');
344 var4=scan(temp,4,',');
345 var5=scan(temp,5,',');
346 var6=scan(temp,6,',');
347 var7=scan(temp,7,',');
348 var8=scan(temp,8,',');
349 var9=scan(temp,9,',');
350 var10=scan(temp,10,',');
351 var11=scan(temp,11,',');
352 var12=scan(temp,12,',');
353 var13=scan(temp,13,',');
354 /*
355 var3=.......
356 var4=.....
357 */
358
359 output;
360 call missing(temp);
361 end;
362 run;
NOTE: The infile 'C:\temp.csv' is:
Filename=C:\temp.csv,
RECFM=V,LRECL=32767,File Size (bytes)=69042694,
Last Modified=27Jul2018:10:15:14,
Create Time=27Jul2018:10:15:01
NOTE: 654273 records were read from the infile 'C:\temp.csv'.
The minimum record length was 0.
The maximum record length was 223.
NOTE: The data set WORK.HAVE has 13 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 11.60 seconds
cpu time 11.54 seconds
Sounds like you rarely generated lines with 13 commas in them. A line with 13 commas would have 14 fields on it. Perhaps 13 was not the right number to look for?
As I said before it is probably better to count the number of values on the line instead of just the number of delimiter characters.
So try using COUNTW(temp,',','mq') >= 14 .
Since we don't know what your text file looks like you will need to debug this. One thing you could do is tell your existing data step to dump some information to the log. If you add this line just after the INPUT statement then you should see the top 20 lines of the file in the log.
if _n_ <= 20 then list ;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.