hi, below are my codes of importing a text files. One of the value from the name variabes consists of " " quotation mark which i need to output it. But i cant seems to get that output using the ~ function. As the example below, my name variables must read the quotation mark --> Brown, Mrs James Joseph (Margaret "Molly" Tobin) .
example this row :
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
------------------------------------------------------------------------------------------------------
length row_names 3 pclass $3 survived 3 name $82 _age $3 age 3
embarked $3 home_dest $70 room $5 ticket $20 boat $3 sex $10;
infile '/home/yewkeong0/assignemnt/titanic.txt' dsd dlm=',' missover;
input row_names pclass $ survived name ~ $82. _age
embarked $ home_dest $ room $ ticket $ boat $ sex $;
if _age = 'NA' then age = . ;
else age = input(_age , 3.);
drop _age;
run;
any suggestion guys??? =/
If you don't have many of these issues it may be better to just change the "Molly" to 'Molly' to read the file. Then if you really really need double quotes you could replace the ' with ".
Another option would be if the input file was exported from another program to see if there would be if the data could be exported witha different delimiter such as | or even connect using ODBC.
Or use _infile_ and spend hours if not days writing code to parse a file line by line to implement rules that resolve to fields are separated by comma unless the comma occurs between two " but if there is more words without a comma but have more" keep reading.
I wouldn't spend the time.
If you have a lot of files doing this then you need to go to the source and find out why and fix it there. Perhaps have them actually use left and right quotes instead of ".
your row example is:
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
your problem is because there is a comma in the name field which splits it into two variables.
using DLM=',' is not good enough for your data and input method.
The issue is the embedded double quote characters. Like:
"Margaret "Molly" Tobin"
If you want to include the quoting characters into a quoted string they need to be doubled so SAS can tell that they do not mark the end of the string.
"Margaret ""Molly"" Tobin"
Can you get the generator of the file to fix that bug?
You might be able to fix the issue. Probably it would take a human to figure out what was intended for some of the more messy strings. But if you are limited to just this type of issue and only for one particular field you might be able to make a program to fix it. For example you could use the COUNTW() function to see how many fields SAS sees in the line and if it is more than expected then you know there is a quoting issue.
The SCAN() and COUNTW() function can handle those incorrectly quoted fields. So use them to generate a new version of the file that properly doubles up the embedded quotes. This code will do that. It will also get rid of the many unneeded quotes that are in the original. You can then read the converted file using data step or even PROC IMPORT.
data _null_ ;
  infile 'badfile.csv' ;
  file 'goodfile.csv' dsd ;
  length word $200 ;
  input ;
  do i=1 to countw(_infile_,',','qm');
    word = scan(_infile_,i,',','qm');
    if word =: '"' then word=substrn(word,2,length(word)-2);
    put word @;
  end;
  put;
run;
Based on @Tom suggestion, attached is a tested code to read and create desired output:
data want;
       length row_names 3 pclass $3 survived 3 name $82 _age $3 age 3
                 embarked $3 home_dest $70 room $5 ticket $20 boat $3 sex $10; 
      infile cards truncover;
       length a_row $250 word $90 var1-var11 $90;
       a_row = ' ';
       input a_row $250.;
      array varx $ var1 - var11;
     do i=1 to countw(a_row,',','qm');
               word = scan(a_row,i,',','qm');
               if word =: '"' then word=substrn(word,2,length(word)-2);
               varx(i) = word; 
     end ;
 /* 1*/ row_names = input(left(varx(1)),best3.);
 /* 2*/ pclass = left(varx(2));
 /* 3*/ survived = input(left(varx(3)),best3.);
 /* 4*/ name = left(varx(4));
 /* 5*/ _age = left(varx(5));
       if _age = 'NA' then age = . ;
       else age = input(_age , 3.);
 
 /* 6*/ embarked = left(varx(6));
 /* 7*/ home_dest = left(varx(7));
 /* 8*/ room = left(varx(8));
 /* 9*/ ticket = left(varx(9));
 /*10*/ boat = left(varx(10));
 /*11*/ sex = left(varx(11));
 
 drop _age a_row word var1-var11;
/*-------------*/
cards;
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
run;
Did you check this ? https://communities.sas.com/t5/SAS-Studio/How-to-read-nested-double-quotation-in-txt-files/td-p/308211
OK. Appearntly Someone made such a file purposely .
Try this one.
filename x '/folders/myfolders/titanic.txt';
filename y temp;
data _null_;
 infile x lrecl=2000 length=len;
 file y lrecl=2000;
 input x $varying2000. len;
 p='s/,(?=")|(?<="),'||"/|/";
 x=prxchange(p,-1,strip(x));
 put x;
run;
proc import datafile=y dbms=csv out=have replace;
delimiter='|';
guessingrows=max;
run;
data have;
 set have;
 array x{*} $ _character_;
 do i=1 to dim(x);
  x{i}=prxchange('s/^"|"$//',-1,strip(x{i}));
 end;
drop i;
run;
proc print noobs;run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
