Hi, How do I read nested double quotation in txt files and txt files as such? P.S the highlighted name.
for example the txt is as such:
"row.names","pclass","survived","name","age","embarked","home.dest","room","ticket","boat","sex"
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
I have attached the file below. Thank you much for help.
DSD option in an infile statement should be fine.
Your delimiter is a comma otherwise so it should read properly. I think...
What errors are you getting?
@Reeza wrote:
DSD option in an infile statement should be fine.
Your delimiter is a comma otherwise so it should read properly. I think...
What errors are you getting?
No errors happen, but something quite peculiar:
data test;
infile cards dlm=',' dsd truncover;
input rownames $ pclass $ survived name :$50. age embarked $ homedest $ room $ ticket $ boat sex $;
cards;
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
;
run;
proc print noobs;run;
Result:
rownames pclass survived name age embarked homedest room ticket boat sex 37 1st 1 "Brown . 44.0000 Cherbour Denver, . 6
As you can see, SAS interprets the comma within the double quotes as a delimiter; everything after that up to the next comma (disregards all further double quotes) ends up as input for the next column, causing a shift and missing values.
My tests with 3 different programs (SAS, Excel, LibreOffice calc) have revealed that the only software that reads this correctly is LibreOffice. Not really surprised by that, but SAS should do better.
I'd put that to SAS TS.
Replacing the double quotes around the name with single quotes made SAS read this as intended, BTW.
Edit: replaced "commas" in one place with "double quotes".
The problem is that the source file is poorly formed so that there is no unambiguous way to parse it. It has a quoted value with embedded quotes that are not doubled up. If you want to put the text
Brown, Mrs James Joseph (Margaret "Molly" Tobin)
in quotes you either need to use single quotes.
'Brown, Mrs James Joseph (Margaret "Molly" Tobin)'
Or double the existing quotes.
"Brown, Mrs James Joseph (Margaret ""Molly"" Tobin)"
How about this one .
filename x '/folders/myfolders/titanicExample.txt';
filename y temp;
data _null_;
infile x lrecl=2000 length=len;
file y lrecl=2000;
input x $varying2000. len;
p='s/^"|"$|"(?=,)|(?<=,)"'||"/'/";
x=prxchange(p,-1,strip(x));
put x;
run;
proc import datafile=y dbms=csv out=have replace;
guessingrows=max;
run;
Where did you get the file? It looks like something exported from Excel. It has decided that as the delimiter appears in the value that text strings need to be quoted. Unfortunately your string also contains quote marks. Normally I would expect to see either sinlge quotes surrounding the outside, or double quotes inside (also seen tags as well).
I would suggesting fixing the source, i.e. export as pipe delimted or similar:
"37"|"1st"|1|"Brown, Mrs James Joseph (Margaret "Molly" Tobin)"|44.0000|...
@Reeza, nope, it doesn't due to the comman in the string also.
How very interesting. 'dsd' doesn't do what you want exactly, but SAS's behaviour with imbedded quotes is - well - odd.
When I nutted out some code, I got what I expect you were getting: Molly's name confuses everything and the following columns get out of whack. What I didn't expect is actually how 'name' would be treated. I would have thought that name would contain "Brown, Mrs James Joseph (Margaret " and then stop. What actually occurred is name is truncated after the first comma: "Brown - note that it contains the double quote.
I'm not inordinately proud of this code, but it does work for the example you provided. There may be other errors in following data - but this is a reasonable first attempt:
data titanic; infile titanic firstobs=2 dsd dlm=',' missover; attrib row_names length=$ 6 label="Row Names"; attrib pclass length=$ 6 label="Pclass"; attrib survived length=4 label="Survived"; attrib name length=$ 60 label="Name"; attrib age length=4 label="Age"; attrib embarked length=$ 30 label="Embarked"; attrib home_dest length=$ 30 label="Home Dest"; attrib room length=$ 6 label="Room"; attrib ticket length=$ 30 label="Ticket"; attrib boat length=$ 6 label="Boat"; attrib sex length=$ 6 label="Sex"; attrib name2 length=$ 60;
input row_names @; if missing(row_names) then delete; input pclass survived name @; if name =: '"' then do; input name2 @; name = strip(substr(name, 2)) || ', ' || substr(name2, 1, length(name2) - 1); end; input age ?? embarked home_dest room ticket boat sex; drop name2; run;
Note the checking of name to see if it starts with a quote. If it does, it reads name2, then concatenates it with name (without its first character). Note also that name2 is suffixed with a quote, but the internal quotes around Molly are retained.
I also put double question-marks after age, to treat NA as missing without an error.
What is the source of that file? Looks like some program tried to write a CSV file, but did not follow the rules for how to handle values with quotes in them. There are many popular computer languages (Oracle comes to mind) that frequently generate gibberish like that.
Can you get it in a form that is not corrupted?
@wilsonli wrote:
Hi, so do you think "Brown, Mrs James Joseph (Margaret "Molly" Tobin)" is actually corrupted? So do you think it is a good practice to manually fix the raw data from "Molly" to 'Molly' or Molly inside the double quotes before reading it? Thanks.
No. It is best practice to write the text file properly the first time.
It might be possible to fix your existing file. But usually that is only possible when you know how many fields are on each line and there is only one field that could possibly have embedded quotes. Then you can parse the fields before from the left and the fields after from the right and whatever is left over is what goes into the problem field.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.