BookmarkSubscribeRSS Feed
Arcturuz
Calcite | Level 5

Hi, How do I read nested double quotation in txt files and txt files as such? P.S the highlighted name. 

for example the txt is as such: 

 

"row.names","pclass","survived","name","age","embarked","home.dest","room","ticket","boat","sex"

"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"

 

I have attached the file below. Thank you much for help.

13 REPLIES 13
Reeza
Super User

DSD option in an infile statement should be fine. 

Your delimiter is a comma otherwise so it should read properly. I think...

What errors are you getting?

Kurt_Bremser
Super User

@Reeza wrote:

DSD option in an infile statement should be fine. 

Your delimiter is a comma otherwise so it should read properly. I think...

What errors are you getting?


No errors happen, but something quite peculiar:

data test;
infile cards dlm=',' dsd truncover;
input rownames $ pclass $ survived name :$50. age embarked $ homedest $ room $ ticket $ boat sex $;
cards;
"37","1st",1,"Brown, Mrs James Joseph (Margaret "Molly" Tobin)",44.0000,"Cherbourg","Denver, CO","","17610 L27 15s 5d","6","female"
;
run;

proc print noobs;run;

Result:

rownames    pclass    survived     name     age    embarked    homedest     room      ticket    boat    sex

   37        1st          1       "Brown     .     44.0000     Cherbour    Denver,                .      6 

As you can see, SAS interprets the comma within the double quotes as a delimiter; everything after that up to the next comma (disregards all further double quotes) ends up as input for the next column, causing a shift and missing values.

 

My tests with 3 different programs (SAS, Excel, LibreOffice calc) have revealed that the only software that reads this correctly is LibreOffice. Not really surprised by that, but SAS should do better.

I'd put that to SAS TS.

 

Replacing the double quotes around the name with single quotes made SAS read this as intended, BTW.

 

Edit: replaced "commas" in one place with "double quotes".

Tom
Super User Tom
Super User

The problem is that the source file is poorly formed so that there is no unambiguous way to parse it.  It has a quoted value with embedded quotes that are not doubled up.  If you want to put the text  

Brown, Mrs James Joseph (Margaret "Molly" Tobin)

in quotes you either need to use single quotes.

'Brown, Mrs James Joseph (Margaret "Molly" Tobin)'

Or double the existing quotes.

"Brown, Mrs James Joseph (Margaret ""Molly"" Tobin)"
Ksharp
Super User

How about this one .

 

filename x '/folders/myfolders/titanicExample.txt';
filename y temp;
data _null_;
 infile x lrecl=2000 length=len;
 file y lrecl=2000;
 input x $varying2000. len;
 p='s/^"|"$|"(?=,)|(?<=,)"'||"/'/";
 x=prxchange(p,-1,strip(x));
 put x;
run;


proc import datafile=y dbms=csv out=have replace;
guessingrows=max; run;
wilsonli
Calcite | Level 5
Hello, pattern matching is actually working to resolve this issue, so how does pattern matching work? for example p='s/^"|"$|"(?=,)|(?<=,)"'||"/'/" in the code; how can I understand variable? could you plz explain a little bit for this? thanks
Ksharp
Super User
SAS documentation explain PRX very well . Check it.

^"|"$|"(?=,)|(?<=,)"
==>
^" stands for start with "
| stands for OR
"$ stands for end with "
"(?=,) is forward match like ",
(?<=,)" is backward match like ,"

/'/ stands for replace all the " above with '
wilsonli
Calcite | Level 5
thank you, seems like this solution can't keep consistency due to the corrupted data source. Because it corrupt the other row that have raw values like "Mike's..." after replacing double quotes to single quotes as it becomes 'Mike's...' and it will have the same issue again.

So my question is: in real world workplace, is it possible to manually fix raw data source from "Brown, Mrs James Joseph (Margaret "Molly" Tobin)" to "Brown, Mrs James Joseph (Margaret 'Molly' Tobin)" before reading it?
Ksharp
Super User
Please start a new session to discuss this question.
And other sas users could offer you a good idea / code !
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Where did you get the file?  It looks like something exported from Excel.  It has decided that as the delimiter appears in the value that text strings need to be quoted.  Unfortunately your string also contains quote marks.  Normally I would expect to see either sinlge quotes surrounding the outside, or double quotes inside (also seen tags as well).  

 

I would suggesting fixing the source, i.e. export as pipe delimted or similar:

"37"|"1st"|1|"Brown, Mrs James Joseph (Margaret "Molly" Tobin)"|44.0000|...

 

@Reeza, nope, it doesn't due to the comman in the string also.

LaurieF
Barite | Level 11

How very interesting. 'dsd' doesn't do what you want exactly, but SAS's behaviour with imbedded quotes is - well - odd.

 

When I nutted out some code, I got what I expect you were getting: Molly's name confuses everything and the following columns get out of whack. What I didn't expect is actually how 'name' would be treated. I would have thought that name would contain "Brown, Mrs James Joseph (Margaret " and then stop. What actually occurred is name is truncated after the first comma: "Brown - note that it contains the double quote.

 

I'm not inordinately proud of this code, but it does work for the example you provided. There may be other errors in following data - but this is a reasonable first attempt:

data titanic;
infile titanic 
       firstobs=2 dsd dlm=',' missover;
attrib row_names length=$ 6 label="Row Names";
attrib pclass length=$ 6 label="Pclass";
attrib survived length=4 label="Survived";
attrib name length=$ 60 label="Name";
attrib age length=4 label="Age";
attrib embarked length=$ 30 label="Embarked";
attrib home_dest length=$ 30 label="Home Dest";
attrib room length=$ 6 label="Room";
attrib ticket length=$ 30 label="Ticket";
attrib boat length=$ 6 label="Boat";
attrib sex length=$ 6 label="Sex";
attrib name2 length=$ 60; 
input row_names @; if missing(row_names) then delete; input pclass survived name @; if name =: '"' then do; input name2 @; name = strip(substr(name, 2)) || ', ' || substr(name2, 1, length(name2) - 1); end; input age ?? embarked home_dest room ticket boat sex; drop name2; run;

Note the checking of name to see if it starts with a quote. If it does, it reads name2, then concatenates it with name (without its first character). Note also that name2 is suffixed with a quote, but the internal quotes around Molly are retained.

 

I also put double question-marks after age, to treat NA as missing without an error.

Tom
Super User Tom
Super User

What is the source of that file?  Looks like some program tried to write a CSV file, but did not follow the rules for how to handle values with quotes in them.  There are many popular computer languages (Oracle comes to mind) that frequently generate gibberish like that.

 

Can you get it in a form that is not corrupted?

wilsonli
Calcite | Level 5
Hi, so do you think "Brown, Mrs James Joseph (Margaret "Molly" Tobin)" is actually corrupted? So do you think it is a good practice to manually fix the raw data from "Molly" to 'Molly' or Molly inside the double quotes before reading it? Thanks.
Tom
Super User Tom
Super User

@wilsonli wrote:
Hi, so do you think "Brown, Mrs James Joseph (Margaret "Molly" Tobin)" is actually corrupted? So do you think it is a good practice to manually fix the raw data from "Molly" to 'Molly' or Molly inside the double quotes before reading it? Thanks.

No.  It is best practice to write the text file properly the first time.

It might be possible to fix your existing file. But usually that is only possible when you know how many fields are on each line and there is only one field that could possibly have embedded quotes.  Then you can parse the fields before from the left and the fields after from the right and whatever is left over is what goes into the problem field.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 3204 views
  • 0 likes
  • 8 in conversation