XML Mapper losing CRLF characters

Occasional Contributor
Posts: 12

XML Mapper losing CRLF characters

Hello all

First off, thanks for reading this. This is a re-post to get new exposure to a frustrating problem.

Here's the deal: when I read an XML CDATA field into SAS, I lose the '0D0A'x characters, even when I specify a $charw. informat. I've searched everywhere for a way to retain these characters when reading character data that contains them, but to no avail.

     For example, if the XML raw data is as shown below:

        ...

          <Item_Response><![CDATA[ This is the data I want.

                                        I just pressed Enter.

                                        Again.]]>

          </Item_Response>

         ...

     Instead of this: "This is the data I want.'0D0A'xI just pressed Enter.'0D0A'xAgain."

             I get this: "This is the data I want.'2020'xI just pressed Enter.'2020'xAgain."

Below is the part of the XML Map corresponding to the misbehaving field:

        <COLUMN name="Item_Response">

            <PATH syntax="XPath">/QS_Scoring/CR_Item_Resp_Record/Item_Response</PATH>

            <TYPE>character</TYPE>

            <DATATYPE>string</DATATYPE>

            <LENGTH>20000</LENGTH>

            <FORMAT width="20000">$CHAR</FORMAT>

            <INFORMAT width="20000">$CHAR</INFORMAT>

        </COLUMN>

Below is the SAS libname statement (I'm using SAS 9.2):

     libname XML_lib xml "&workhere.\MyFile.xml" xmlmap="&workhere.\XML_MAP.map";

Help?

Huey

SAS Employee
Posts: 4

Re: XML Mapper losing CRLF characters

The XML specification (Extensible Markup Language (XML) 1.0 (Fifth Edition)) doesn't directly allow that: 

2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These ...

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.

Note that the EOL translation occurs before parsing, so it makes no difference that it happens in a CDATA section.

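You should be able to escape the CR and LF characters as follows. Because character references inside a CDATA section are treated as literal text rather than expanded by the parser, you'll have to handle the un-escaping yourself after reading the data.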

<Item_Response>

    <![CDATA[ This is the data I want.&#x0D;&#x0A;I just pressed Enter.&#x0D;&#x0A;Again.]]>                                       

</Item_Response>
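
A minimal sketch of that un-escaping step in a DATA step. It assumes the XML map exposes the column in a table named CR_Item_Resp_Record (a guess based on the XPath in the original post; substitute your own table name) and that the file contains the literal text &#x0D;&#x0A; exactly as in the example above. The output data set name work.responses is only illustrative.

```sas
/* Sketch only: table and output names are assumptions, not from the post. */
data work.responses;
    set XML_lib.CR_Item_Resp_Record;
    /* Single quotes keep the macro processor from acting on the ampersands. */
    /* Each literal escape sequence is swapped for a real CR+LF pair.        */
    Item_Response = tranwrd(Item_Response, '&#x0D;&#x0A;', '0D0A'x);
run;
```

Because Item_Response already has a length of 20000 from the map, the result is assigned back to the same variable rather than to a new one, which avoids truncation from a shorter default length.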

Occasional Learner
Posts: 1

Re: XML Mapper losing CRLF characters

This is an issue with the older SAS XML engine, which replaces whitespace clusters with a single space. A notable exception is the carriage-return character reference: &#xD; and &#13; will be replaced with the carriage-return character, no CDATA block needed.

Since you're using SAS 9.2, you have access to the newer XML92 engine, which leaves whitespace (including but not necessarily limited to newlines, tabs, and multiple spaces) intact. Simply replace "xml" with "xml92" in your libname statement. See the following paper for some discussion of this and other changes in the XML engine (including a new directive to permit or reject some literal characters like quotation marks): http://www2.sas.com/proceedings/sugi28/173-28.pdf
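
Applied to the libname statement from the original post, the change might look like the sketch below. The table name CR_Item_Resp_Record is an assumption based on the XPath in the posted map, so substitute whatever table your map defines. The DATA step is just a quick check of which line-break bytes actually arrive; given the end-of-line normalization quoted from the XML specification earlier in the thread, a lone '0A'x may show up where '0D0A'x was expected.

```sas
/* Same statement as in the original post, with the engine name changed. */
libname XML_lib xml92 "&workhere.\MyFile.xml" xmlmap="&workhere.\XML_MAP.map";

/* Sketch only: the table name is an assumption, not from the post. */
data _null_;
    set XML_lib.CR_Item_Resp_Record;
    crlf_pos = index(Item_Response, '0D0A'x);  /* first CR+LF pair, 0 if none */
    lf_pos   = index(Item_Response, '0A'x);    /* first LF, 0 if none         */
    put _n_= crlf_pos= lf_pos=;
run;
```

If lf_pos is nonzero while crlf_pos is 0, the line breaks survived as single linefeeds, and they can be expanded afterwards if CR+LF pairs are required downstream.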

If you're using LoadXL.sas to read a Microsoft Excel XML file, amend the libname statement on line 138 of the program. (Note also that the ExcelXP.map XML map truncates strings at 1024 characters; edit line 92 to increase the length.) See the following paper for details on reading and writing Excel workbooks: http://www2.sas.com/proceedings/sugi31/115-31.pdf (Note the download referenced in the paper is now in the 2006 archives.) If anyone is aware of an updated paper, please let me know. Thanks!
