Solved: Re: Tricky things when parsing Data from XML

ciphercong · Posted 09-17-2018 04:47 PM

Hi all,

I constructed a .map file to pull data from XML file, but I found it is impossible for me to include content in embedded tags. (i.e. if I have something like <P>(b) <E T="03">Purpose</E> Test test</P> , I can only get ''Test test'', the italicized 'Purpose' would not be included.

this is the .Map file I created (partial), I am able to pull the italicized words (content in tag <E></E>) in a separate column but that is not what I need. I need everything within a <P></P> be pulled together and in order. Would that be possible ?

<TABLE name = "Section1">
<TABLE-PATH syntax="XPATH"> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </TABLE-PATH>

<COLUMN name="SECTNO" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SECTNO </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

<COLUMN name="SUBJECT" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SUBJECT </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

<COLUMN name="Topic">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P/E </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

<COLUMN name="Clause">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 5000 </LENGTH>
</COLUMN>

</TABLE>

Following is the sample XML, I have also attached the sample XML if anyone is interested in writing a .map file to pull data..

Please please share your insight and I am really appreciate your help on this.

<?xml version="1.0" encoding="UTF-8"?>
<CFRDOC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CFRMergedXML.xsd">
<AMDDATE>Jan. 1, 2018</AMDDATE>
<TITLE>
<CHAPTER>
<PART>
<SECTION>
<SECTNO>§ 1.1</SECTNO>
<SUBJECT>Authority, purpose, scope, and reservation of authority.</SUBJECT>
<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
<P>(b) <E T="03">Purpose</E> This part prescribes standards under which national banks may purchase, sell, deal in, underwrite, and hold securities, consistent with the authority contained in 12 U.S.C. 24 (Seventh) and safe and sound banking practices.</P>
<P>(c) <E T="03">Scope.</E> The standards set forth in this part apply to national banks and Federal branches of foreign banks.Further, pursuant to 12 U.S.C. 335, State banks that are members of the Federal Reserve System are subject to the same limitations and conditions that apply to national banks in connection with purchasing, selling, dealing in, and underwriting securities and stock. In addition to activities authorized under this part, foreign branches of national banks are authorized to conduct international activities and invest in securities pursuant to 12 CFR part 211.</P>
<P>(d) <E T="03">Reservation of authority.</E> The OCC may determine, on a case-by-case basis, that a national bank may acquire an investment security other than an investment security of a type set forth in this part, provided the OCC determines that the bank's investment is consistent with 12 U.S.C. section 24 (Seventh) and with safe and sound banking practices. The OCC will consider all relevant factors, including the risk characteristics of the particular investment in comparison with the risk characteristics of investments that the OCC has previously authorized, and the bank's ability effectively to manage such risks. The OCC may impose limits or conditions in connection with approval of an investment security under this subsection. Investment securities that the OCC determines are permissible in accordance with this paragraph constitute eligible investments for purposes of 12 U.S.C. 24.</P>
<CITA>[61 FR 63982, Dec. 2, 1996, as amended at 73 FR 22235, Apr. 24, 2008]</CITA>
</SECTION>
</PART>
</CHAPTER>
</TITLE>
</CFRDOC>

ChrisNZ · Posted 09-19-2018 06:27 PM

Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.

If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.

You can either:

- in Notepad++ => Click Edit > EOL conversion > Windows format

- in SAS => Use the TERMSTR option

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 09-17-2018 08:25 PM

I am not familiar enough with XML maps to help parsing inner italics tags, but if worse comes to worse, you could just strip the tags from the XML file before reading it.

data _null_;
  A='<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>';
  B=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,A);
  put A=/B=;
run;

A=<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
B=<P>(a) Authority. This part is issued pursuant to 12 U.S.C. 1 et seq., 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>

High-Performance SAS Coding - Third Edition

ciphercong · Posted 09-18-2018 09:16 AM

Hey Chris,

This works. But in order to to strip the tags like you said, I will need to pull the italicized words and normal sentence within a <P></P> together, and then strip in SAS to get rid of <E></E>. the first step I cannot accomplish right now..

I came up with a solution (very stupid one), is to open the XML in a notepad++ and replace all the <E></E> then the italicized words become normal and retrievable. But this is really the last solution I would take. the XML file is 30,000 lines in length not to mention there can be normal expression in the file with <E> and it is not a tag, but got mistakenly removed.

Hope those make sense.

Thanks.

ChrisNZ · Posted 09-18-2018 06:16 PM

Sorry it makes no sense. 🙂

Why can you do it in Notepad++ and not using SAS?

Did you try my tag-stripping code? It looks for closing tags before removing.

High-Performance SAS Coding - Third Edition

ciphercong · Posted 09-18-2018 11:10 PM

Hey Chris,

Yes I tried the tag-stripping code it works perfect. thanks for providing me that.

But let me explain: I only uses .map file to read XML in SAS, and I don't know any other way to read XML in SAS. (or open an XML file in SAS). The real XML data file is pretty huge with over 35000 lines of content and over 5000 <P></P> components (each with a <E> XXX </E> embedded) I don't know how to read them in SAS without using the .map file.

Thanks.

ChrisNZ · Posted 09-18-2018 11:18 PM

I fail to see what the issue is. I would:

1- Pre-process the XML file in a data step to strip the E tags. Cleansing 35,000 lines would take no time.

The code would be similar to this:

data _null_;
  infile "&path\filename.xml"  lrecl=5000;
  file   "&path\filename2.xml" lrecl=5000;
  length STR $5000;
  input;
  STR=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,_infile_);
  put STR;
run;

2- Read the new XML file with the map.

High-Performance SAS Coding - Third Edition

ciphercong · Posted 09-18-2018 11:23 PM

Hi Chris,

I tried this way.. Problem is that the file is too large to be handled, I kept getting this error:

ERROR: The LRECL / LINESIZE for infile
'"XXXXX" exceeds the
maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).
The DATA STEP will not be executed.

Guess I will need to split this file into several smaller pieces and try again.

Thanks.

ChrisNZ · Posted 09-18-2018 11:53 PM

is the whole file in one record? is it a unix file read under windows? is there an end-of-record marker?

High-Performance SAS Coding - Third Edition

ciphercong · Posted 09-19-2018 04:15 PM

Hi ChrisNZ,

Could you please take a look at the attached sample XML file (zipped). I could not even load that using your code. Got the same error that it exceeds the maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).

Really appreciate your insight on this.

Thanks.

ChrisNZ · Posted 09-19-2018 06:27 PM

Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.

If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.

You can either:

- in Notepad++ => Click Edit > EOL conversion > Windows format

- in SAS => Use the TERMSTR option

High-Performance SAS Coding - Third Edition

ciphercong · Posted 09-20-2018 01:11 AM

Hi Chris,

Thanks for your suggestion problem solved! This is awesome!

Appreciate your help on this!

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away