BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ciphercong
Fluorite | Level 6

Hi all,

 

I constructed a .map file to pull data from XML file, but I found it is impossible for me to include content in embedded tags. (i.e. if I have something like   <P>(b) <E T="03">Purpose</E> Test test</P> ,  I can only get ''Test test'', the italicized 'Purpose' would not be included. 

 

this is the .Map file I created (partial), I am able to pull the italicized words (content in tag <E></E>) in a separate column but that is not what I need. I need everything within a <P></P> be pulled together and in order. Would that be possible ?


<TABLE name = "Section1">
<TABLE-PATH syntax="XPATH"> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </TABLE-PATH>


<COLUMN name="SECTNO" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SECTNO </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

 

<COLUMN name="SUBJECT" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SUBJECT </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

<COLUMN name="Topic">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P/E </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>

<COLUMN name="Clause">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 5000 </LENGTH>
</COLUMN>

</TABLE>

 

 

Following is the sample XML, I have also attached the sample XML if anyone is interested in writing a .map file to pull data.. 

 

Please please share your insight and I am really appreciate your help on this.

 

 

<?xml version="1.0" encoding="UTF-8"?>
<CFRDOC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CFRMergedXML.xsd">
<AMDDATE>Jan. 1, 2018</AMDDATE>
<TITLE>
<CHAPTER>
<PART>
<SECTION>
<SECTNO>§ 1.1</SECTNO>
<SUBJECT>Authority, purpose, scope, and reservation of authority.</SUBJECT>
<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
<P>(b) <E T="03">Purpose</E> This part prescribes standards under which national banks may purchase, sell, deal in, underwrite, and hold securities, consistent with the authority contained in 12 U.S.C. 24 (Seventh) and safe and sound banking practices.</P>
<P>(c) <E T="03">Scope.</E> The standards set forth in this part apply to national banks and Federal branches of foreign banks.Further, pursuant to 12 U.S.C. 335, State banks that are members of the Federal Reserve System are subject to the same limitations and conditions that apply to national banks in connection with purchasing, selling, dealing in, and underwriting securities and stock. In addition to activities authorized under this part, foreign branches of national banks are authorized to conduct international activities and invest in securities pursuant to 12 CFR part 211.</P>
<P>(d) <E T="03">Reservation of authority.</E> The OCC may determine, on a case-by-case basis, that a national bank may acquire an investment security other than an investment security of a type set forth in this part, provided the OCC determines that the bank's investment is consistent with 12 U.S.C. section 24 (Seventh) and with safe and sound banking practices. The OCC will consider all relevant factors, including the risk characteristics of the particular investment in comparison with the risk characteristics of investments that the OCC has previously authorized, and the bank's ability effectively to manage such risks. The OCC may impose limits or conditions in connection with approval of an investment security under this subsection. Investment securities that the OCC determines are permissible in accordance with this paragraph constitute eligible investments for purposes of 12 U.S.C. 24.</P>
<CITA>[61 FR 63982, Dec. 2, 1996, as amended at 73 FR 22235, Apr. 24, 2008]</CITA>
</SECTION>
</PART>
</CHAPTER>
</TITLE>
</CFRDOC>

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.

If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.

 

You can either: 

- in Notepad++  =>  Click Edit > EOL conversion > Windows format

- in SAS  => Use the TERMSTR option

 

View solution in original post

10 REPLIES 10
ChrisNZ
Tourmaline | Level 20

I am not familiar enough with XML maps to help parsing inner italics tags, but if worse comes to worse, you could just strip the tags from the XML file before reading it.

 

data _null_;
  A='<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>';
  B=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,A);
  put A=/B=;
run;

A=<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
B=<P>(a) Authority. This part is issued pursuant to 12 U.S.C. 1 et seq., 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>

 

 

ciphercong
Fluorite | Level 6

Hey Chris,

 

This works. But in order to to  strip the tags like you said, I will need to pull the italicized words and normal sentence within a <P></P> together, and then strip in SAS to get rid of <E></E>. the first step I cannot accomplish right now..

 

I came up with a solution (very stupid one), is to open the XML in a notepad++ and replace all the <E></E> then the italicized words become normal and retrievable.  But this is really the last solution I would take. the XML file is 30,000 lines in length not to mention there can be normal expression in the file with <E> and it is not a tag, but got mistakenly removed.

 

Hope those make sense.

 

Thanks.

ChrisNZ
Tourmaline | Level 20

Sorry it makes no sense. 🙂

Why can you do it in Notepad++ and not using SAS?

Did you try my tag-stripping code? It looks for closing tags before removing.

ciphercong
Fluorite | Level 6

Hey Chris,

 

Yes I tried the tag-stripping code it works perfect. thanks for providing me that.

 

But let me explain:  I only uses .map file to read XML in SAS, and I don't know any other way to read XML in SAS. (or open an XML file in SAS). The real XML data file is pretty huge with over 35000 lines of content and over 5000 <P></P> components (each with a <E> XXX </E> embedded) I don't know how to read them in SAS without using the .map file. 

 

 

Thanks.

ChrisNZ
Tourmaline | Level 20

I fail to see what the issue is. I would:

1- Pre-process the XML file in a data step to strip the E tags. Cleansing 35,000 lines would take no time.

The code would be similar to this:

data _null_;
  infile "&path\filename.xml"  lrecl=5000;
  file   "&path\filename2.xml" lrecl=5000;
  length STR $5000;
  input;
  STR=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,_infile_);
  put STR;
run;

2- Read the new XML file with the map.

ciphercong
Fluorite | Level 6

Hi Chris,

 

I tried this way.. Problem is that the file is too large to be handled, I kept getting this error: 

 

 

ERROR: The LRECL / LINESIZE for infile
'"XXXXX" exceeds the
maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).
The DATA STEP will not be executed.

 

 

Guess I will need to split this file into several smaller pieces and try again.

 

 

Thanks.

ChrisNZ
Tourmaline | Level 20

is the whole file in one record? is it a unix file read under windows? is there an end-of-record marker?

ciphercong
Fluorite | Level 6

Hi ChrisNZ,

 

Could you please take a look at the attached sample XML file (zipped).  I could not even load that using your code. Got the same error that it  exceeds the maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).

 

 

Really appreciate your insight on this.

 

Thanks.

ChrisNZ
Tourmaline | Level 20

Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.

If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.

 

You can either: 

- in Notepad++  =>  Click Edit > EOL conversion > Windows format

- in SAS  => Use the TERMSTR option

 

ciphercong
Fluorite | Level 6

Hi Chris,

 

Thanks for your suggestion problem solved! This is awesome!

 

 

Appreciate your help on this!

 

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1293 views
  • 10 likes
  • 2 in conversation