Hi all,
I constructed a .map file to pull data from XML file, but I found it is impossible for me to include content in embedded tags. (i.e. if I have something like <P>(b) <E T="03">Purpose</E> Test test</P> , I can only get ''Test test'', the italicized 'Purpose' would not be included.
this is the .Map file I created (partial), I am able to pull the italicized words (content in tag <E></E>) in a separate column but that is not what I need. I need everything within a <P></P> be pulled together and in order. Would that be possible ?
<TABLE name = "Section1">
<TABLE-PATH syntax="XPATH"> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </TABLE-PATH>
<COLUMN name="SECTNO" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SECTNO </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>
<COLUMN name="SUBJECT" retain = "YES">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/SUBJECT </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>
<COLUMN name="Topic">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P/E </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 1000 </LENGTH>
</COLUMN>
<COLUMN name="Clause">
<PATH> /CFRDOC/TITLE/CHAPTER/PART/SECTION/P </PATH>
<TYPE> character </TYPE>
<DATATYPE> string </DATATYPE>
<LENGTH> 5000 </LENGTH>
</COLUMN>
</TABLE>
Following is the sample XML, I have also attached the sample XML if anyone is interested in writing a .map file to pull data..
Please please share your insight and I am really appreciate your help on this.
<?xml version="1.0" encoding="UTF-8"?>
<CFRDOC xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CFRMergedXML.xsd">
<AMDDATE>Jan. 1, 2018</AMDDATE>
<TITLE>
<CHAPTER>
<PART>
<SECTION>
<SECTNO>§ 1.1</SECTNO>
<SUBJECT>Authority, purpose, scope, and reservation of authority.</SUBJECT>
<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
<P>(b) <E T="03">Purpose</E> This part prescribes standards under which national banks may purchase, sell, deal in, underwrite, and hold securities, consistent with the authority contained in 12 U.S.C. 24 (Seventh) and safe and sound banking practices.</P>
<P>(c) <E T="03">Scope.</E> The standards set forth in this part apply to national banks and Federal branches of foreign banks.Further, pursuant to 12 U.S.C. 335, State banks that are members of the Federal Reserve System are subject to the same limitations and conditions that apply to national banks in connection with purchasing, selling, dealing in, and underwriting securities and stock. In addition to activities authorized under this part, foreign branches of national banks are authorized to conduct international activities and invest in securities pursuant to 12 CFR part 211.</P>
<P>(d) <E T="03">Reservation of authority.</E> The OCC may determine, on a case-by-case basis, that a national bank may acquire an investment security other than an investment security of a type set forth in this part, provided the OCC determines that the bank's investment is consistent with 12 U.S.C. section 24 (Seventh) and with safe and sound banking practices. The OCC will consider all relevant factors, including the risk characteristics of the particular investment in comparison with the risk characteristics of investments that the OCC has previously authorized, and the bank's ability effectively to manage such risks. The OCC may impose limits or conditions in connection with approval of an investment security under this subsection. Investment securities that the OCC determines are permissible in accordance with this paragraph constitute eligible investments for purposes of 12 U.S.C. 24.</P>
<CITA>[61 FR 63982, Dec. 2, 1996, as amended at 73 FR 22235, Apr. 24, 2008]</CITA>
</SECTION>
</PART>
</CHAPTER>
</TITLE>
</CFRDOC>
Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.
If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.
You can either:
- in Notepad++ => Click Edit > EOL conversion > Windows format
- in SAS => Use the TERMSTR option
I am not familiar enough with XML maps to help parsing inner italics tags, but if worse comes to worse, you could just strip the tags from the XML file before reading it.
data _null_;
A='<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>';
B=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,A);
put A=/B=;
run;
A=<P>(a) <E T="03">Authority.</E> This part is issued pursuant to 12 U.S.C. 1 <E T="03">et seq.,</E> 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
B=<P>(a) Authority. This part is issued pursuant to 12 U.S.C. 1 et seq., 12 U.S.C. 24 (Seventh), and 12 U.S.C. 93a.</P>
Hey Chris,
This works. But in order to to strip the tags like you said, I will need to pull the italicized words and normal sentence within a <P></P> together, and then strip in SAS to get rid of <E></E>. the first step I cannot accomplish right now..
I came up with a solution (very stupid one), is to open the XML in a notepad++ and replace all the <E></E> then the italicized words become normal and retrievable. But this is really the last solution I would take. the XML file is 30,000 lines in length not to mention there can be normal expression in the file with <E> and it is not a tag, but got mistakenly removed.
Hope those make sense.
Thanks.
Sorry it makes no sense. 🙂
Why can you do it in Notepad++ and not using SAS?
Did you try my tag-stripping code? It looks for closing tags before removing.
Hey Chris,
Yes I tried the tag-stripping code it works perfect. thanks for providing me that.
But let me explain: I only uses .map file to read XML in SAS, and I don't know any other way to read XML in SAS. (or open an XML file in SAS). The real XML data file is pretty huge with over 35000 lines of content and over 5000 <P></P> components (each with a <E> XXX </E> embedded) I don't know how to read them in SAS without using the .map file.
Thanks.
I fail to see what the issue is. I would:
1- Pre-process the XML file in a data step to strip the E tags. Cleansing 35,000 lines would take no time.
The code would be similar to this:
data _null_;
infile "&path\filename.xml" lrecl=5000;
file "&path\filename2.xml" lrecl=5000;
length STR $5000;
input;
STR=prxchange('s/(.+?)(<E.*?>(.+?)<\/E>)(.+?)/\1\3\4/',-1,_infile_);
put STR;
run;
2- Read the new XML file with the map.
Hi Chris,
I tried this way.. Problem is that the file is too large to be handled, I kept getting this error:
ERROR: The LRECL / LINESIZE for infile
'"XXXXX" exceeds the
maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).
The DATA STEP will not be executed.
Guess I will need to split this file into several smaller pieces and try again.
Thanks.
is the whole file in one record? is it a unix file read under windows? is there an end-of-record marker?
Hi ChrisNZ,
Could you please take a look at the attached sample XML file (zipped). I could not even load that using your code. Got the same error that it exceeds the maximum allowable length for an _INFILE_ or _INFILE_= variable (32,767).
Really appreciate your insight on this.
Thanks.
Since you have Notepad++, you can see that the EOL markers are LF, which is the Unix marker.
If your SAS session runs under Windows, and since Windows uses CR+LF as EOL markers, windows only sees one record.
You can either:
- in Notepad++ => Click Edit > EOL conversion > Windows format
- in SAS => Use the TERMSTR option
Hi Chris,
Thanks for your suggestion problem solved! This is awesome!
Appreciate your help on this!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.