Hello,
I have code that composes XML from parts that are stored in a table in a database. I export XML via the FILE command with options RECFM=N so that CRLF characters do not appear at the end. XML is encoded in UTF-8, but sas in WLATIN2.
If I use the setting RECFM=N, then SAS prompts encoding = UTF=8. If I don't use UTF8 encoding, the XML contains invalid characters... re-import and export is not possible, XML are too big
How to export XML in UTF8?
proc sql noprint;
select distinct PROCESS_ID
into :g_processIDList separated by ' '
from out_lib.Dbn_xml_data;
quit;
%put &g_processIDList.;
%macro DAR_XML_PROCESSING1;
%if %length(&g_processIDList.) ne 0 %then %do;
%do i=1 %to %sysfunc(countw(&g_processIDList., ' '));
%let l_processID=%scan(&g_processIDList.,&i,%str( ));
%put zpracovávám žádost: &l_processID.;
data HELP;
set out_lib.DBN_XML_DATA (where=(PROCESS_ID = "&l_processID."));
run;
proc sort data=HELP out=HELP (keep=XML_CONTENT);
by PART_ID;
run;
filename out "\\somePath\&l_processID..xml" RECFM=N /*encoding="utf-8"*/;
data _null_;
file OUT;
set HELP;
put XML_CONTENT;
run;
%end;
%end;
%else %do;
%put [NOTE]: není co zpracovávat!;
%end;
%mend DAR_XML_PROCESSING1;
%DAR_XML_PROCESSING1;
Is there any reason you aren't using the XMLV2 LIBNAME engine to export your SAS data to XML? Would something like this work for you?
data HELP;
set out_lib.DBN_XML_DATA (where=(PROCESS_ID = "&l_processID."));
run;
proc sort data=HELP
out=HELP(keep=XML_CONTENT);
by PART_ID;
run;
libname x xmlv2 "\\somePath\&l_processID..xml" XMLENCODING=UTF8;
proc copy in=work out=x;
select help;
run;
libname x clear;
I have idea what this statement means:
If I use the setting RECFM=N, then SAS prompts encoding = UTF=8.
I have never seen such a prompt.
Can you provide a small example of the text you are trying to write to the file?
What do you see in the file when you read it as binary data? For example try dumping the first 500 bytes to the SAS log.
data _null_;
infile "\\somePath\&l_processID..xml" recfm=f lrecl=100 obs=5;
input;
list;
run;
How are you trying to use the resulting file? Are you reading it into SAS? Or passing it to some other software to use?
below you can see how the XML stored in oracle DB looks like. The XML is separated into several parts, the PROCESS_ID column contains the identification of the given XML. The table contains any number of XMLs each day.
XML_CONTENT | PROCESS_ID | PART_ID |
xml body part 1 | ABC | 1 |
xml body part 2 | ABC | 2 |
@TeryK2 wrote:
below you can see how the XML stored in oracle DB looks like. The XML is separated into several parts, the PROCESS_ID column contains the identification of the given XML. The table contains any number of XMLs each day.
XML_CONTENT PROCESS_ID PART_ID xml body part 1 ABC 1 xml body part 2 ABC 2
I got that idea from your first post. But it is not clear where you are having trouble.
As you state you cannot use ENCODING when writing using RECFM=N, but that shouldn't matter since SAS should not be doing any transcoding at that point.
So where do you get problems from the files?
Do the resulting files not reflect what is stored in the XML_CONTENT variable in the ORACLE table? In what way?
Do the resulting XML files not get processed properly? What code (or tool) are you using to process the XML?
Do the resulting XML files actually contain UTF-8 characters? if so then you probably will want to process it using a SAS session started with encoding=UTF-8 since there are way more UTF-8 characters than can fit into any particular single byte encoding. The session that reads the XML file need not be the same one that extracts it from the Oracle table.
the problem is that the XML composed from the parts stored in the table contains incorrect values for UTF-8 encoding. XML is not valid. When I don't use the RECFM option, but option encoding="utf-8", the resulting XML is valid, but the parts (stored in table) are separated by CRLF sign, and that is another problem...
code for xml parsing below. When is XML valid, code is working perfectly
filename XML '\\some path\XML\03f68454-5969-11ef-9034-0a580a8202d9.xml' encoding="utf-8";
filename map '\\some path\test3.map';
libname XML xmlv2 xmlmap=map;
proc print data=XML.dar;
run;
So there are a number of possible issues here. The bytes are not getting moved from Oracle to SAS properly. The XML file is not being generated properly. And finally the XML is just not something that SAS can handle, either because it does in fact contain byte strings that are not valid UTF-8 characters, or because SAS just cannot figure out the structure of the XML and how to automap it into datasets.
It would be easiest if you could make a small example XML file that works with SAS but fails when run though your process (so you don't have to look through as much data).
To test if the bytes get from Oracle to SAS properly look at the HEX representation of the bytes and make sure they are the same. In SAS you can use the $HEX format to display the bytes in your character variable. You can check Oracle documentation for how to do something similar on the Oracle side.
Test if the bytes get from the SAS dataset into the XML file. You can read in the XML file as BINARY data (use RECFM=F) and again use $HEX format to display the bytes.
Note that using code like you showed to read in a file that has UTF-8 characters in a SAS session that is using a single byte encoding will most likely result in some of the characters not making it into the SAS dataset because they cannot be transcoded from UTF-8 to the session encoding. Any single byte encoding can only represent 256 characters and there are many more characters than that which could be present in a file using UTF-8 encoding.
If the file is getting transferred properly but SAS cannot read it then you might have to fix the file first. For example you might be seeing this issue where the encoding of the XML is not actually UTF-8 (again something you can check by looking at actual files and the bytes that are in them) .
Note also that splitting UTF8 strings into arbitrary substrings can cause transcoding issues when using UTF8 encoding to process the strings.
Since some UTF8 characters use more than one byte if you split the string in the middle of character you will have an invalid character at the end of the first string and beginning of the second string.
And a separate issue is that I believe Oracle does have character variables with lengths based on the number of CHARACTERS they hold instead of the number of BYTES. This can cause confusion when transferring strings to SAS character variables where length is only defined by the number of BYTES. If the variable in ORACLE can hold 5 characters that might mean up to 20 bytes depending on the characters being stored. If that gets copied into a 5 byte long variable in the SAS dataset then you could lose some data. And also again get incomplete UTF-8 characters (Invalid UTF-8 codes).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.