BookmarkSubscribeRSS Feed
TeryK2
Calcite | Level 5

Hello, 

 

I have code that composes XML from parts that are stored in a table in a database. I export XML via the FILE command with options RECFM=N so that CRLF characters do not appear at the end. XML is encoded in UTF-8, but sas in WLATIN2.

 

If I use the setting RECFM=N, then SAS prompts encoding = UTF=8. If I don't use UTF8 encoding, the XML contains invalid characters... re-import and export is not possible, XML are too big

 

How to export XML in UTF8?

 

proc sql noprint;
select distinct PROCESS_ID
into :g_processIDList separated by ' '
from out_lib.Dbn_xml_data;
quit;

%put &g_processIDList.;


%macro DAR_XML_PROCESSING1;
	%if %length(&g_processIDList.) ne 0 %then %do;
		%do i=1 %to %sysfunc(countw(&g_processIDList., ' '));			
			%let l_processID=%scan(&g_processIDList.,&i,%str( ));
			%put zpracovávám žádost: &l_processID.;
			
			data HELP;
				set out_lib.DBN_XML_DATA (where=(PROCESS_ID = "&l_processID."));
			run;

			proc sort data=HELP out=HELP (keep=XML_CONTENT);
			by PART_ID;
			run;

			filename out "\\somePath\&l_processID..xml" RECFM=N /*encoding="utf-8"*/;
			data _null_;
			  file OUT;
			  set HELP;
			  put XML_CONTENT;
			run;
	
		%end;	
	%end;
	%else %do;
		%put [NOTE]: není co zpracovávat!;
	%end;
%mend DAR_XML_PROCESSING1;

%DAR_XML_PROCESSING1;

 

8 REPLIES 8
SASJedi
SAS Super FREQ

Is there any reason you aren't using the XMLV2 LIBNAME engine to export your SAS data to XML? Would something like this work for you?

data HELP;
   set out_lib.DBN_XML_DATA (where=(PROCESS_ID = "&l_processID."));
run;

proc sort data=HELP 
           out=HELP(keep=XML_CONTENT);
   by PART_ID;
run;

libname x xmlv2 "\\somePath\&l_processID..xml" XMLENCODING=UTF8;
proc copy in=work out=x;
	select help;
run;
libname x clear;

 

Check out my Jedi SAS Tricks for SAS Users
TeryK2
Calcite | Level 5
XML is splited into several parts in Oracle DB because XML is larger than the maximum possible length of a text variable. Therefore, before I process the XML via SAS, I just need to export to a new file. that's why I use the FILENAME command.
Tom
Super User Tom
Super User

I have idea what this statement means:

If I use the setting RECFM=N, then SAS prompts encoding = UTF=8.

I have never seen such a prompt.

 

Can you provide a small example of the text you are trying to write to the file?

 

What do you see in the file when you read it as binary data?  For example try dumping the first 500 bytes to the SAS log.

data _null_;
  infile "\\somePath\&l_processID..xml" recfm=f lrecl=100 obs=5;
  input;
  list;
run;

How are you trying to use the resulting file?  Are you reading it into SAS? Or passing it to some other software to use?

 

TeryK2
Calcite | Level 5

below you can see how the XML stored in oracle DB looks like. The XML is separated into several parts, the PROCESS_ID column contains the identification of the given XML. The table contains any number of XMLs each day.

XML_CONTENTPROCESS_IDPART_ID
xml body part 1ABC1
xml body part 2ABC2
Tom
Super User Tom
Super User

@TeryK2 wrote:

below you can see how the XML stored in oracle DB looks like. The XML is separated into several parts, the PROCESS_ID column contains the identification of the given XML. The table contains any number of XMLs each day.

XML_CONTENT PROCESS_ID PART_ID
xml body part 1 ABC 1
xml body part 2 ABC 2

I got that idea from your first post.  But it is not clear where you are having trouble.

As you state you cannot use ENCODING when writing using RECFM=N, but that shouldn't matter since SAS should not be doing any transcoding at that point.

So where do you get problems from the files?  

Do the resulting files not reflect what is stored in the XML_CONTENT variable in the ORACLE table? In what way?

Do the resulting XML files not get processed properly?  What code (or tool) are you using to process the XML?

 

Do the resulting XML files actually contain UTF-8 characters?  if so then you probably will want to process it using a SAS session started with encoding=UTF-8 since there are way more UTF-8 characters than can fit into any particular single byte encoding.  The session that reads the XML file need not be the same one that extracts it from the Oracle table.

TeryK2
Calcite | Level 5

the problem is that the XML composed from the parts stored in the table contains incorrect values ​​for UTF-8 encoding. XML is not valid. When I don't use the RECFM option, but option encoding="utf-8", the resulting XML is valid, but the parts (stored in table) are separated by CRLF sign, and that is another problem...

 

code for xml parsing below. When is XML valid, code is working perfectly

filename XML '\\some path\XML\03f68454-5969-11ef-9034-0a580a8202d9.xml' encoding="utf-8"; 
filename map '\\some path\test3.map'; 

libname XML xmlv2  xmlmap=map;

proc print data=XML.dar; 
run;  

 

Tom
Super User Tom
Super User

So there are a number of possible issues here.  The bytes are not getting moved from Oracle to SAS properly.  The XML file is not being generated properly.  And finally the XML is just not something that SAS can handle, either because it does in fact contain byte strings that are not valid UTF-8 characters, or because SAS just cannot figure out the structure of the XML and how to automap it into datasets.

 

It would be easiest if you could make a small example XML file that works with SAS but fails when run though your process (so you don't have to look through as much data).

 

To test if the bytes get from Oracle to SAS properly look at the HEX representation of the bytes and make sure they are the same.  In SAS you can use the $HEX format to display the bytes in your character variable.  You can check Oracle documentation for how to do something similar on the Oracle side.

 

Test if the bytes get from the SAS dataset into the XML file.  You can read in the XML file as BINARY data (use RECFM=F) and again use $HEX format to display the bytes.

 

Note that using code like you showed to read in a file that has UTF-8 characters in a SAS session that is using a single byte encoding will most likely result in some of the characters not making it into the SAS dataset because they cannot be transcoded from UTF-8 to the session encoding.  Any single byte encoding can only represent 256 characters and there are many more characters than that which could be present in a file using UTF-8 encoding.

 

If the file is getting transferred properly but SAS cannot read it then you might have to fix the file first.  For example you might be seeing this issue where the encoding of the XML is not actually UTF-8 (again something you can check by looking at actual files and the bytes that are in them) .

 

Tom
Super User Tom
Super User

Note also that splitting UTF8 strings into arbitrary substrings can cause transcoding issues when using UTF8 encoding to process the strings.

 

Since some UTF8 characters use more than one byte if you split the string in the middle of character you will have an invalid character at the end of the first string and beginning of the second string.

 

And a separate issue is that I believe Oracle does have character variables with lengths based on the number of CHARACTERS they hold instead of the number of BYTES.  This can cause confusion when transferring strings to SAS character variables where length is only defined by the number of BYTES.  If the variable in ORACLE can hold 5 characters that might mean up to 20 bytes depending on the characters being stored.  If that gets copied into a 5 byte long variable in the SAS dataset then you could lose some data.  And also again get incomplete UTF-8 characters (Invalid UTF-8 codes).

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 778 views
  • 2 likes
  • 3 in conversation