Re: XML Encoding UTF8 Issue

jklein_271 · Posted 03-29-2013 04:43 PM

I'm currently trying to import data from an xml document with UTF-8 encoding. To do this, I'm using the xml libname engine along with the xmlmap= option to point to an xml map generated using the 9.3 xml mapper. The xml map specifies encoding="UTF-8", but characters specific to the UTF-8 character set are not presented properly in the data sets residing in the library resulting from the xml libname statement. If I try and use the encoding=utf8 data set option when pulling from the XML library to the work library, the encoding on the dataset is then set accurately (per proc contents), but it's already too late as the XML library has already translated the character set incorrectly.

As I obviously can't change the session encoding with the encoding system option except at startup, the only way I've been able to get this to work is by using the SAS 9.3 unicode server. I can replicate this by adding the -encoding option to the sasv9.cfg that Enterprise Guide uses at startup, but I do not want to change all of my projects to UTF-8. Is there a better solution to my UTF-8 dilemma? I've looked through the xml libname documentation and all of the encoding options are specifically for writing files and not reading files. I can't find any other way to set the encoding for the engine so that the data is translated properly from the beginning. I guess I expected that if the encoding was specified at the top of the xml map used by the xml libname engine, it would drive the encoding of the data sets within the library. If I am stuck with this situation, is there any way for me to have this one project start up in UTF-8 without impacting every other Base SAS and Enterprise Guide session? I'm open to any suggestions as I've been working on this all day and I've run out of ideas.

Running SAS 9.3 TS Level 1 M0

I have access to EG 4.3 / EG 5.1

jklein_271 · Posted 04-01-2013 08:43 AM

A quick follow up with things that I've tried since...

1) The %include <sas_program> encoding="utf-8" approach: I think I was somehow hoping that the SAS program would execute in a different encoding with this approach. Unfortunately, the encoding option here appears to apply only when you use %include with an external file (not a SAS program); it specifies the character set of the external file, not the session encoding a SAS program would execute in.

2) I was more hopeful with my 2nd idea, but same result (failure). I added the encoding="utf-8" option to the filename statement referencing the XML file that is actually encoded in UTF-8. The result here is a transcoding error for each data set created with the xmlv2 libname engine. Everything keeps coming back to the fact that there are almost no applicable options for the xml libname engine when importing (there are lots dealing with character sets, encoding, etc. when exporting an XML file from a SAS data set). The library established with the XML libname statement seems to always drive off the session encoding regardless of whatever options I set for the external file.

3) I also tried to set the encoding= system option in the "execute when SAS server connects" section of the EG options. Unfortunately, it seems that by the time EG executes this SAS code, it's already too late for SAS system options that need to be established before session start. I guess this approach was always set up for failure, as my plan here was to drive off the project name automatic macro variable in EG to conditionally set the session encoding based on file name. This way I could have this one project open up in UTF-8 encoding while my remaining EG projects and base SAS code would remain in wlatin1 encoding.

It's looking more and more like my only option here is to schedule my code with Windows scheduler using base SAS and the unicode config. My big downside here is that I already have a project I can schedule that contains conditional logic using EG. Converting all of this conditional logic to base SAS macro error logic will take some time and will not be as clean as my EG setup.

Thanks to anyone who reads this rant and/or can provide any insight.

- Jordan

BillM_SAS · Posted 04-01-2013 12:55 PM

Can you post the XML file you are trying to import and the XML Map file being used for the import?

jklein_271 · Posted 04-02-2013 07:37 PM

Thanks for replying, Bill. I attached the XML file and a condensed version (focusing on the field that's having an issue with this XML file) to the original post. The code below works only if I execute using SAS 9.3 with unicode support. I have been unable to successfully load the file in Enterprise Guide (wlatin1 default session encoding) with any option I've found in the SAS 9.3 OnlineDoc. Below the code is the error I'm getting in EG. The specific character that the transcoding is having a problem with is the 4th "word" in the string (A Randomized Phase II) captured in the official_title field: the HTML character entity in decimal format (&#8545;), which should be represented as a roman numeral II symbol (Ⅱ). After I load this correctly in base unicode (I can see the correct II character and can read from the data set), I am unable to transcode to wlatin1, so I'm guessing the xmlv2 libname engine is having the same issue, as it appears I can't control which encoding the xml libname actually uses (it uses the session encoding no matter what). It can't transcode to wlatin1, so it bombs as soon as I try to read from the xml libname data set or view it. In addition, I can create an equivalent v1.2 xml map and use the xml engine as opposed to the xmlv2 engine, and it does complete without any errors. Unfortunately, it converts the II symbol to a different character (which isn't equivalent) with no warning or error in the log. As far as I know, Ⅱ is unique to the UTF-8 character set.

filename SXLELIB "C:\NCT01374750.xml";

filename SXLEMAP "C:\test_v21_utf8.map";

libname SXLELIB xmlv2 xmlmap=SXLEMAP access=READONLY;

data CLINICAL_STUDY;

set SXLELIB.CLINICAL_STUDY;

run;

EG Error:

ERROR: Some code points did not transcode.

occurred at or near line 19, column 230

ERROR: XML parsing error. Please verify that the XML content is well-formed.

BillM_SAS · Posted 04-03-2013 05:07 PM

If I understand you correctly:

XML file contains roman numeral 2 (U+2161 or HTML &#8545;).
Roman numeral 2 is available in UTF-8.
Roman numeral 2 is not available in WLATIN1.
You run SAS with UTF-8 session encoding and can see Roman numeral 2 in the data.
You want to run SAS with WLATIN1 session encoding and import the same XML file.
The XMLV2 LIBNAME engine has errors trying to transcode Roman numeral 2 into the current SAS session of WLATIN1.

As far as I can tell, it all comes down to the SAS session encoding. The XMLV2 LIBNAME engine transcodes the imported data to the SAS session encoding. Even the ENCODING= data set option notes, "If the session encoding and the encoding that is specified in the file are different, SAS transcodes the data to the session encoding."
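Since everything hinges on the session encoding, it can help to confirm what the current session is actually running in before testing any import. A minimal sketch, assuming nothing beyond standard Base SAS (PROC OPTIONS and the SYSENCODING automatic macro variable):

```sas
/* Write the current value of the ENCODING system option to the log */
proc options option=encoding;
run;

/* The automatic macro variable SYSENCODING reports the same session encoding */
%put NOTE: Session encoding is &sysencoding;
```

If the log reports WLATIN1 here, the XMLV2 engine will try to transcode the imported data to WLATIN1, which is where U+2161 fails.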

I'm checking with the Enterprise Guide experts to see if they have any recommendations concerning your issue.

jklein_271 · Posted 04-03-2013 09:20 PM

Yes, Bill. I think you've summarized it well. In a perfect world, I'd be able to tell EG that the XML I'm consuming is UTF-8 and it would accurately transcode to WLATIN1. I'm not an HTML character set expert so I'm guessing that this may not be possible if my data contains characters that are unique to the UTF-8 code set.

That being said, the next best thing would be to correctly import the UTF-8 XML data and store in UTF-8 encoded data sets all while in the default WLATIN1 encoded session. This currently seems like a stretch as I can't apply any encoding options to either the libname statement or data set if it is connected to the xml or xmlv2 libname engine. Some of this syntax may be off as I'm typing on the fly but this is what I envisioned.

filename SXLELIB "C:\NCT01374750.xml" encoding="UTF-8";

filename SXLEMAP "C:\test_v21_utf8.map";

libname SXLELIB xmlv2 xmlmap=SXLEMAP access=READONLY;

libname UTF8 "<path>" outencoding="UTF-8";

data UTF8.CLINICAL_STUDY;

set SXLELIB.CLINICAL_STUDY (encoding="UTF-8");

run;

This way I'm telling SAS that the filename (XML file) is encoded as UTF-8, the resulting xml library is then UTF-8, and I want the permanent data set being stored in UTF-8 encoding as well. As the encoding data set option is not allowed on the set statement above as it's relating to the XML libname, this approach is just a pipe dream.

If all else fails, a way to make a specific project launch in UTF-8 would be sufficient. I can schedule a .sas program with windows scheduler and load the appropriate unicode config file, but I then lose all of the conditional processing I am leveraging in EG.

Thanks again, Bill.

- Jordan

ChrisHemedinger · Posted 04-04-2013 08:51 AM

I think that Bill has done a great job of diagnosing the issue here, and the limitation: your SAS session encoding plays a big part in this, even though filerefs and data sets also support "local encoding" overrides.

I can see a couple of options.

First, if you need to routinely work with characters that map outside of wlatin1, consider using -ENCODING UTF-8 as a SAS startup option. If you don't want to change your server from wlatin1 for most scenarios, consider setting up a second logical SAS workspace definition (ex: "SASApp UTF8") that looks exactly like your main SAS workspace except for the encoding options. You can use it specifically for this operation to get your data into SAS, and then use your other workspace definition for everything else. This can be done in SAS Management Console; no additional configuration, software, or hardware is needed. You're simply defining a different SAS Workspace with an alternative startup command.

Or, if the XML you're reading contains characters that you don't care about, consider using a DATA step to "preprocess" the XML before importing it. You can have a DATA step that reads the XML file byte-by-byte and when it encounters a character that is out of range, replace it with another character or omit it. This approach might be useful for files that contain funky quote characters that visually represent what looks like a standard character, but for some reason a more "exotic" Unicode character was used.

Chris


RogerSpeas · Posted 04-05-2013 03:28 PM

Hi Chris,

I might be unclear, but did Jordan mention scheduling his SAS job on the Windows scheduler? Could he be using a SAS workstation (local)? Would it be wise to suggest changing back and forth between the different SAS registrations?

"C:\Program Files\SASHome\SASFoundation\9.3\sas.exe" -CONFIG "C:\Program Files\SASHome\SASFoundation\9.3\nls\u8\sasv9.cfg" -regserver

"C:\Program Files\SASHome\SASFoundation\9.3\sas.exe" -regserver

So if he instead schedules his project, it would use the appropriate configuration. My follow-through thought was that Jordan might be able to edit the VBScript and insert the two registration commands before and after the EG-generated VBScript...

However, in Win7, I have shortcuts for several different configurations, but they need to be Run As Administrator. So the problem that I see with adding -regserver to the script is that it needs to run as an Administrator. Ideas?

-Roger

jklein_271 · Posted 04-05-2013 04:03 PM

Thanks to all for replying. I've learned quite a bit from this discussion.

That was actually a question I was just about to ask (local vs client/server setup). I've previously worked with SAS metadata server and figured the management console would not be a solution for me this time as I'm currently on just a local server setup. I'm running Windows 7 Professional with the SAS 9.3 Analytics Pro Package with just the EG front end with local server only.

To answer Chris's 2nd question, I will need to retain all characters in the data (including the ones unique to UTF-8). That being said, I have just one more round of questions before I am set:

1) Chris: I guess I'm just used to the default WLATIN1 session encoding. From what I've been able to gather, UTF-8 is an international standard that contains all of the characters in the WLATIN1 code set with the addition of characters (mostly related to academia?) unique to UTF-8. If this is in fact true, are there any downsides to changing my default session to UTF-8 permanently? I'm assuming since UTF-8 is the broader code set that transcoding my existing WLATIN1 data sets to UTF-8 should be pretty seamless and happen as I update permanent libraries. If you can point me in any direction, I'd appreciate it. I have the SAS 9.3 OnlineDoc and have been trying to read up on encoding, etc in the NLS documentation and any other PDFs that discuss encoding as it relates to SAS sessions, files, and data sets.

2) Roger: As I am running Windows 7 and have no problems running as administrator, can you please expand on this shortcut for running several different configs? Depending on Chris's response to number 1 (or what I find out this weekend about changing my encoding default to UTF-8 permanently), it would be great to have the option to have the VB script that launches my scheduled project control the default encoding session on start up. I didn't even think to check the VB script as I usually just let SAS create it in the folder and never touch it again. This could be extremely helpful.

Thanks again to all. I was spinning my wheels and getting nowhere as I have been working with SAS for a long time, but loading XML data, character code sets, and encoding options are very much new to me.

- Jordan

RogerSpeas · Posted 04-05-2013 08:59 PM

Late on Friday, I might have been overly optimistic that Chris could have replied. I had a nagging suspicion that -regserver would not work. So, I have to relearn something I already know. The meat of the problem: -regserver ignores all other options on the command string/target. A possible solution is to use environment variables, and thus you won't need to use regserver. You can skip down to "Continue here".

What’s regserver…

Let's start with an easy example. If you want to run SAS32 or SAS64, you just select the appropriate icon from the SAS program group. Question: which of the two will EG use, the 32 or the 64? The answer is whichever one is currently registered. So, if I want EG to run the 32 bit version, I would register the 32 bit version of SAS, which is what the option -regserver does. I believe regserver commits to the registry which SAS executable is to execute, kind of like the setting for what opens .sas files, EG or SAS.

One can simply make a copy of the appropriate SAS icon and edit the properties, by adding -regserver to the end of the command string/target. The options you add to the command string (target) typically supplement the configuration startup options (not an autoexec). With this option, the icon will no longer open the SAS display manager (windows) but registers which version of SAS EG will use.

Rummaging through the SAS Program files group... I have copies of my SAS icons: SAS 9.2, SAS 9.3 (32), SAS 9.3 (64). I renamed them with Regserver in the title and changed the command string/target to include -regserver. This allows me to continue to use EG 4.2 and EG 4.3 on the same machine. So, you would think, why not add the -encoding utf-8 option as well?

If you have installed the Unicode language set, you would also have an icon, SAS 9.3 (Unicode Server), which uses the same command string as SAS 9.3 but also includes a -CONFIG option that redirects SAS to use the configuration file in the u8 directory (which contains -encoding). The problem: options such as -encoding and -config are overlooked when you use the -regserver option.

Continue here….

I like my SAS to open with a different default path, so I can add the -sasinitialfolder option to my SAS icon commands. To get EG to recognize a different path, I have to get medieval with the default configuration file and add environment variables. You can set an environment variable with:

SET SIFVar="c:\temp" or SETX /m SIFVar "c:\temp".

And then, I add the following statement to the top of the default configuration file...
-SASINITIALFOLDER '!SIFVar'.

I can now change the default path without having to edit the configuration file.

You could create a Windows environment variable, let's say SASEncode; you'll probably want a default value of WLATIN1. Create two batch files, one that sets the SASEncode value to UTF-8 and the other to WLATIN1. So instead of registering different versions of SAS, you would be registering different values in the environment variable, which would subsequently be read by the SAS session that EG starts up.

-ENCODING !SASEncode
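Once a session has been started this way, a quick check from inside SAS can confirm that the environment variable was picked up. This is only a sketch: SASEncode is the variable name suggested above, and it assumes the variable is defined in the environment that launches SAS.

```sas
/* Value of the Windows environment variable as seen by this SAS session */
%put NOTE: SASEncode is set to %sysget(SASEncode);

/* ENCODING option the session actually started with */
%put NOTE: Session encoding is %sysfunc(getoption(encoding));
```

If the two values disagree, the configuration file that EG's SAS session reads is probably not the one that was edited.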

-Roger

jklein_271 · Posted 01-29-2014 10:50 AM

To revive this old thread of mine, can anyone provide any info relating to my question below? At this point, it seems to make the most sense to just change my default encoding for all sessions from WLATIN1 to UTF-8. It seems safe as I should be able to "downgrade" to WLATIN1 using transcoding if I ever need to. That being said, I can't "upgrade" to UTF-8 as it is a more expansive character set. I'd feel better about the config change (which I'd imagine will transcode each of my datasets as they are processed) if someone with more experience with encoding and locale settings in general can alleviate my concerns. Thanks!

"I guess I'm just used to the default WLATIN1 session encoding. From what I've been able to gather, UTF-8 is an international standard that contains all of the characters in the WLATIN1 code set with the addition of characters (mostly related to academia?) unique to UTF-8. If this is in fact true, are there any downsides to changing my default session to UTF-8 permanently? I'm assuming since UTF-8 is the broader code set that transcoding my existing WLATIN1 data sets to UTF-8 should be pretty seamless and happen as I update permanent libraries. If you can point me in any direction, I'd appreciate it. I have the SAS 9.3 OnlineDoc and have been trying to read up on encoding, etc in the NLS documentation and any other PDFs that discuss encoding as it relates to SAS sessions, files, and data sets."

ChrisHemedinger · Posted 01-29-2014 11:48 AM

You should review this paper:

http://support.sas.com/resources/papers/proceedings13/025-2013.pdf

It covers the concepts of internationalization, including the text/character manipulation issues that you might encounter with UTF-8.
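One practical point when moving existing WLATIN1 data sets into a UTF-8 session: accented characters that occupy one byte in WLATIN1 take two bytes in UTF-8, so character variables can be truncated during transcoding. A minimal sketch of one way to copy a data set while expanding character lengths, using the CVP engine. It assumes the session is already running with UTF-8 encoding; the library paths and the data set name mydata are hypothetical.

```sas
/* Hypothetical paths: existing WLATIN1 library and a new target library */
libname legacy cvp "C:\data\wlatin1_lib";  /* CVP pads character variable lengths on read */
libname u8 "C:\data\utf8_lib";

/* NOCLONE lets the copy take the session encoding instead of the source encoding */
proc copy in=legacy out=u8 noclone;
  select mydata;
run;

/* The Encoding attribute in the output shows the result */
proc contents data=u8.mydata;
run;
```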

Chris


agoldma · Posted 07-06-2017 02:19 PM

If the encoding problem happened upon the creation of the XML, then it's probably necessary to "preprocess" the XML text prior to reading it (like Chris suggested above).

Search for things LIKE "&#___;" ... that are bad translations of HTML codes

I removed them with this code:

line1_fixed = prxchange('s/&#([0-9])+;//',-1, line1);
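For context, a sketch of how that statement might sit inside a complete preprocessing DATA step that reads the raw XML line by line and writes a cleaned copy. The filerefs and paths are hypothetical, and the pattern is simplified to `[0-9]+`, which matches the same text as the one above. Note that this deletes every decimal character reference, so it is only suitable when those characters are not needed.

```sas
/* Hypothetical input and output files */
filename rawxml "C:\input.xml";
filename cleanxml "C:\input_clean.xml";

data _null_;
  infile rawxml lrecl=32767 truncover;   /* lines longer than 32767 bytes are truncated */
  file cleanxml lrecl=32767;
  length line1 line1_fixed $32767;
  input line1 $char32767.;               /* $CHAR keeps leading blanks (indentation) */
  /* Remove every decimal numeric character reference from the line */
  line1_fixed = prxchange('s/&#[0-9]+;//', -1, line1);
  len = lengthn(line1_fixed);            /* length without trailing blanks */
  put line1_fixed $varying32767. len;    /* write only the meaningful part */
run;
```

The cleaned file can then be referenced by the fileref used in the XMLV2 libname statement.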

mojerry2 · Posted 03-05-2015 10:01 AM

Did you find a solution?

I'm trying to import an XML file encoded in UTF-8, but it tells me it's going wrong.

The line it doesn't like: <WHOLENAME>Οργάνωση</WHOLENAME>

jakarman · Posted 03-08-2015 03:07 PM

Mojerry, change your SAS session to UTF-8 (DBCS) and run that job again. It will work. The single-byte Latin-1 approach does not support Greek or Cyrillic characters.

---->-- ja karman --<-----
