Solved: Re: Importing a wlatin1 XPT file on a system running in UTF-8

js5 · Posted 04-19-2024 06:29 AM

Hello,

we got a SAS XPORT file from our partner, which appears to be encoded in wlatin1. The issue is that our SAS runs in UTF-8. I tried the following:

libname adamxpt xport "&study_bmfs_root\Raw\ADaM\Ph2-3\adsl.xpt" access=readonly;
libname adamsas "&study_bmfs_root\Transform\ADaM" inencoding=wlatin1;

proc datasets noprint;
	copy in=adamxpt out=adamsas;
	select adsl;
run;

This converts to unicode fine, but truncates the variables with multibyte characters in them. I then tried the following:

libname adamxpt xport "&study_bmfs_root\Raw\ADaM\Ph2-3\adsl.xpt" access=readonly;
libname adamsas "&study_bmfs_root\Transform\ADaM" inencoding=wlatin1 outencoding=wlatin1;

proc datasets noprint;
	copy in=adamxpt out=adamsas;
	select adsl;
run;

to first convert to wlatin1 .sas7bdat and then use CVP engine to deal with the variable lengths, but this fails with transcoding errors.

Is there a better way than starting a separate wlatin1 SAS instance and converting the .xpt to .sas7bdat there?

Tom · Posted 04-21-2024 09:27 AM

I can use the XPORT engine to copy your dataset when I run it in SAS with encoding=UTF-8.

libname x xport "&fname";
proc copy inlib=x outlib=work;
run;
proc contents data=test; run;

data _null_;
 set test;
 put var= var=$hex.;
run;

The '80'x character is copied as is (it is NOT truncated).

So figure out how long you need to make the variable.

proc sql noprint;
select max(length(kcvt(trim(var),'wlatin1','utf-8'))) into :vlength trimmed
from test(encoding='any')
;
quit;
%put &=vlength;

Then use KCVT() to convert the values yourself.

data want;
  length var $&vlength ;
  set test(encoding='any');
  var=kcvt(var,'wlatin1','utf-8');
run;

Result:

View solution in original post

ballardw · Posted 04-19-2024 11:06 AM

I am a bit confused about this statement when your subject like says the source file is Wlatin1:

but truncates the variables with multibyte characters in them.

Wlatin1 should not contain any multibyte characters. What you are likely seeing is that the existing variable lengths are too short when the single byte characters are transcoded to multiple by for UTF.

You don't show any example of using CVP so I suspect you used the CVP option but not the CVPBYTES option. By default the CVP option "expands" using a multiplier of 1.5 to increase the lengths of character variables. I think that if you add CVPBYTES=2 then each character is expanded to two bytes and should work with UTF-8.

Patrick · Posted 04-21-2024 12:16 AM

@js5 I've tried quite a few approaches but couldn't find a resolution. It looks to me like a limitation of the xport libname engine.

If you still have access to an environment where SAS executes in single byte mode then you could first unpack the transport file there and then in another multibyte session use the libname cvp engine to convert the file to utf8.

@ballardw: The cvp engine is for SAS files but unfortunately the xport engine already creates the SAS file with the truncated value so it's too late. Proc Cimport would allow for a multiplier but it doesn't work for transport files that have been created with the xport engine. And also the SAS supplied macro %XPT2LOC() doesn't get it right.

If someone wants to give it a go in a multibyte session: I created the attached file adsl.xpt in a single byte session using below code.

%put %sysfunc(getoption(encoding,keyword));
libname adamxpt xport "c:\temp\transport\adsl.xpt";
data work.test;
	length var $1;
	var='80'x; 
run;
proc print data=work.test;
run;
data adamxpt.test;
	set work.test;
run;

204  %put %sysfunc(getoption(encoding,keyword));
ENCODING=WLATIN1
205  libname adamxpt xport "c:\temp\transport\adsl.xpt";
NOTE: Libref ADAMXPT was successfully assigned as follows:
      Engine:        XPORT
      Physical Name: c:\temp\transport\adsl.xpt
206  data work.test;
207    length var $1;
208    var='80'x;
209  run;

NOTE: The data set WORK.TEST has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


210  proc print data=work.test;
211  run;

NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


212  data adamxpt.test;
213    set work.test;
214  run;

NOTE: There were 1 observations read from the data set WORK.TEST.
NOTE: The data set ADAMXPT.TEST has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Tom · Posted 04-21-2024 09:27 AM

I can use the XPORT engine to copy your dataset when I run it in SAS with encoding=UTF-8.

libname x xport "&fname";
proc copy inlib=x outlib=work;
run;
proc contents data=test; run;

data _null_;
 set test;
 put var= var=$hex.;
run;

The '80'x character is copied as is (it is NOT truncated).

So figure out how long you need to make the variable.

proc sql noprint;
select max(length(kcvt(trim(var),'wlatin1','utf-8'))) into :vlength trimmed
from test(encoding='any')
;
quit;
%put &=vlength;

Then use KCVT() to convert the values yourself.

data want;
  length var $&vlength ;
  set test(encoding='any');
  var=kcvt(var,'wlatin1','utf-8');
run;

Result:

Patrick · Posted 04-21-2024 11:38 PM

@Tom Thanks for testing and your approach works for me as well.

Using an editor to look into the transport .xpt file created using the xport libname engine it appears only the source OS gets stored but not the source encoding and though the target session encoding gets used when creating the table while maintaining the source hex values. ...but then of course trying to print an ANSI hex value stored in a variable of a utf8 encoded table won't display the ANSI representation of this value.

Do you know if this need for post processing the extracted table is anywhere documented? I couldn't find anything.

I also couldn't find a way to select another encoding for the extracted table (like wlatin1) which also would circumvent the issue.

Tom · Posted 04-21-2024 11:52 PM

There is no place to store an encoding in a V5 transport file. I don't think that the extended V7/8/9 format transport files supported by the %LOC2XPT and %XPT2LOC autocall macros have a place to store an encoding either.

So just use the ENCODING= dataset option, like in my example code. That might prevent you from using PROC COPY. Although you might want to investigate the OUTENCODING option and see if that can work with PROC COPY.

Your dataset with a one byte character variable demonstrates the limitations of the CVP engine. To work that particular dataset you would have to set the factor to 3, which would make every character variable three times as long. And you might even be able to file some encoding where a single byte character would require 4 bytes in UTF-8 encoding.

In general if you want to make dataset to send to the FDA it is probably best to limit the characters used to 7-bit ASCII codes so that encoding does not come into play.

js5 · Posted 04-25-2024 05:40 AM

This works for me too, thanks! Given that it is a one-off conversion and it turns out we have a server instance running in wlatin1, I believe I will go with the approach suggested by @Patrick: unpack the file in wlatin1 system and use CVP engine to process it under UTF-8 instance.

Registration is open

SAS Training: Just a Click Away