Hello,
we got a SAS XPORT file from our partner, which appears to be encoded in wlatin1. The issue is that our SAS runs in UTF-8. I tried the following:
libname adamxpt xport "&study_bmfs_root\Raw\ADaM\Ph2-3\adsl.xpt" access=readonly;
libname adamsas "&study_bmfs_root\Transform\ADaM" inencoding=wlatin1;
proc datasets noprint;
copy in=adamxpt out=adamsas;
select adsl;
run;
This converts to unicode fine, but truncates the variables with multibyte characters in them. I then tried the following:
libname adamxpt xport "&study_bmfs_root\Raw\ADaM\Ph2-3\adsl.xpt" access=readonly;
libname adamsas "&study_bmfs_root\Transform\ADaM" inencoding=wlatin1 outencoding=wlatin1;
proc datasets noprint;
copy in=adamxpt out=adamsas;
select adsl;
run;
to first convert to wlatin1 .sas7bdat and then use CVP engine to deal with the variable lengths, but this fails with transcoding errors.
Is there a better way than starting a separate wlatin1 SAS instance and converting the .xpt to .sas7bdat there?
I can use the XPORT engine to copy your dataset when I run it in SAS with encoding=UTF-8.
libname x xport "&fname";
proc copy inlib=x outlib=work;
run;
proc contents data=test; run;
data _null_;
set test;
put var= var=$hex.;
run;
The '80'x character is copied as is (it is NOT truncated).
So figure out how long you need to make the variable.
proc sql noprint;
select max(length(kcvt(trim(var),'wlatin1','utf-8'))) into :vlength trimmed
from test(encoding='any')
;
quit;
%put &=vlength;
Then use KCVT() to convert the values yourself.
data want;
length var $&vlength ;
set test(encoding='any');
var=kcvt(var,'wlatin1','utf-8');
run;
Result:
I am a bit confused about this statement when your subject like says the source file is Wlatin1:
but truncates the variables with multibyte characters in them.
Wlatin1 should not contain any multibyte characters. What you are likely seeing is that the existing variable lengths are too short when the single byte characters are transcoded to multiple by for UTF.
You don't show any example of using CVP so I suspect you used the CVP option but not the CVPBYTES option. By default the CVP option "expands" using a multiplier of 1.5 to increase the lengths of character variables. I think that if you add CVPBYTES=2 then each character is expanded to two bytes and should work with UTF-8.
@js5 I've tried quite a few approaches but couldn't find a resolution. It looks to me like a limitation of the xport libname engine.
If you still have access to an environment where SAS executes in single byte mode then you could first unpack the transport file there and then in another multibyte session use the libname cvp engine to convert the file to utf8.
@ballardw: The cvp engine is for SAS files but unfortunately the xport engine already creates the SAS file with the truncated value so it's too late. Proc Cimport would allow for a multiplier but it doesn't work for transport files that have been created with the xport engine. And also the SAS supplied macro %XPT2LOC() doesn't get it right.
If someone wants to give it a go in a multibyte session: I created the attached file adsl.xpt in a single byte session using below code.
%put %sysfunc(getoption(encoding,keyword));
libname adamxpt xport "c:\temp\transport\adsl.xpt";
data work.test;
length var $1;
var='80'x;
run;
proc print data=work.test;
run;
data adamxpt.test;
set work.test;
run;
204 %put %sysfunc(getoption(encoding,keyword)); ENCODING=WLATIN1 205 libname adamxpt xport "c:\temp\transport\adsl.xpt"; NOTE: Libref ADAMXPT was successfully assigned as follows: Engine: XPORT Physical Name: c:\temp\transport\adsl.xpt 206 data work.test; 207 length var $1; 208 var='80'x; 209 run; NOTE: The data set WORK.TEST has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds 210 proc print data=work.test; 211 run; NOTE: There were 1 observations read from the data set WORK.TEST. NOTE: PROCEDURE PRINT used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 212 data adamxpt.test; 213 set work.test; 214 run; NOTE: There were 1 observations read from the data set WORK.TEST. NOTE: The data set ADAMXPT.TEST has 1 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
I can use the XPORT engine to copy your dataset when I run it in SAS with encoding=UTF-8.
libname x xport "&fname";
proc copy inlib=x outlib=work;
run;
proc contents data=test; run;
data _null_;
set test;
put var= var=$hex.;
run;
The '80'x character is copied as is (it is NOT truncated).
So figure out how long you need to make the variable.
proc sql noprint;
select max(length(kcvt(trim(var),'wlatin1','utf-8'))) into :vlength trimmed
from test(encoding='any')
;
quit;
%put &=vlength;
Then use KCVT() to convert the values yourself.
data want;
length var $&vlength ;
set test(encoding='any');
var=kcvt(var,'wlatin1','utf-8');
run;
Result:
@Tom Thanks for testing and your approach works for me as well.
Using an editor to look into the transport .xpt file created using the xport libname engine it appears only the source OS gets stored but not the source encoding and though the target session encoding gets used when creating the table while maintaining the source hex values. ...but then of course trying to print an ANSI hex value stored in a variable of a utf8 encoded table won't display the ANSI representation of this value.
Do you know if this need for post processing the extracted table is anywhere documented? I couldn't find anything.
I also couldn't find a way to select another encoding for the extracted table (like wlatin1) which also would circumvent the issue.
There is no place to store an encoding in a V5 transport file. I don't think that the extended V7/8/9 format transport files supported by the %LOC2XPT and %XPT2LOC autocall macros have a place to store an encoding either.
So just use the ENCODING= dataset option, like in my example code. That might prevent you from using PROC COPY. Although you might want to investigate the OUTENCODING option and see if that can work with PROC COPY.
Your dataset with a one byte character variable demonstrates the limitations of the CVP engine. To work that particular dataset you would have to set the factor to 3, which would make every character variable three times as long. And you might even be able to file some encoding where a single byte character would require 4 bytes in UTF-8 encoding.
In general if you want to make dataset to send to the FDA it is probably best to limit the characters used to 7-bit ASCII codes so that encoding does not come into play.
This works for me too, thanks! Given that it is a one-off conversion and it turns out we have a server instance running in wlatin1, I believe I will go with the approach suggested by @Patrick: unpack the file in wlatin1 system and use CVP engine to process it under UTF-8 instance.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.