BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
xxformat_com
Barite | Level 11

Hi,

 

 When importing an Excel file, the excel engine multiplies by 3 the size of character variables. If we attempt to reduce the length using dbsastype= option, values get truncated.

 

Do you know how to avoid id or at least how to reduce the length afterwards without altering the data ?

 

proc export data=sashelp.class 
            outfile="&test./reporting/test.xlsx" 
            dbms=xlsx 
            replace;
    sheet=class;
run;

libname demo xlsx "&test./reporting/test.xlsx";
*libname demo excel "&test./reporting/test.xlsx";

ods select position;
proc contents data=demo.class varnum;
run;

proc print data=demo.class;
run;
libname demo clear;

excel.JPG

It is ok with xlsx engine (see below). I'm just trying to understand excel engine here.

xlsx.JPG

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

The Excel engine uses a module provided by Microsoft, so don't be surprised there (whenever you're involved with stuff coming from Redmond, WA, don't be surprised). It probably suspects the existence of UTF characters and makes room for them.

The XLSX engine is created by following the documentation of the .xlsx file format, and programmed by the SAS people. 

View solution in original post

13 REPLIES 13
Kurt_Bremser
Super User

The Excel engine uses a module provided by Microsoft, so don't be surprised there (whenever you're involved with stuff coming from Redmond, WA, don't be surprised). It probably suspects the existence of UTF characters and makes room for them.

The XLSX engine is created by following the documentation of the .xlsx file format, and programmed by the SAS people. 

xxformat_com
Barite | Level 11

@Kurt_Bremser Yes I agree. There are just a few situations where excel dbms/engine is able to do things xlsx dbms/engine  is not able to do like appending data in an existing dataset available in a given Excel worksheet.

SASKiwi
PROC Star

A reliable fix for this is to save your Excel spreadsheet as a CSV then change your PROC IMPORT to read the CSV.

Quentin
Super User

What is the encoding of your sas session? 

%put &=sysencoding ;

I can replicate your results if I use a multi-byte encoding such as UTF-8, but if I use a single-byte encoding such as WLATIN1 I get a length of 1 from the EXCEL engine. 

 

If your SAS session is multi-byte, I think the EXCEL engine is actually doing the right thing, it will allow you to import multi-byte character values that might appear in your excel file.  I suppose if you don't like this, you could always write a macro that would look up the length of each character variable, then use a DATA step with a LENGTH statement to set the lengths to 1/3 of the imported length.

 

BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
xxformat_com
Barite | Level 11

Yes, I'm using UTF-8.

What was surprising me is that if I reduce the length in a data step to 10, the value with 10 caracters gets truncated.

Ksharp
Super User
You must have multi-byte charactes.
In unicode encoding there are three byte for one character(Like Chinese Japanese).
Therefore :
7*3=21
1*3=3

7 and 1 are for English SAS(ASCII)
21 and 3 are for Chinese SAS(UNICODE)

But 7*2=14 and 1*2=2 for Chinese SAS(EUC-CN)

All these are depended on your sas encoding.
xxformat_com
Barite | Level 11

I've tested it with sashelp.class which has no multibytes characters tho.

Ksharp
Super User
But you are using UNICODE SAS.
Any character (even a English character) in UNICODE encoding would have stored length 3.
Tom
Super User Tom
Super User

@Ksharp wrote:
But you are using UNICODE SAS.
Any character (even a English character) in UNICODE encoding would have stored length 3.

Not true.  All printable 7-bit ASCII characters use only one byte in UTF-8.  That is one of the advantages of it over double byte encodings (commonly used for Chinese and Japanese) that always use two bytes for every character.

 

Other UTF-8 characters require 2,3 or 4 bytes.

Quentin
Super User

@xxformat_com wrote:

I've tested it with sashelp.class which has no multibytes characters tho.


Can you show the code and log you used to test?  

 

When I test with sashelp.class using the EXCEL engine to import, it imports NAME with a length of $21, but if I then use a DATA step to cut it back to a length of $7, nothing is truncated.  There is a warning in the log about possible truncation, but that warning can be turned off by setting options varlenchk=nowarn.

 

Code:

%let test=Q:\Junk ;

%put &=sysencoding ;

proc export data=sashelp.class 
            outfile="&test./test.xlsx" 
            dbms=xlsx 
            replace;
    sheet=class;
run;

libname demo excel "&test./test.xlsx";

proc contents data=demo.class varnum;
run;

***options varlenchk=nowarn ;
data class ;
  length name $7 ;
  set demo.class ;

  put name=;
run ;

proc contents data=class ;
run ;

libname demo clear ;

log:

1    %let test=Q:\Junk ;
2
3    %put &=sysencoding ;
SYSENCODING=utf-8
4
5    proc export data=sashelp.class
6                outfile="&test./test.xlsx"
7                dbms=xlsx
8                replace;
9        sheet=class;
10   run;

NOTE: The export data set has 19 observations and 5 variables.
NOTE: "Q:\Junk\test.xlsx" file was successfully created.
NOTE: PROCEDURE EXPORT used (Total process time):
      real time           0.11 seconds
      cpu time            0.01 seconds


11
12   libname demo excel "&test./test.xlsx";
NOTE: Libref DEMO was successfully assigned as follows:
      Engine:        EXCEL
      Physical Name: Q:\Junk/test.xlsx
13
14   proc contents data=demo.class varnum;
15   run;

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


16
17   ***options varlenchk=nowarn ;
18   data class ;
19     length name $7 ;
20     set demo.class ;
21
22     put name=;
23   run ;

WARNING: Multiple lengths were specified for the variable name by input data set(s). This can cause truncation of data.
name=Alfred
name=Alice
name=Barbara
name=Carol
name=Henry
name=James
name=Jane
name=Janet
name=Jeffrey
name=John
name=Joyce
name=Judy
name=Louise
name=Mary
name=Philip
name=Robert
name=Ronald
name=Thomas
name=William
NOTE: There were 19 observations read from the data set DEMO.class.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


24
25   proc contents data=class ;
26   run ;

NOTE: PROCEDURE CONTENTS used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Quentin
Super User

I played a bit more with this,  in a utf-8 session with the string str="€1".  In a UTF-8 session the euro symbol gets 3 bytes, so SAS makes str $4.  Three bytes for the euro symbol, one byte for the 1.  

 

If I write it out to .xlsx and then read it in with the EXCEL engine, it comes in as $6.  So the EXCEL engine is giving each character 3 bytes, regardless of whether or not it needs it.

 

If I read it with the XLSX engine it comes in as $4.  So apparently the XLSX engine is either assessing each character to figure out how many bytes it will need, or it is using some other metadata in the xlsx file to determine the number of bytes that SAS will need to store the data. 

 

Test code like:

%let test=Q:\Junk ;

%put &=sysencoding ;

data euro ;
  str="€1" ; *Euro symbol takes 3 bytes, so in utf-8 this string is 4 bytes long;
  put str= $hex12. ;
run ;

proc contents data=euro ;
run ;

proc export data=euro
            outfile="&test./test.xlsx" 
            dbms=xlsx 
            replace;
    sheet=euro;
run;

*XLSX engine reads str as $4 . It must look at each character and assess how many bytes it needs, or use excel metadata to determine length;
libname xlsx xlsx "&test./test.xlsx";
proc contents data=xlsx.euro varnum;
run;

data _null_ ;
  set xlsx.euro ;
  put str= $hex12. ;
run ; 
libname xlsx  clear ;

*Excel engine reads str as $6, because it gives 3 bytes per character ;
libname excel excel "&test./test.xlsx";
proc contents data=excel.euro varnum;
run;

data _null_ ;
  set excel.euro ;
  put str= $hex12. ;
run ; 
libname excel clear ;
BASUG is hosting free webinars Next up: Mark Keintz presenting History Carried Forward, Future Carried Back: Mixing Time Series of Differing Frequencies on May 8. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 878 views
  • 4 likes
  • 6 in conversation