BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
heffo
Pyrite | Level 9

I'm moving some data sets from one server (Windows) to another (Linux) and I have some issues with encoding. I can still read them with the data step setting (encoding=asciiany). 

 

I would like to re-encode the data sets, permanently but the issue is that there will be potential truncation of the strings if I just do a new data set with the new encoding without changing the string lengths. Even if I doubled the length of all the string variables, I would not be able to be 100 % sure that I would not get any truncation of the data (encoding could be up to four bytes per character). Plus the fact that the data set would take up much more space than necessary. 

 

I have code to go through a data set and check for the longest string, but it only counts characters, as in lengthn(myVariable). I would like to know the length in bytes for this specific data set. 

 

Thank,

H

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Try CVP engine.

 

libname x cvp 'c:\temp' CVPMULTIPLIER=2;

proc copy ..........

 

 

Avoiding Character Data Truncation by Using the CVP Engine
When you specify the ENCODING= data set option, the encoding for the output data set might require more space than the original data set. For example, when writing DBCS data in a Windows environment using the UTF8 encoding, each DBCS character might require three bytes. To avoid data truncation, each variable must have a width that is 1.5 times greater than the width of the original data. 

When you process a SAS data file that requires transcoding, you can request that the CVP (character variable padding) engine expand character variable lengths so that character data truncation does not occur. (A variable's length is the number of bytes used to store each of the variable's values.) 

Character data truncation can occur when the number of bytes for a character in one encoding is different from the number of bytes for the same character in another encoding, such as when a single-byte character set (SBCS) is transcoded to a double-byte character set (DBCS) or to a multi-byte character set (MBCS). An SBCS represents each character in one byte, and a DBCS represents each character in two bytes. An MBCS represents characters in a varying length from one to four bytes. For example, when transcoding from Wlatin2 to a Unicode encoding, such as UTF-8, the variable lengths (in bytes) might not be sufficient to hold the values, and the result is character data truncation. 

Using the CVP engine, you specify an expansion amount so that variable lengths are expanded before transcoding, then the data is processed. Think of the CVP engine as an intermediate engine that is used to prepare the data for transcoding. After the lengths are increased, the primary engine, such as the default base engine, is used to do the actual file processing. 

The CVP engine is a read-only engine for SAS data files only. You can request character variable expansion (for example with the LIBNAME statement) in either of the following ways:

explicitly specify the CVP engine and using the default expansion of 1.5 times the variable lengths.

implicitly specifying the CVP engine with the LIBNAME options CVPBYTES= or CVPMULTIPLIER=. The options specify the expansion amount. In addition, you can use the CVPENGINE= option to specify the primary engine to use for processing the SAS file; the default is the default SAS engine.

For example, the following LIBNAME statement explicitly assigns the CVP engine. Character variable lengths are increased using the default expansion, which multiples the lengths by 1.5. For example, a character variable with a length of 10 will have a new length of 15, and a character variable with a length of 100 will have a new length of 150: 

libname expand cvp 'SAS data-library';

Note:   The expansion amount must be large enough to accommodate any expansion; otherwise, truncation will still occur.   [cautionend]

Note:   For processing that conditionally selects a subset of observations by using a WHERE expression, using the CVP engine might affect performance. Processing the file without using the CVP engine might be faster than processing the file using the CVP engine. For example, if the data set has indexes, the indexes will not be used in order to optimize the WHERE expression if you use the CVP engine.  [cautionend]

For more information and examples, see the CVP options in the LIBNAME Statement in SAS Language Reference: Dictionary.


space 

View solution in original post

3 REPLIES 3
Tom
Super User Tom
Super User

LENGTHN() will return the length in BYTES.  But only for the encoding that your are currently using.

You could try using the KCVT() function to convert the character string to another encoding and check the length of the result.

 

Ksharp
Super User

Try CVP engine.

 

libname x cvp 'c:\temp' CVPMULTIPLIER=2;

proc copy ..........

 

 

Avoiding Character Data Truncation by Using the CVP Engine
When you specify the ENCODING= data set option, the encoding for the output data set might require more space than the original data set. For example, when writing DBCS data in a Windows environment using the UTF8 encoding, each DBCS character might require three bytes. To avoid data truncation, each variable must have a width that is 1.5 times greater than the width of the original data. 

When you process a SAS data file that requires transcoding, you can request that the CVP (character variable padding) engine expand character variable lengths so that character data truncation does not occur. (A variable's length is the number of bytes used to store each of the variable's values.) 

Character data truncation can occur when the number of bytes for a character in one encoding is different from the number of bytes for the same character in another encoding, such as when a single-byte character set (SBCS) is transcoded to a double-byte character set (DBCS) or to a multi-byte character set (MBCS). An SBCS represents each character in one byte, and a DBCS represents each character in two bytes. An MBCS represents characters in a varying length from one to four bytes. For example, when transcoding from Wlatin2 to a Unicode encoding, such as UTF-8, the variable lengths (in bytes) might not be sufficient to hold the values, and the result is character data truncation. 

Using the CVP engine, you specify an expansion amount so that variable lengths are expanded before transcoding, then the data is processed. Think of the CVP engine as an intermediate engine that is used to prepare the data for transcoding. After the lengths are increased, the primary engine, such as the default base engine, is used to do the actual file processing. 

The CVP engine is a read-only engine for SAS data files only. You can request character variable expansion (for example with the LIBNAME statement) in either of the following ways:

explicitly specify the CVP engine and using the default expansion of 1.5 times the variable lengths.

implicitly specifying the CVP engine with the LIBNAME options CVPBYTES= or CVPMULTIPLIER=. The options specify the expansion amount. In addition, you can use the CVPENGINE= option to specify the primary engine to use for processing the SAS file; the default is the default SAS engine.

For example, the following LIBNAME statement explicitly assigns the CVP engine. Character variable lengths are increased using the default expansion, which multiples the lengths by 1.5. For example, a character variable with a length of 10 will have a new length of 15, and a character variable with a length of 100 will have a new length of 150: 

libname expand cvp 'SAS data-library';

Note:   The expansion amount must be large enough to accommodate any expansion; otherwise, truncation will still occur.   [cautionend]

Note:   For processing that conditionally selects a subset of observations by using a WHERE expression, using the CVP engine might affect performance. Processing the file without using the CVP engine might be faster than processing the file using the CVP engine. For example, if the data set has indexes, the indexes will not be used in order to optimize the WHERE expression if you use the CVP engine.  [cautionend]

For more information and examples, see the CVP options in the LIBNAME Statement in SAS Language Reference: Dictionary.


space 
heffo
Pyrite | Level 9

Thank you for the help. In the end I used the CVP and some other thing. I'm adding my code for others to steal! 

 

%macro fixEncoding(fromLib);
	%macro dummy();%mend dummy; *Fix syntax highlighting in the macro. Stupid but it works!;

	options dlcreatedir; *To automatically create the sub folder in the TempLib.;
	*Get the folder path from the library that we are going to copy from as well as the work libary. Keep the temporary files contained.;
	libname INLIB cvp "%sysfunc(pathname(&fromLib))"  CVPMULTIPLIER=2;
	libname TEMPLIB "%sysfunc(pathname(WORK))/temp" outencoding='UTF-8';

	*Remove all the tables if there are any in the TempLibary;
	%let tablesToKill=;
	proc sql noprint ;
        select 	memname into : tablesToKill separated by " " 
        from dictionary.tables
        where upcase(libname) = upcase("TEMPLIB");
	quit;
	*I do not like warnings in the log, if the TempLibary is empty and we try to remove all tables in there, we will get a warning.;
	%if "&tablesToKill" ne "" %then %do;
		proc datasets kill lib=TEMPLIB nodetails nolist ;
		run;
	%end;
	*Copy only the tables from the libary to our TempLibrary, but with the correct encoding and if needed longer string variables.;
	%put INFO: About to fix the encoding and save the data sets from &fromLib to templib.;
	proc copy noclone in=&fromLib out=TEMPLIB memtype=data;
	run;

	*Replace the data sets in the original library with the fixed data sets.;
	%put INFO: Going to copy the data sets back to &fromLib;
	proc copy in=TEMPLIB out=&fromLib ;
	run;

	%put INFO: Checking that we do not get any comments in the log about;
	%put INFO- Data file X.DATA is in a format that is native to another host, or the file encoding does;
    %put INFO- not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and ;
    %put INFO- might reduce performance.;
	proc sql noprint ;
        create table all_tables as
        select 	libname,
        		memname 	as SAS_table_name 		label="Table name",
        		memlabel 	as table_label 			label="Table lable",
        		nobs 		as no_of_rows  			label="Number of observations in the table" format=nlnum15.,
        		nvar 		as no_of_vars 			label="Number of variables in the table", 		
        		crdate 		as creation_dt_n 		label="Creation date" 						format=datetime19.,
        		modate 		as modification_dt_n  	label="Modification date" 					format=datetime19.
         from dictionary.tables
        where upcase(libname) = upcase ("&fromLib");
	quit;
%mend fixEncoding;

%fixEncoding(COMP);

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1950 views
  • 2 likes
  • 3 in conversation