Solved: Find the byte lenght of a string in a table with different encoding

heffo · Posted 06-18-2019 07:59 PM

I'm moving some data sets from one server (Windows) to another (Linux) and I have some issues with encoding. I can still read them with the data step setting (encoding=asciiany).

I would like to re-encode the data sets, permanently but the issue is that there will be potential truncation of the strings if I just do a new data set with the new encoding without changing the string lengths. Even if I doubled the length of all the string variables, I would not be able to be 100 % sure that I would not get any truncation of the data (encoding could be up to four bytes per character). Plus the fact that the data set would take up much more space than necessary.

I have code to go through a data set and check for the longest string, but it only counts characters, as in lengthn(myVariable). I would like to know the length in bytes for this specific data set.

Thank,

H

Ksharp · Posted 06-19-2019 09:22 AM

Try CVP engine.

libname x cvp 'c:\temp' CVPMULTIPLIER=2;

proc copy ..........

Avoiding Character Data Truncation by Using the CVP Engine
When you specify the ENCODING= data set option, the encoding for the output data set might require more space than the original data set. For example, when writing DBCS data in a Windows environment using the UTF8 encoding, each DBCS character might require three bytes. To avoid data truncation, each variable must have a width that is 1.5 times greater than the width of the original data.

When you process a SAS data file that requires transcoding, you can request that the CVP (character variable padding) engine expand character variable lengths so that character data truncation does not occur. (A variable's length is the number of bytes used to store each of the variable's values.)

Character data truncation can occur when the number of bytes for a character in one encoding is different from the number of bytes for the same character in another encoding, such as when a single-byte character set (SBCS) is transcoded to a double-byte character set (DBCS) or to a multi-byte character set (MBCS). An SBCS represents each character in one byte, and a DBCS represents each character in two bytes. An MBCS represents characters in a varying length from one to four bytes. For example, when transcoding from Wlatin2 to a Unicode encoding, such as UTF-8, the variable lengths (in bytes) might not be sufficient to hold the values, and the result is character data truncation.

Using the CVP engine, you specify an expansion amount so that variable lengths are expanded before transcoding, then the data is processed. Think of the CVP engine as an intermediate engine that is used to prepare the data for transcoding. After the lengths are increased, the primary engine, such as the default base engine, is used to do the actual file processing.

The CVP engine is a read-only engine for SAS data files only. You can request character variable expansion (for example with the LIBNAME statement) in either of the following ways:

explicitly specify the CVP engine and using the default expansion of 1.5 times the variable lengths.

implicitly specifying the CVP engine with the LIBNAME options CVPBYTES= or CVPMULTIPLIER=. The options specify the expansion amount. In addition, you can use the CVPENGINE= option to specify the primary engine to use for processing the SAS file; the default is the default SAS engine.

For example, the following LIBNAME statement explicitly assigns the CVP engine. Character variable lengths are increased using the default expansion, which multiples the lengths by 1.5. For example, a character variable with a length of 10 will have a new length of 15, and a character variable with a length of 100 will have a new length of 150:

libname expand cvp 'SAS data-library';

Note: The expansion amount must be large enough to accommodate any expansion; otherwise, truncation will still occur. [cautionend]

Note: For processing that conditionally selects a subset of observations by using a WHERE expression, using the CVP engine might affect performance. Processing the file without using the CVP engine might be faster than processing the file using the CVP engine. For example, if the data set has indexes, the indexes will not be used in order to optimize the WHERE expression if you use the CVP engine. [cautionend]

For more information and examples, see the CVP options in the LIBNAME Statement in SAS Language Reference: Dictionary.

space

View solution in original post

Tom · Posted 06-18-2019 10:24 PM

LENGTHN() will return the length in BYTES. But only for the encoding that your are currently using.

You could try using the KCVT() function to convert the character string to another encoding and check the length of the result.

Ksharp · Posted 06-19-2019 09:22 AM

Try CVP engine.

libname x cvp 'c:\temp' CVPMULTIPLIER=2;

proc copy ..........