ErikLund_Jensen
Rhodochrosite | Level 12

Hi everybody

 

I have a problem with reading external files, and I hope that somebody with a better understanding of encoding issues can help me out. The files are exported as delimited files from an external system outside our control; they are UTF-8 encoded, but contain some strings in a different encoding. We use SAS 9.4M5 on a Linux GRID with system encoding Latin9.

 

I made a small test file (attached). It looks like this in the vi editor, and this is what I want as output:

"813"#"Afsluttet"#"Klavs Hansen"#"Elev ½ tid)"
"445"#"I gang"#"UU Nordvestsjælland"#"SSH´er"
"427"#"Afbrudt"#"Systemoverførsel"#"VVS´er"

 

When I read it into SAS with UTF-8 encoding, I get the national characters æ and ø in field 3 correctly, but run into problems with the special characters in field 4:

 

%let file = /sasdata/udvk/data_beskyt/ungevejledning_beskyt/1_grunddata/eksterne_filer/test.txt;
filename ind "&file" encoding="utf-8";
data test; 
	infile ind dsd dlm="#" truncover;
	informat id 8. status $char30. source $char30. udd $char30.;
	input id status source udd;
run;
NOTE: The infile IND is:
      Filename=/sasdata/udvk/data_beskyt/ungevejledning_beskyt/1_grunddata/eksterne_filer/test.txt,
      Owner Name=sasbatch,
      Group Name=torg-odk-sas9-etl,
      Access Permission=-rw-r--r--,
      Last Modified=01. april 2019 18:20:34,
      File Size (bytes)=141

WARNING: A character that could not be transcoded has been replaced in record 1.
WARNING: A character that could not be transcoded has been replaced in record 2.
WARNING: A character that could not be transcoded has been replaced in record 3.
NOTE: 3 records were read from the infile IND.
      The minimum record length was 43.
      The maximum record length was 46.
NOTE: The data set WORK.TEST has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds

Note that the problematic characters are lost; they are all replaced with hex 1A (the ASCII SUB character).

[screenshot: enc1.gif]

The problematic characters are double-byte, with hex values C2BD or C2B4 in this example. There are several others in the real data, but all of the same type.

 

If I read the files with the system encoding (latin9), I get the two raw bytes, so they could be handled in the program, but then all the valid UTF-8 characters also come through as raw byte pairs, as in the following example.

[screenshot: enc2.gif]

The files are large, millions of records, and delivered daily, and every now and then we will get new double-byte characters, both valid and invalid in UTF-8, so it would be an endless maintenance task to read the files with latin encoding and identify and change all double-byte characters. So that is not really an option.

 

But because the vi editor can display all the characters correctly, I think it should be possible in SAS also, so I must be missing something. All suggestions will be highly appreciated.


5 REPLIES
Tom
Super User

You need to run SAS using UTF-8 encoding if you want to be able to use UTF-8 text as character values. Otherwise SAS will try to convert the text to match the encoding you are using.

 

So is the source file encoded with UTF-8 or not?

If not, what encoding is it using?

What do you see in the file when you read it as binary?

data _null_;
  infile "&file" recfm=f lrecl=50 obs=10;
  input;
  list;
run;

You could try reading the file as raw bytes and parsing on the # and end-of-line characters yourself.

Perhaps convert the text to $HEX representations so you can work with it more easily in a single-byte encoding session without causing errors.
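
A minimal sketch of that idea (not from the original reply; the data set and variable names are just illustrative, and &file is the macro variable from the first post): read the file without an ENCODING= option so the raw bytes pass through untranscoded, and keep a hex copy of the troublesome column.

data raw_hex;
  /* no ENCODING= option, so the UTF-8 byte pairs arrive untouched in a latin9 session */
  infile "&file" dsd dlm='#' truncover;
  length udd_hex $60;
  input id status :$30. source :$30. udd :$30.;
  /* two hex digits per byte: 'SSH´er' -> '535348C2B46572', trailing blanks -> '20' */
  udd_hex = put(udd, $hex60.);
run;

The hex string is plain 7-bit text, so it survives any later transcoding, and INPUT(udd_hex, $hex60.) turns it back into the original bytes when needed.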

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Tom 

 

Thank you. I guess the file is in UTF-8 encoding, because I tried reading it in a UTF-8 encoded session and my data set looks fine, just like the vi screenshot in my first post, but I don't know how to determine the actual encoding.

 

Reading the file as binary in the UTF-8 session gave C2BD and C2B4, just like my latin9 read:

 

1 CHAR "813"#"Afsluttet"#"Klavs Hansen"#"Elev ½ tid)"."4
ZONE 233322246767776722246677246676622246672CB276622023
NUMR 2813232163C54454232BC163081E35E2325C5602D049492A24

 

2 CHAR 45"#"I gang"#"UU Nordvestsjælland"#"SSH´er"."427
ZONE 332224266662225524676767776CA66666222554CB67202333
NUMR 452329071E7232550EF2465343A36CC1E423233824522A2427

 

But now I am stuck again, because the same problem occurs when I try to get the data into a latin9-encoded session. If I copy from the UTF-8 encoded session with outencoding=latin9, I just get "Some character data was lost during transcoding", and the step terminates abnormally.

 

Our batch runs in Latin9. Of course it is possible to code the batch jobs to spawn other processes in UTF-8 to read the files, but the data goes into a big data warehouse as input to an ETL process involving other data sets in latin9 encoding, so the data has to be brought into a form usable from a latin9 session before that.

Tom
Super User

You are not going to be able to store more than 256 distinct characters into a system that is using a single byte per character.

If you cannot convince everyone to just go back to 1980 and only use 7 bit ASCII codes then you will probably need to convert your database to use UTF-8.

In the meantime you could try to build a process that will ingest the UTF-8 files into LATIN encoding and either reject the ones that don't convert or just live with the loss of precision in the data. You might be able to store the full UTF-8 text as hex, or possibly use the HTMLENCODE() function, but then the target field will need to be large enough to hold the longer strings that would require.

 

For example, here is code to convert your current 4-column file with # delimiter into a CSV file with the values encoded with the HTMLENCODE() function.

data _null_;
 length x1-x4 $1000;
 infile "&path/utf8_test.txt" dsd dlm='#' truncover;
 file "&path/html_test.csv" dsd ;
 input x1-x3;
 array x x1-x4;
 do over x;
   x=htmlencode(trim(x),'7bit');
 end;
 put x1-x4;
run;

Results:

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6
1         813,Afsluttet,Klavs Hansen, 27
2         445,I gang,UU Nordvestsjælland, 36
3         427,Afbrudt,Systemoverførsel, 34
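
A hedged follow-up, not part of the reply above: if those &#nnn; entities later have to become real characters again, a session running with ENCODING=UTF-8 could reverse the step with the HTMLDECODE() function. The file name and variable names are simply the ones from the example.

data decoded;
  length x1-x4 $1000;
  infile "&path/html_test.csv" dsd dlm=',' truncover;
  input x1-x4;
  array x x1-x4;
  do over x;
    /* turn &#nnn; numeric character references back into the characters they encode */
    x = htmldecode(x);
  end;
run;
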
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Tom 

 

Thanks for your suggestions.

 

Unfortunately, your last code example gives the same problem when I run it. I get:

WARNING: A character that could not be transcoded has been replaced in record 1.
813 Afsluttet Klavs Hansen
WARNING: A character that could not be transcoded has been replaced in record 2.
445 I gang UU Nordvestsjælland
WARNING: A character that could not be transcoded has been replaced in record 3.
427 Afbrudt Systemoverførsel

But your code only reads x1-x3. It translates the Danish national characters in x3 to HTML entities, but those don't cause problems. If I change it to input x1-x4, the problematic characters are all translated to a small right arrow.

WARNING: A character that could not be transcoded has been replaced in record 1.
813 Afsluttet Klavs Hansen Elev  tid)
WARNING: A character that could not be transcoded has been replaced in record 2.
445 I gang UU Nordvestsjælland SSHer
WARNING: A character that could not be transcoded has been replaced in record 3.
427 Afbrudt Systemoverførsel VVSer

Without translation I get my Danish characters right, but the same result in x4:

WARNING: A character that could not be transcoded has been replaced in record 1.
813 Afsluttet Klavs Hansen Elev  tid)
WARNING: A character that could not be transcoded has been replaced in record 2.
445 I gang UU Nordvestsjælland SSHer
WARNING: A character that could not be transcoded has been replaced in record 3.
427 Afbrudt Systemoverførsel VVSer

 

Seven-bit is not an option, because it does not cover the full Danish alphabet, nor is changing our whole setup to Unicode (30,000 SAS tables with 4,000+ batch jobs plus a huge amount of EG projects and user data), so I followed your suggestion and tried to build a process that ingests the UTF-8 files into Latin encoding.

 

I used iconv and tried to convert to ISO8859-15, which is supposed to be the system encoding, but that didn't work either, so I tried ISO8859-1 instead. Then everything worked, and I got my SAS data without losing any characters. So now my jobs are up and running with a call to iconv before the data steps.
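
A minimal sketch of such a pre-conversion step (assuming a Linux host with iconv on the path; the fileref and the //TRANSLIT suffix are illustrative, not the exact production call, and &file is the macro variable from the first post):

/* Let iconv rewrite the UTF-8 file as ISO8859-1 before SAS reads it.          */
/* //TRANSLIT asks iconv to approximate characters with no Latin-1 equivalent. */
filename cnv pipe "iconv -f UTF-8 -t ISO8859-1//TRANSLIT &file";

data test;
  infile cnv dsd dlm='#' truncover;
  informat id 8. status $char30. source $char30. udd $char30.;
  input id status source udd;
run;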

 

So case closed, thanks to your suggestion.

Tom
Super User

That's a good solution. Let iconv deal with the invalid codes and generate a readable file for the encoding that your SAS session and database are using.
