Hi everybody
I have a problem with reading external files, and I hope that somebody with a better understanding of encoding issues can help me out. The files are exported as delimited files from an external system outside our control. They are UTF-8 encoded, but seem to contain some strings with a different encoding. We use SAS 9.4M5 Linux GRID, with system encoding=Latin9.
I made a small test file (attached). It looks like this in the vi editor, and that is what I want as output:
"813"#"Afsluttet"#"Klavs Hansen"#"Elev ½ tid)"
"445"#"I gang"#"UU Nordvestsjælland"#"SSH´er"
"427"#"Afbrudt"#"Systemoverførsel"#"VVS´er"
When I read it into SAS with UTF-8 encoding, I get the national characters æ and ø in field 3 correct, but run into problems with the special characters in field 4:
/* Path to the test file */
%let file = /sasdata/udvk/data_beskyt/ungevejledning_beskyt/1_grunddata/eksterne_filer/test.txt;
/* Declare the file as UTF-8 so SAS transcodes it to the session encoding (Latin9) */
filename ind "&file" encoding="utf-8";

data test;
  infile ind dsd dlm="#" truncover;
  informat id 8. status $char30. source $char30. udd $char30.;
  input id status source udd;
run;
NOTE: The infile IND is:
Filename=/sasdata/udvk/data_beskyt/ungevejledning_beskyt/1_grunddata/eksterne_filer/test.txt,
Owner Name=sasbatch,
Group Name=torg-odk-sas9-etl,
Access Permission=-rw-r--r--,
Last Modified=01. april 2019 18:20:34,
File Size (bytes)=141
WARNING: A character that could not be transcoded has been replaced in record 1.
WARNING: A character that could not be transcoded has been replaced in record 2.
WARNING: A character that could not be transcoded has been replaced in record 3.
NOTE: 3 records were read from the infile IND.
The minimum record length was 43.
The maximum record length was 46.
NOTE: The data set WORK.TEST has 3 observations and 4 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds
Note that the problematic characters are lost: they are all replaced with hex 1A.
The characters are two bytes each, with hex values C2BD or C2B4 in this example. There are several others in the real data, but all of the same type.
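As a cross-check outside SAS, the two byte values can be tested with a minimal Python sketch. Python is not part of our setup, and the names `SAMPLES`, `can_encode_latin9` and `describe` are made up for illustration; `iso8859_15` is Python's codec name for Latin9. The sketch decodes each two-byte value as UTF-8 and then checks whether the resulting character exists in Latin9. If the decode succeeds but the Latin9 check fails, that would suggest the bytes are valid UTF-8 for characters that Latin9 cannot represent, which may be why they end up as hex 1A. That last step is only an assumption about what SAS does.

```python
# Byte values seen in field 4 of the test file (hex C2BD and C2B4).
SAMPLES = {"C2BD": bytes.fromhex("C2BD"), "C2B4": bytes.fromhex("C2B4")}


def can_encode_latin9(text: str) -> bool:
    """Return True if every character in text exists in ISO-8859-15 (Latin9)."""
    try:
        text.encode("iso8859_15")
        return True
    except UnicodeEncodeError:
        return False


def describe(raw: bytes) -> tuple:
    """Decode raw as UTF-8 and report whether the result fits in Latin9."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if raw is not valid UTF-8
    return text, can_encode_latin9(text)


if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        text, fits = describe(raw)
        status = "fits in Latin9" if fits else "no Latin9 equivalent"
        print(label, f"U+{ord(text):04X}", status)
    # The national characters from field 3, for comparison.
    print("æ and ø fit in Latin9:", can_encode_latin9("æø"))
```

The same check could be run on any other byte pair that turns up in the real data, simply by adding it to `SAMPLES`.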
If I read the files with system encoding (latin9), I get the two bytes, so they could be handled in the program, but I also get all the valid UTF-8 characters that way, as in the following example.
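To make the effect of reading with the system encoding concrete outside SAS, here is a small Python sketch (again not part of our setup; `LINE` and `read_as_latin9` are hypothetical names, and `iso8859_15` is Python's codec name for Latin9). It takes the second record of the test file, encodes it as UTF-8, and then decodes the bytes as Latin9, so every byte becomes one character. It only illustrates the byte-level behaviour and makes no claim about how SAS handles it internally.

```python
# Second record of the test file, as it should look.
LINE = '"445"#"I gang"#"UU Nordvestsjælland"#"SSH´er"'


def read_as_latin9(utf8_bytes: bytes) -> str:
    """Decode UTF-8 bytes as if they were ISO-8859-15: one character per byte."""
    return utf8_bytes.decode("iso8859_15")


if __name__ == "__main__":
    raw = LINE.encode("utf-8")
    wrong = read_as_latin9(raw)
    # Each two-byte UTF-8 sequence shows up as two Latin9 characters.
    print("characters in original:", len(LINE))
    print("characters after Latin9 read:", len(wrong))
    print(wrong)
    # No bytes are lost, so the original text can still be recovered.
    print(wrong.encode("iso8859_15").decode("utf-8") == LINE)
```

The point of the round trip at the end is that a Latin9 read keeps all bytes intact, valid national characters included, which is why every two-byte sequence would then have to be handled in the program.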
The files are large (millions of records) and delivered daily, and every now and then we will get new two-byte characters, both valid and invalid in UTF-8. It would be an endless maintenance task to read the files with Latin9 encoding and identify and change all two-byte characters, so that is not really an option.
But because the vi editor can display all characters correctly, I think it should be possible in SAS also, so I must be missing something. All suggestions will be highly appreciated.