I've never really made friends with the import of txt files. And luckily I can handle 99% of my data intake through a DB connector with Oracle, so normally I can forget about my shortcomings.
The problem of wrongly interpreted special characters is already present in the txt file.
But it gets worse as the lines with special characters do not get assigned the correct column. It gets displaced after encountering a special character like 'ó' and the text is broken at this position.
the string which cause such behavior is:
Comunicación 1 año
Comunicación 1 año
FILENAME REFFILE FILESRVC FOLDERPATH='/Projects/crm reporting/Encuestas' FILENAME='DV_Resultado_Campaña_ac.txt' encoding='wlatin1';
data as.DV_RES_CAMP;
%let _EFIERR_ = 0;
infile REFFILE delimiter='09'x dsd MISSOVER lrecl=32767 firstobs=2;
informat JourneyName $80.;
informat EmailName $100.;
informat Mes_Envio best32.;
informat num_envio best32.;
informat num_open best32.;
informat num_click best32.;
informat num_Unsubscribe best32.;
informat num_caso_renovar best32.;
informat num_caso_refinanciar best32.;
informat num_caso_pagaralvencimiento best32.;
informat num_caso_devolver best32.;
informat Num_Caso_masinfo best32.;
informat Num_Caso_QuedarseloFF best32.;
informat num_caso_Ampliar best32.;
informat num_Harbounce best32.;
informat num_Softbounce best32.;
informat click_renovar best32.;
informat click_devolver best32.;
informat click_me_lo_quedo best32.;
informat Click_Refinanciar best32.;
informat Click_PagarAlVencimiento best32.;
informat click_info best32.;
informat click_portal best32.;
informat click_baja best32.;
informat click_ampliar best32.;
informat Click_QuedarseloFF best32.;
informat click_encuesta best32.;
informat click_oferta best32.;
informat Click_ViewM best32.;
informat Marca $10.;
format JourneyName $80.;
format EmailName $100.;
format Mes_Envio best12.;
format num_envio best12.;
format num_open best12.;
format num_click best12.;
format num_Unsubscribe best12.;
format num_caso_renovar best12.;
format num_caso_refinanciar best12.;
format num_caso_pagaralvencimiento best12.;
format num_caso_devolver best12.;
format Num_Caso_masinfo best12.;
format Num_Caso_QuedarseloFF best12.;
format num_caso_Ampliar best12.;
format num_Harbounce best12.;
format num_Softbounce best12.;
format click_renovar best12.;
format click_devolver best12.;
format click_me_lo_quedo best12.;
format Click_Refinanciar best32.;
format Click_PagarAlVencimiento best32.;
format click_info best12.;
format click_portal best12.;
format click_baja best12.;
format click_ampliar best12.;
format Click_QuedarseloFF best12.;
format click_encuesta best32.;
format click_oferta best12.;
format Click_ViewM best12.;
format Marca $10.;
input
JourneyName $
EmailName $
Mes_Envio
num_envio
num_open
num_click
num_Unsubscribe
num_caso_renovar
num_caso_refinanciar
num_caso_pagaralvencimiento
num_caso_devolver
Num_Caso_masinfo
Num_Caso_QuedarseloFF
num_caso_Ampliar
num_Harbounce
num_Softbounce
click_renovar
click_devolver
click_me_lo_quedo
Click_Refinanciar
Click_PagarAlVencimiento
click_info
click_portal
click_baja
click_ampliar
Click_QuedarseloFF
click_encuesta
click_oferta
Click_ViewM
Marca $
;
if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;
So you have two different variations on spelling of that Communication word. One is using UTF-8 encoding and one is using some single byte encoding, such as WLATIN1. You can see that with this test program:
data test;
str1='ción';
str2='ción';
length str1_wlatin str2_utf8 $8;
str1_wlatin=kcvt(str1,'utf-8','wlatin1');
str2_utf8=kcvt(str2,'wlatin1','utf-8');
put (str:) (=/) (str:) (/$hex.);
run;
Results
str1=ción str2=ción str1_wlatin=ción str2_utf8=ción 6369C3B36E 6369F36E 6369F36E20202020 6369C3B36E202020
It will probably be best to read this file using a SAS session running with system option ENCODING set to UTF-8. (You can store any character from a single byte encoding into UTF-8 encoding, but you are not assured of being able to go the other way.)
So first read the file use ENCODING=ANY on the FILENAME statement so that SAS does not try to transcode the characters while reading them into the variables.
Then see if you can detect which encoding each observation is using. For example by checking for the 'C3'X byte that UTF-8 uses. Then use KCVT() function to convert the WLATIN1 strings into UTF8 strings.
If you are positive there a no characters in the UTF-8 strings that cannot be transcoded into WLATIN1 you could run with system ENCODING setting set to WLATIN1 and transcode the UTF-8 strings to WLATIN1.
This code appears to have been generated by PROC IMPORT. I suspect that the input file includes some double-byte characters - in may be UTF-8 encoded instead of WLATIN-1. Try changing the FILENAME statement to specify UTF8:
FILENAME REFFILE FILESRVC FOLDERPATH='/Projects/crm reporting/Encuestas' FILENAME='DV_Resultado_Campaña_ac.txt' encoding='UTF8';
Then re-run the code. Did that resolve the issue?
Thanks for your reply.
No, it does not solve the issue.
I thought about generating a data step that rebuilds the data in question. I got it from you some years ago 🙂
But I think that it wouldn't help because the table once in SAS has already the issues I'm dealing with...
I paste the txt file with some of the troublemaker lines...
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202303 2120 1329 356 10 11 0 0 0 0 9 3 0 0 0 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202304 2843 1755 511 13 44 0 0 0 0 10 0 0 0 0 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202305 4820 2976 831 27 72 0 0 0 0 13 5 0 0 0 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202306 4566 2689 736 29 138 0 0 0 0 12 9 0 0 0 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202303 794 476 209 1 3 0 0 0 0 4 1 0 0 0 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202405 4393 6385 1388 7 0 0 0 0 0 0 0 47 7 0 0 0 0 0 0 0 4 0 0 1381 0 3 #N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION_ACUM Encuesta_Survey_Contratacion_Entrega 202402 973 1313 257 0 0 0 0 0 0 0 0 27 4 0 0 0 0 0 0 0 4 0 0 247 0 6 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202310 736 948 152 0 0 0 0 0 0 0 0 10 1 0 0 0 0 0 0 0 1 0 0 141 0 10 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202311 448 630 125 0 0 0 0 0 0 0 0 4 1 0 0 0 0 0 0 0 1 0 0 118 0 6 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202312 48 63 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 0 0 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202401 168 195 33 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 32 0 1 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202402 907 1171 214 0 0 0 0 0 0 0 0 6 3 0 0 0 0 0 0 0 0 0 0 206 0 8 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202403 1043 1346 243 0 0 0 0 0 0 0 0 9 2 0 0 0 0 0 0 0 4 0 0 226 0 13 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202404 808 1031 186 0 0 0 0 0 0 0 0 6 1 0 0 0 0 0 0 0 1 0 0 172 0 13 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO Comunicación 1 año 202405 725 921 139 0 0 0 0 0 0 0 0 10 2 0 0 0 0 0 0 0 1 0 0 135 0 3 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO_ACUM Comunicación 1 año 202402 141 277 61 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 61 0 0 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202310 493 672 127 0 0 0 0 0 0 0 0 9 2 0 0 0 0 0 0 0 2 0 0 122 0 3 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202311 211 312 67 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 65 0 2 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202401 126 160 19 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 18 0 1 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202402 406 555 99 0 0 0 0 0 0 0 0 4 2 0 0 0 0 0 0 0 3 0 0 93 0 3 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202403 835 927 151 1 0 0 0 0 0 0 0 15 1 0 0 0 0 0 0 0 3 0 0 143 0 5 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202404 705 756 154 3 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 11 0 0 126 0 17 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION Encuesta_Survey_Contratacion_Entrega 202405 733 797 175 1 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 4 0 0 160 0 9 #N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION_ACUM Encuesta_Survey_Contratacion_Entrega 202402 24 52 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 0 0 #N/D
Looks like that column is sometimes stored as UTF-8, and sometimes as WLATIN1. This may need some tricky coding: read the data as raw as possible, maybe with a $CHAR informat, and then check for hex codes to identify WLATIN1 characters for conversion to UTF-8.
So you have two different variations on spelling of that Communication word. One is using UTF-8 encoding and one is using some single byte encoding, such as WLATIN1. You can see that with this test program:
data test;
str1='ción';
str2='ción';
length str1_wlatin str2_utf8 $8;
str1_wlatin=kcvt(str1,'utf-8','wlatin1');
str2_utf8=kcvt(str2,'wlatin1','utf-8');
put (str:) (=/) (str:) (/$hex.);
run;
Results
str1=ción str2=ción str1_wlatin=ción str2_utf8=ción 6369C3B36E 6369F36E 6369F36E20202020 6369C3B36E202020
It will probably be best to read this file using a SAS session running with system option ENCODING set to UTF-8. (You can store any character from a single byte encoding into UTF-8 encoding, but you are not assured of being able to go the other way.)
So first read the file use ENCODING=ANY on the FILENAME statement so that SAS does not try to transcode the characters while reading them into the variables.
Then see if you can detect which encoding each observation is using. For example by checking for the 'C3'X byte that UTF-8 uses. Then use KCVT() function to convert the WLATIN1 strings into UTF8 strings.
If you are positive there a no characters in the UTF-8 strings that cannot be transcoded into WLATIN1 you could run with system ENCODING setting set to WLATIN1 and transcode the UTF-8 strings to WLATIN1.
@Tom , please explain the line
put (str:) (=/) (str:) (/$hex.);
(str:) ok
(=/) carriage return?
(str:) why again?
(/$hex.) apply hex format
But how Do I use the parenthesis () correctly?
Just read the documentation. That is a variable list followed by a format list.
If you are still confused go read how FORTRAN FORMAT statements worked 50 years ago.
@Tom wrote:
Just read the documentation. That is a variable list followed by a format list.
If you are still confused go read how FORTRAN FORMAT statements worked 50 years ago.
FORTRAN format was only 46 years ago for me and haven't written a line of FORTRAN since 1996...
@acordes wrote:
@Tom , please explain the line
put (str:) (=/) (str:) (/$hex.);
(str:) ok
(=/) carriage return?
(str:) why again?
(/$hex.) apply hex format
But how Do I use the parenthesis () correctly?
The syntax is (variable_list) (format_list). The first variable has the first format applied, the second uses the second. It there are more variables than entries in the format list the format list is scanned over and over.
(str:) -> Variable list. All variables whose name starts with STR.
(=/) -> Format list. In this case just the = and / modifiers. So the variable values are prefixed by their names and a line break is inserted between each one.
(str:) -> Another variable list. In this case it is the same list as used before.
(/$hex.) -> Another format list. In this case the $HEX format. Again line breaks are inserted.
Try this one:
put (3*str1) ($. $quote. $hex.);
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.