Solved: Re: I'm lost with import of tab delimited txt file as special characte...

acordes · Posted 06-11-2024 06:06 AM

I've never really made friends with the import of txt files. And luckily I can handle 99% of my data intake through a DB connector with Oracle, so normally I can forget about my shortcomings.

The problem of wrongly interpreted special characters is already present in the txt file.

But it gets worse as the lines with special characters do not get assigned the correct column. It gets displaced after encountering a special character like 'ó' and the text is broken at this position.

the string which cause such behavior is:

ComunicaciÃ³n 1 aÃ±o
Comunicación 1 año

FILENAME REFFILE FILESRVC FOLDERPATH='/Projects/crm reporting/Encuestas'  FILENAME='DV_Resultado_Campaña_ac.txt' encoding='wlatin1';
data as.DV_RES_CAMP;
    %let _EFIERR_ = 0;
    infile REFFILE delimiter='09'x dsd MISSOVER lrecl=32767 firstobs=2;
    informat JourneyName $80.;
    informat EmailName $100.;
    informat Mes_Envio best32.;
    informat num_envio best32.;
    informat num_open best32.;
    informat num_click best32.;
    informat num_Unsubscribe best32.;
    informat num_caso_renovar best32.;
    informat num_caso_refinanciar best32.;
    informat num_caso_pagaralvencimiento best32.;
    informat num_caso_devolver best32.;
    informat Num_Caso_masinfo best32.;
    informat Num_Caso_QuedarseloFF best32.;
    informat num_caso_Ampliar best32.;
    informat num_Harbounce best32.;
    informat num_Softbounce best32.;
    informat click_renovar best32.;
    informat click_devolver best32.;
    informat click_me_lo_quedo best32.;
    informat Click_Refinanciar best32.;
    informat Click_PagarAlVencimiento best32.;
    informat click_info best32.;
    informat click_portal best32.;
    informat click_baja best32.;
    informat click_ampliar best32.;
    informat Click_QuedarseloFF best32.;
    informat click_encuesta best32.;
    informat click_oferta best32.;
    informat Click_ViewM best32.;
    informat Marca $10.;
    format JourneyName $80.;
    format EmailName $100.;
    format Mes_Envio best12.;
    format num_envio best12.;
    format num_open best12.;
    format num_click best12.;
    format num_Unsubscribe best12.;
    format num_caso_renovar best12.;
    format num_caso_refinanciar best12.;
    format num_caso_pagaralvencimiento best12.;
    format num_caso_devolver best12.;
    format Num_Caso_masinfo best12.;
    format Num_Caso_QuedarseloFF best12.;
    format num_caso_Ampliar best12.;
    format num_Harbounce best12.;
    format num_Softbounce best12.;
    format click_renovar best12.;
    format click_devolver best12.;
    format click_me_lo_quedo best12.;
    format Click_Refinanciar best32.;
    format Click_PagarAlVencimiento best32.;
    format click_info best12.;
    format click_portal best12.;
    format click_baja best12.;
    format click_ampliar best12.;
    format Click_QuedarseloFF best12.;
    format click_encuesta best32.;
    format click_oferta best12.;
    format Click_ViewM best12.;
    format Marca $10.;
    input
        JourneyName $
        EmailName $
        Mes_Envio
        num_envio
        num_open
        num_click
        num_Unsubscribe 
        num_caso_renovar 
        num_caso_refinanciar
        num_caso_pagaralvencimiento 
		num_caso_devolver  
		Num_Caso_masinfo  
		Num_Caso_QuedarseloFF  
		num_caso_Ampliar  
		num_Harbounce
		num_Softbounce
		click_renovar
		click_devolver
		click_me_lo_quedo
		Click_Refinanciar
   		Click_PagarAlVencimiento 
		click_info
		click_portal
		click_baja
		click_ampliar
		Click_QuedarseloFF
		click_encuesta
		click_oferta
		Click_ViewM
		Marca  $
;
if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */
run;

Tom · Posted 06-11-2024 09:34 AM

So you have two different variations on spelling of that Communication word. One is using UTF-8 encoding and one is using some single byte encoding, such as WLATIN1. You can see that with this test program:

data test;
  str1='ciÃ³n';
  str2='ción';
  length str1_wlatin str2_utf8 $8;
  str1_wlatin=kcvt(str1,'utf-8','wlatin1');
  str2_utf8=kcvt(str2,'wlatin1','utf-8');
  put (str:) (=/) (str:) (/$hex.);
run;

Results

str1=ciÃ³n
str2=ción
str1_wlatin=ción
str2_utf8=ciÃ³n
6369C3B36E
6369F36E
6369F36E20202020
6369C3B36E202020

It will probably be best to read this file using a SAS session running with system option ENCODING set to UTF-8. (You can store any character from a single byte encoding into UTF-8 encoding, but you are not assured of being able to go the other way.)

So first read the file use ENCODING=ANY on the FILENAME statement so that SAS does not try to transcode the characters while reading them into the variables.

Then see if you can detect which encoding each observation is using. For example by checking for the 'C3'X byte that UTF-8 uses. Then use KCVT() function to convert the WLATIN1 strings into UTF8 strings.

If you are positive there a no characters in the UTF-8 strings that cannot be transcoded into WLATIN1 you could run with system ENCODING setting set to WLATIN1 and transcode the UTF-8 strings to WLATIN1.

View solution in original post

SASJedi · Posted 06-11-2024 07:51 AM

This code appears to have been generated by PROC IMPORT. I suspect that the input file includes some double-byte characters - in may be UTF-8 encoded instead of WLATIN-1. Try changing the FILENAME statement to specify UTF8:

FILENAME REFFILE FILESRVC FOLDERPATH='/Projects/crm reporting/Encuestas'  FILENAME='DV_Resultado_Campaña_ac.txt' encoding='UTF8';

Then re-run the code. Did that resolve the issue?

Check out my Jedi SAS Tricks for SAS Users

acordes · Posted 06-11-2024 07:59 AM

Thanks for your reply.

No, it does not solve the issue.

I thought about generating a data step that rebuilds the data in question. I got it from you some years ago 🙂

But I think that it wouldn't help because the table once in SAS has already the issues I'm dealing with...

I paste the txt file with some of the troublemaker lines...

FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	ComunicaciÃ³n 1 aÃ±o	202303	2120	1329	356									10	11	0	0	0			0	9	3	0		0	0		#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	ComunicaciÃ³n 1 aÃ±o	202304	2843	1755	511									13	44	0	0	0			0	10	0	0		0	0		#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	ComunicaciÃ³n 1 aÃ±o	202305	4820	2976	831									27	72	0	0	0			0	13	5	0		0	0		#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	ComunicaciÃ³n 1 aÃ±o	202306	4566	2689	736									29	138	0	0	0			0	12	9	0		0	0		#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202303	794	476	209									1	3	0	0	0			0	4	1	0		0	0		#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202405	4393	6385	1388	7	0	0	0	0	0	0	0	47	7	0	0	0	0	0	0	0	4	0	0	1381	0	3	#N/D
FINANCE_ES_RT_VWFS_ENCUESTA_CONTRATACION_ACUM	Encuesta_Survey_Contratacion_Entrega	202402	973	1313	257	0	0	0	0	0	0	0	0	27	4	0	0	0	0	0	0	0	4	0	0	247	0	6	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202310	736	948	152	0	0	0	0	0	0	0	0	10	1	0	0	0	0	0	0	0	1	0	0	141	0	10	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202311	448	630	125	0	0	0	0	0	0	0	0	4	1	0	0	0	0	0	0	0	1	0	0	118	0	6	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202312	48	63	16	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	16	0	0	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202401	168	195	33	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	32	0	1	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202402	907	1171	214	0	0	0	0	0	0	0	0	6	3	0	0	0	0	0	0	0	0	0	0	206	0	8	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202403	1043	1346	243	0	0	0	0	0	0	0	0	9	2	0	0	0	0	0	0	0	4	0	0	226	0	13	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202404	808	1031	186	0	0	0	0	0	0	0	0	6	1	0	0	0	0	0	0	0	1	0	0	172	0	13	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO	Comunicación 1 año	202405	725	921	139	0	0	0	0	0	0	0	0	10	2	0	0	0	0	0	0	0	1	0	0	135	0	3	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_ANIVERSARIO_ACUM	Comunicación 1 año	202402	141	277	61	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	61	0	0	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202310	493	672	127	0	0	0	0	0	0	0	0	9	2	0	0	0	0	0	0	0	2	0	0	122	0	3	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202311	211	312	67	0	0	0	0	0	0	0	0	3	0	0	0	0	0	0	0	0	0	0	0	65	0	2	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202401	126	160	19	0	0	0	0	0	0	0	0	2	0	0	0	0	0	0	0	0	0	0	0	18	0	1	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202402	406	555	99	0	0	0	0	0	0	0	0	4	2	0	0	0	0	0	0	0	3	0	0	93	0	3	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202403	835	927	151	1	0	0	0	0	0	0	0	15	1	0	0	0	0	0	0	0	3	0	0	143	0	5	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202404	705	756	154	3	0	0	0	0	0	0	0	9	0	0	0	0	0	0	0	0	11	0	0	126	0	17	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION	Encuesta_Survey_Contratacion_Entrega	202405	733	797	175	1	0	0	0	0	0	0	0	9	0	0	0	0	0	0	0	0	4	0	0	160	0	9	#N/D
RENTING_ES_RT_VWFS_ENCUESTA_CONTRATACION_ACUM	Encuesta_Survey_Contratacion_Entrega	202402	24	52	7	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	7	0	0	#N/D

Kurt_Bremser · Posted 06-11-2024 08:57 AM

Looks like that column is sometimes stored as UTF-8, and sometimes as WLATIN1. This may need some tricky coding: read the data as raw as possible, maybe with a $CHAR informat, and then check for hex codes to identify WLATIN1 characters for conversion to UTF-8.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 06-11-2024 09:34 AM

So you have two different variations on spelling of that Communication word. One is using UTF-8 encoding and one is using some single byte encoding, such as WLATIN1. You can see that with this test program:

data test;
  str1='ciÃ³n';
  str2='ción';
  length str1_wlatin str2_utf8 $8;
  str1_wlatin=kcvt(str1,'utf-8','wlatin1');
  str2_utf8=kcvt(str2,'wlatin1','utf-8');
  put (str:) (=/) (str:) (/$hex.);
run;

Results

str1=ciÃ³n
str2=ción
str1_wlatin=ción
str2_utf8=ciÃ³n
6369C3B36E
6369F36E
6369F36E20202020
6369C3B36E202020

It will probably be best to read this file using a SAS session running with system option ENCODING set to UTF-8. (You can store any character from a single byte encoding into UTF-8 encoding, but you are not assured of being able to go the other way.)

So first read the file use ENCODING=ANY on the FILENAME statement so that SAS does not try to transcode the characters while reading them into the variables.

Then see if you can detect which encoding each observation is using. For example by checking for the 'C3'X byte that UTF-8 uses. Then use KCVT() function to convert the WLATIN1 strings into UTF8 strings.

If you are positive there a no characters in the UTF-8 strings that cannot be transcoded into WLATIN1 you could run with system ENCODING setting set to WLATIN1 and transcode the UTF-8 strings to WLATIN1.

acordes · Posted 06-11-2024 09:58 AM

@Tom , please explain the line

  put (str:) (=/) (str:) (/$hex.);

(str:) ok

(=/) carriage return?

(str:) why again?

(/$hex.) apply hex format

But how Do I use the parenthesis () correctly?

Tom · Posted 06-11-2024 11:33 AM

Just read the documentation. That is a variable list followed by a format list.

If you are still confused go read how FORTRAN FORMAT statements worked 50 years ago.

ballardw · Posted 06-11-2024 11:58 AM

@Tom wrote:

Just read the documentation. That is a variable list followed by a format list.

If you are still confused go read how FORTRAN FORMAT statements worked 50 years ago.

FORTRAN format was only 46 years ago for me and haven't written a line of FORTRAN since 1996...

Tom · Posted 06-11-2024 12:11 PM

@acordes wrote:

@Tom , please explain the line
  put (str:) (=/) (str:) (/$hex.);
(str:) ok

(=/) carriage return?

(str:) why again?

(/$hex.) apply hex format

But how Do I use the parenthesis () correctly?

The syntax is (variable_list) (format_list). The first variable has the first format applied, the second uses the second. It there are more variables than entries in the format list the format list is scanned over and over.

(str:) -> Variable list. All variables whose name starts with STR.

(=/) -> Format list. In this case just the = and / modifiers. So the variable values are prefixed by their names and a line break is inserted between each one.

(str:) -> Another variable list. In this case it is the same list as used before.

(/$hex.) -> Another format list. In this case the $HEX format. Again line breaks are inserted.

Try this one:

put (3*str1) ($. $quote. $hex.);

I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Re: I'm lost with import of tab delimited txt file as special characters displace columns

Register Today!

SAS Training: Just a Click Away