BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SASdevAnneMarie
Barite | Level 11

Hello Experts,

I can't import the attached file.

I don't know what is wrong with my code, before it worked well.

Now the data is completely wrong from the line 756 (if don't edit the file).

 

If I edit the file (even just delete the lines to make the file more light) the data is COMPLITELY wrong.

The original file is very huge, I can't download it, even with compress, I attached the edited file.

 


data EUROPERF_IMP1;
	dlm=';'||'0D0A'x;
	infile "XXXXXX\Eligibles_c.csv_20200810.csv" dlm=dlm  DSD lrecl=32867 /*termstr=CRLF*/
	RECFM=n;

	informat	CodeIsin	$12.;
	informat	Libelle	$50.;
	informat	nature	$12.;
	informat	categAMF	$32.;
	informat	categEPF	$50.;
	informat	zoneGeo	$40.;
	informat	dureeMinReco $8.;
	informat	affectationResultat	$40.;
	informat	fiscaliteText	$50.;
	informat	deviseComptable	$3.;
	informat	dateCloture	$32.;
	informat	dateOuverture	$32.;
	informat	promoteur	$50.;
	informat	adressePromoteur	$80.;
	informat	steGestion	$70.;
	informat	adresseSteGestion	$80.;
	informat	steAdmDelegue	$80.;
	informat	steFinAdmDelegue	$80.;
	informat	depositaire	$80.;
	informat	adresseDepositaire	$80.;
	informat	commissaireCpte	$50.;
	informat	souscription	$100.;
	informat	rachat	$50.;
	informat	fraisGestion	$70.;
	informat	datePerformance	$12.;
	informat	perf12mois	best32.;
	informat	perf36mois	$5.;
	informat	perf60mois	$5.;
	informat	fraisGestionCommentaire	$256.;
	informat	objectifGestion	$256.;
	informat	souscripteurs	$256.;
	informat	orientationPlacement	$256.;
	informat	dateClotureExercice	$256.;
	informat	periodeVL	$256.;
	informat	sitePublication	$256.;
	informat	conditionSouscripRachat	$256.;
	informat	dateValidation		$12.;
	informat	minimumRisqueAction	$5.;
	informat	dateAgrementAMF		$12.;
	informat	Tfe	$20.;
	informat	perfAm1	$8.;
	informat	perfAm2	$8.;
	informat	perfAm3	$8.;
	informat	perfAm4	$8.;
	informat	perfAm5	$8.;
	informat	perfAm6	$8.;
	informat	perfAm7	$8.;
	informat	perfAm8	$8.;
	informat	UnitedureeMinReco	$7.;
	informat	perfYTD	$5.;
	informat	volat_12mois	$5.;
	informat	frequence_text	$30.;
	informat	prospectus_url	$50.;
	informat	jour_valorisation $8.;
	informat	risque	$6.;
	informat	VIDE	$4.;
	format 	CodeIsin	$12.;
	format 	Libelle	$50.;
	format 	nature	$12.;
	format 	categAMF	$32.;
	format 	categEPF	$50.;
	format 	zoneGeo	$40.;
	format 	dureeMinReco $8.;
	format 	affectationResultat	$40.;
	format 	fiscaliteText	$50.;
	format 	deviseComptable	$3.;
	format 	dateCloture	$32.;
	format 	dateOuverture	$32.;
	format 	promoteur	$50.;
	format 	adressePromoteur	$80.;
	format 	steGestion	$70.;
	format 	adresseSteGestion	$80.;
	format 	steAdmDelegue	$80.;
	format 	steFinAdmDelegue	$80.;
	format 	depositaire	$80.;
	format 	adresseDepositaire	$80.;
	format 	commissaireCpte	$50.;
	format 	souscription	$100.;
	format 	rachat	$5.;
	format 	fraisGestion	$70.;
	format 	datePerformance $12.;
	format 	perf12mois	best32.;
	format 	perf36mois	$5.;
	format 	perf60mois	$5.;
	format 	fraisGestionCommentaire	$256.;
	format 	objectifGestion	$256.;
	format 	souscripteurs	$256.;
	format 	orientationPlacement	$256.;
	format 	dateClotureExercice	$256.;
	format 	periodeVL	$256.;
	format 	sitePublication	$256.;
	format  conditionSouscripRachat	$256.;
	format 	dateValidation		$12.;
	format 	minimumRisqueAction	$5.;
	format 	dateAgrementAMF		$12.;
	format 	Tfe	$20.;
	format 	perfAm1	$8.;
	format 	perfAm2	$8.;
	format 	perfAm3	$8.;
	format 	perfAm4	$8.;
	format 	perfAm5	$8.;
	format 	perfAm6	$8.;
	format 	perfAm7	$8.;
	format 	perfAm8	$8.;
	format 	UnitedureeMinReco	$7.;
	format 	perfYTD	$5.;
	format 	volat_12mois	$5.;
	format 	frequence_text	$30.;
	format 	prospectus_url	$50.;
	format 	jour_valorisation	$8.;
	format 	risque	$6.;
	format	VIDE	$4.;
	input			
		CodeIsin	$ 	
		Libelle	$	
		nature	$	
		categAMF	$	
		categEPF	$	
		zoneGeo	$	
		dureeMinReco $ 		
		affectationResultat	$	
		fiscaliteText	$	
		deviseComptable	$	
		dateCloture	$	
		dateOuverture	$	
		promoteur	$	
		adressePromoteur	$	
		steGestion	$	
		adresseSteGestion	$	
		steAdmDelegue	$	
		steFinAdmDelegue	$	
		depositaire	$	
		adresseDepositaire	$	
		commissaireCpte	$	
		souscription	$	
		rachat	$	
		fraisGestion	$	
		datePerformance	$
		perf12mois		
		perf36mois	$	
		perf60mois $		
		fraisGestionCommentaire	$	
		objectifGestion	$	
		souscripteurs	$
		orientationPlacement	$	
		dateClotureExercice	$	
		periodeVL	$	
		sitePublication	$	
		conditionSouscripRachat	$	
		dateValidation	$	
		minimumRisqueAction	$	
		dateAgrementAMF	$	
		Tfe	$	
		perfAm1	$	
		perfAm2	$	
		perfAm3	$	
		perfAm4	$	
		perfAm5	$	
		perfAm6	$	
		perfAm7	$	
		perfAm8	$	
		UnitedureeMinReco $		
		perfYTD	$	
		volat_12mois $		
		frequence_text	$	
		prospectus_url	$	
		jour_valorisation	$	
		risque 	$	
		VIDE $;
run;

 

Thank you for your help!

Best regards,

Marie

 

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

@SASdevAnneMarie,

 

I have written a new SAS program.  This new SAS program must be run before the main import program.  This new SAS program requires that each record end with "VIDELF" or the program will not run correctly.

 

The new SAS program does several things:

  1. It replaces VIDELF (incorrect) with VIDECRLF (correct).
  2. It removes LF's that are in error and replaces them with a semi colon.
  3. It removes CR's that are in error and replaces them with a semi colon.

Here is the code:

 

%LET	Cmnt					=	*;
%LET	LRECL					=	32767;
%LET	Limit					=	50000;

FILENAME	Elig_In		'XXX\Marie_Eligibles_c.csv_20200910 (2).csv';

FILENAME	Elig_Out	'XXX\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';

DATA	_NULL_;
	RETAIN	Carriage_Return		'0D'x;
	RETAIN	Line_Feed			'0A'x;
	RETAIN	Semi_Colon			';';

	%IF	%BQUOTE(&Cmnt)			=	%BQUOTE()			%THEN
		%DO;
			IF	_N_				>	&Limit				THEN
				DO;
					PUTLOG	"NOTE-  ";
					PUTLOG	"NOTE:  Stopping "	_N_=;
					STOP;
				END;
		%END;

	INFILE	Elig_In
		RECFM					=	N
		;

	FILE	Elig_Out
		LRECL					=	&LRECL
		RECFM					=	N
		;

	ARRAY	Chars	[5]	$1	_TEMPORARY_;

	INPUT	Char	$1.;

	LINK	Load_Array;
	LINK	Check_Chars;

	******;
	RETURN;
	******;

	**********;
	Load_Array:
	**********;
		Chars[5]				=	Chars[4];
		Chars[4]				=	Chars[3];
		Chars[3]				=	Chars[2];
		Chars[2]				=	Chars[1];
		Chars[1]				=	UPCASE(Char);
	******;
	RETURN;
	******;

	***********;
	Check_Chars:
	***********;
		IF	Chars[5]			=	'V'					AND
			Chars[4]			=	'I'					AND
			Chars[3]			=	'D'					AND
			Chars[2]			=	'E'					AND
			Chars[1]			=	Line_Feed			THEN
			DO;
				&Cmnt	PUTLOG	"NOTE-  *** VIDE and Line feed detected *** "					_N_= Chars[5]= Chars[4]= Chars[3]= Chars[2]= Chars[1]= Char=;
				PUT	Carriage_Return		$1.;
				PUT	Line_Feed			$1.;
			END;
		ELSE
			DO;
				IF	Char		=	Line_Feed			THEN
					DO;
						&Cmnt	PUTLOG	"NOTE-  *** Stand alone line feed detected *** "		_N_= Char=;
						PUT	Semi_Colon	$1.;
					END;
				ELSE
				IF	Char		=	Carriage_Return		THEN
					DO;
						&Cmnt	PUTLOG	"NOTE-  *** Stand alone carriage return detected *** "	_N_= Char=;
						PUT	Semi_Colon	$1.;
					END;
				ELSE
					DO;
						&Cmnt	PUTLOG	"NOTE-  Other character detected "						_N_= Char=;
						PUT	Char		$1.;
					END;
			END;
	******;
	RETURN;
	******;
RUN;

After you run the above new SAS program, you can run the import program with no other steps in between.

 

This new SAS program will not take care of record 9 which has LF's that are not separators and are not terminators.  It may be possible to remove these, but that will take extra programming, and I need to get back to my regular job.  Really, if the data is this bad, the person who creates the data should fix the data.  If the data cannot be fixed, then the company in charge or reading the data needs to hire a consultant or contractor who has the specialized programming skill needed to work with this kind of data.  I would be happy to do it, but I cannot work for free always.  Usually, I try to give advice or tips so that the person posting to the SAS Community can then continue on their own, but here I am 100% writing the entire program which is not exactly the spirit of the SAS Community as I understand it.  The SAS Community is here to assist but not to be free labor.  I hope you understand.

 

If you have any problems with the program, please let me know, and I will he happy to assist.  If the programs serve you, please mark one of my posts as the solution so that I may at least have a little recognition for my work.  I would appreciate that.

 

Jim

 

View solution in original post

52 REPLIES 52
jimbarbour
Meteorite | Level 14

Marie,

 

My first reaction is that this code doesn't look right to me:

 

dlm=';'||'0D0A'x;

This code is saying that each column should be separated by three characters, taken together:

 

  1. A semi-colon -- ';'
  2. A carriage return -- '0D'x
  3. A line feed -- '0A'x

When I look at the data, I only see the semi-colon between each field.  I would change the delimiter setting to only the below and try re-running.

dlm=';'

Jim

 

SASdevAnneMarie
Barite | Level 11
Thank you,
I tried this, but it doesn’t work:(
jimbarbour
Meteorite | Level 14

Yes, I now understand more, and you are correct. Please see my latest response, the very last one, below.

Jim

SASdevAnneMarie
Barite | Level 11
Thank you!
jimbarbour
Meteorite | Level 14

I changed the program slightly as follows:

data EUROPERF_IMP1;
	infile "R:\users\jbarbou3\Eligibles_c.csv_20200810.csv" 
	MISSOVER
/*	termstr=CRLF*/
	lrecl=32867
	DSD 
	dlm=';'
	RECFM=n
	;

I ran the program, and the data looks like this:

Diagonal_Pattern_2020-09-10_12-53-44.jpg

Do you see the diagonal pattern in the data?  Whenever I have seen a diagonal pattern like that in the past, that typically means that there is a mis-match between the number of columns defined in the INFILE vs. the number of delimiters (in this case a semi-colon) physically present in the data.  I tried adding a MISSOVER parameter, but that did not help.

 

Has the data changed or been redefined in any way recently?  

 

Jim

Tom
Super User Tom
Super User

Your file has multiple end of line characters inserted into the middle of the lines of data so not all lines have the same number of values.

data check;
 infile "&path/&fname" termstr=crlf;
 row+1;
 input;
 words=countw(_infile_,';','mq');
run;

proc freq ; tables words; run;
The FREQ Procedure

                                  Cumulative    Cumulative
words    Frequency     Percent     Frequency      Percent
----------------------------------------------------------
   16           1        0.82             1         0.82
   20           1        0.82             2         1.64
   29           1        0.82             3         2.46
   30           1        0.82             4         3.28
   31           5        4.10             9         7.38
   32           1        0.82            10         8.20
   33          22       18.03            32        26.23
   34          25       20.49            57        46.72
   35          11        9.02            68        55.74
   36           3        2.46            71        58.20
   43           1        0.82            72        59.02
   50           2        1.64            74        60.66
   52           1        0.82            75        61.48
   53           2        1.64            77        63.11
   54           4        3.28            81        66.39
   55          11        9.02            92        75.41
   56          30       24.59           122       100.00

There do not appear to be quotes around them or any other way to safely distinguish the spurios linebreaks from the real end of lines.

Can you have the file recreated without those characters?

jimbarbour
Meteorite | Level 14

@SASdevAnneMarie,

 

I would agree with @Tom that the best thing would be to have the data re-created with the following in mind: 

  1. The data should not include Line Feed ('0A'x) characters within the data. 
  2. The delimiters must consistently be a semi-colon.  
  3. The same number of delimiters must be on each line.
  4. The line terminators should all be a combination of a Carriage Return ('0D'x) and a Line Feed ('0A'x) character (CRLF).  The data is already correct as it is, but this must not be changed.

However, if it would be difficult to re-create the data, the data can be made to work.  I made a few changes to the program and the data, and I got the following:

Fr_Data_Corrected_2020-09-10_12-53-44.jpg

Notice that the data is now aligned properly.

 

Here is what I did:

First, the data sometimes contained Line Feed characters ('0A'x) as delimiters and sometimes the data contained semi-colons as delimiters.  I changed all the solo Line Feed characters ('0A'x) to a semi-colon.  Note that however each line is terminated by a combination Carriage Return ('0D'x) Line Feed ('0A'x).  The combination must be preserved.  In order to preserve the combinations, I first changed all the combinations ('0D0A'x) to '^^^'.  Then I changed all the solo Line Feeds to a semi-colon.  Then I changed all the '^^^' back to a combination Carriage Return Line Feed ('0D0A'x).  It's actually easier that my description may make it sound.  I did this in Notepad++ but any good editor should suffice.

 

Second, I changed the INFILE statement as follows:

 

	infile "XXXXXXXXXX\Eligibles_c.csv_20200810.csv" 
	termstr=CRLF
	lrecl=32767
	FIRSTOBS=2
	MISSOVER
	RECFM=V
	dlm=';'
	DSD 
	;

The important changes are:

  1. The FIRSTOBS must be set to "2" so that the column headers are not treated as data.
  2. The RECFM must be "V" (variable) not "N" (None, which implies streamed data).
  3. The DLM must be ";" and only a semi-colon.  CRLF should not be part of the the delimiter.
  4. The TERMSTR must be CRLF.

I did change the LRECL slightly, but this is not too important as long as the LRECL is longer than the longest record.  If you start getting truncation, you should increase this parameter.

 

So, in conclusion, I fully agree with @Tom that the best thing would be to go back to whoever created the data in the first place and have them re-create the data correctly.  However, if you cannot reasonably have the data re-created, then you can change the solo Line Feed characters to semi-colons, make the program changes I described above, and the program will work correctly.

 

I hope this is helpful,

 

Jim

 

SASdevAnneMarie
Barite | Level 11
Thank you very much!
You changed the line feeds on Notepad please?
jimbarbour
Meteorite | Level 14

I used Notepad++ not regular Notepad.  There are probably other editors that would work as well, but I believe that Notepad++ is free.

 

Jim

SASdevAnneMarie
Barite | Level 11
Thank you!
Is it possible to make this changes automatically? I’ll receive this type of file for Input for my program.:(
Tom
Super User Tom
Super User

@SASdevAnneMarie wrote:
Thank you!
Is it possible to make this changes automatically? I’ll receive this type of file for Input for my program.:(

You need to talk to the source for this file and see if they can adjust there process to produce a file that can be parsed.  Find out what software they are using to create the file and perhaps there are simple changes in how they create the file that could eliminate this issue.

 

Ask them to remove any carriage return ('0D'x) or linefeed ('0A'x) values from the data fields.  They can either just strip them out or replace them with some other character(s).  Make sure that fields are separated by the same delimiter. And that any value that contains the delimiter is enclosed in quotes.  Any values that contain quotes are also enclosed in quotes and the existing quotes are doubled up.

 

You could try to recreate the manual step of replacing linefeeds with semi-colons.

filename copy temp;
data _null_;
  infile "&path/&fname" termstr=crlf ;
  input;
  _infile_=translate(_infile_,';','0A'x);
  file copy;
  put _infile_;
run;

But when I do that I notice that some of the lines are close to the limit of 32K bytes that can be handled by the _INFILE_ automatic variable. 

     The minimum record length was 717.
     The maximum record length was 32489.

So you might need a more complicated process instead that reads the file character by character instead.

jimbarbour
Meteorite | Level 14

@Tom,

 

I noticed that as well, that the some of the records in the data are fairly close to the 32,767 limit.  I wrote a quick program that can replace the Line Feed characters with semi-colons and handle records longer than 32,767.

%LET	LRECL	=	64000;

DATA	_NULL_;
	INFILE	"XXX\Eligibles_c.csv_20200810_Original.csv"
		LENGTH			=	Rec_Leng
		LRECL			=	&LRECL
		TERMSTR			=	CRLF
		RECFM			=	V
		;

	INPUT	@;

	IF	Rec_Leng		>=	&LRECL	THEN
		DO;
			CALL	SYMPUTX('SYSCC', 8, 'G');
			PUTLOG	"ERROR-  ";
			PUTLOG	"ERROR:  Record length >= &LRECL..";
			PUTLOG	"ERROR-  Truncation probable.";
			PUTLOG	"ERROR-  This run is bad.";
			PUTLOG	"ERROR-  Increase LRECL and re-run.";
			ABORT	CANCEL;
		END;

	IF	Rec_Leng		<=	32767	THEN
		DO;
			INPUT	@1		Work1	$VARYING32767.	Rec_Leng;
			Work1		=	TRANWRD(Work1, '0D0A'x, '^^^^');
			Work1		=	TRANWRD(Work1, '0A'x, ';');
			Work1		=	TRANWRD(Work1, '^^^^', '0D0A'x);
			FILE			"XXX\Eligibles_c.csv_20200810_SAS_Edit.csv"
				LRECL	=	&LRECL
				RECFM	=	V;
			PUT	@1			Work1;
		END;
	ELSE
		DO;
			Work_Leng	=	Rec_Leng - 32767;
			INPUT	@1		Work1	$CHAR32767.
					@32768	Work2	$VARYING32767.	Work_Leng;
			Work1		=	TRANWRD(Work1, '0D0A'x, '^^^^');
			Work1		=	TRANWRD(Work1, '0A'x, ';');
			Work1		=	TRANWRD(Work1, '^^^^', '0D0A'x);
			Work2		=	TRANWRD(Work2, '0D0A'x, '^^^^');
			Work2		=	TRANWRD(Work2, '0A'x, ';');
			Work2		=	TRANWRD(Work2, '^^^^', '0D0A'x);
			FILE			"R:\users\jbarbou3\Eligibles_c.csv_20200810_SAS_Edit.csv"
				LRECL	=	&LRECL
				RECFM	=	V;
			PUT	@1			Work1
				@32768		Work2
				;
		END;
RUN;

I've run some tests, and it does the job.  It's there if @SASdevAnneMarie encounters records that are longer than 32,767.

 

Jim

SASdevAnneMarie
Barite | Level 11
Hello,
Thank you!
I can't undersatand the "LENGTH= Rec_Leng", I understood the other lines.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 52 replies
  • 3582 views
  • 19 likes
  • 5 in conversation