I thought about it a bit more during my lunch time, and I am now able to get the data for the 9th row. I think it would be important for you to use this latest program because I changed my streaming definition from $1 to $CHAR1. A definition of $1 can result in some data oddities. I should have used $CHAR1 all along, but I didn't think about it until later.
Here is the latest pre-edit program. You will want to run this program before you run the import program:
%LET Cmnt = *;
%LET LRECL = 65535;
%LET Limit = 999999999;
FILENAME Elig_In 'R:\users\jbarbou3\Marie_Eligibles_c.csv_20200910 (2).csv';
FILENAME Elig_Out 'R:\users\jbarbou3\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';
DATA _NULL_;
RETAIN Carriage_Return '0D'x;
RETAIN Line_Feed '0A'x;
RETAIN Double_Quote '"';
RETAIN Semi_Colon ';';
RETAIN Space ' ';
RETAIN In_Quotes 0;
RETAIN Record_Cnt 1;
%IF %BQUOTE(&Cmnt) = %BQUOTE() %THEN
%DO;
IF _N_ > &Limit THEN
DO;
PUTLOG "NOTE- ";
PUTLOG "NOTE: Stopping " _N_=;
STOP;
END;
%END;
INFILE Elig_In
RECFM = N
;
FILE Elig_Out
LRECL = &LRECL
RECFM = N
;
ARRAY Chars [5] $1 _TEMPORARY_;
INPUT Char $CHAR1.;
&Cmnt IF _N_ = 1 THEN
PUTLOG "NOTE: Starting Run";
LINK Load_Array;
LINK Check_Chars;
******;
RETURN;
******;
**********;
Load_Array:
**********;
Chars[5] = Chars[4];
Chars[4] = Chars[3];
Chars[3] = Chars[2];
Chars[2] = Chars[1];
Chars[1] = UPCASE(Char);
******;
RETURN;
******;
***********;
Check_Chars:
***********;
IF Chars[5] = 'V' AND
Chars[4] = 'I' AND
Chars[3] = 'D' AND
Chars[2] = 'E' AND
Chars[1] = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** VIDE and Line feed detected *** " _N_= Record_Cnt= Chars[5]= Chars[4]= Chars[3]= Chars[2]= Chars[1]= Char=;
In_Quotes = 0;
PUT Carriage_Return $1.;
PUT Line_Feed $1.;
Record_Cnt + 1;
END;
ELSE
IF Char = Double_Quote THEN
DO;
&Cmnt PUTLOG "NOTE- *** Double quote detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- *** In_Quotes set to false *** " _N_= Record_Cnt= Char=;
In_Quotes = 0;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- *** In_Quotes set to true *** " _N_= Record_Cnt= Char=;
In_Quotes = 1;
END;
PUT Double_Quote $1.;
END;
ELSE
IF Char = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone line feed detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ### Chg embedded line feed to space ### " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- %%% Change line feed to semicolon %%% " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
IF Char = Carriage_Return THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone carriage return detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ### Change embedded CR to space ### " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- @@@ Chg carriage return to semicolon @@@ " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
IF Char = Semi_Colon THEN
DO;
&Cmnt PUTLOG "NOTE- Stand alone semi colon detected " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ^^^ Chg embedded semi colon to space^^^ " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- Retaining semi colon " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
DO;
*&Cmnt PUTLOG "NOTE- Other character detected " _N_= Record_Cnt= Char=;
PUT Char $1.;
END;
******;
RETURN;
******;
RUN;
After you run the above pre-edit program, then run the below import program. I changed a couple of things from your original import program, so you will want to use this program. Among other things, I increased the LRECL to prevent trunction. I also increased the size of some of the variables like fraisGestionCommentaire to $1024. These variables were $256 and were truncating a lot of data. Even now, some of the data is getting truncated, so you may want to increase them to an even larger size.
%LET LRECL = 65535;
FILENAME Elig 'R:\users\jbarbou3\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';
data EUROPERF_IMP1_SAS;
INFILE Elig
LRECL = &LRECL
TERMSTR = CRLF
DLM = ';'
FIRSTOBS = 2
RECFM = V
MISSOVER
DSD
;
/* 56 Columns */
informat CodeIsin $12.;
informat Libelle $50.;
informat nature $12.;
informat categAMF $32.;
informat categEPF $50.;
informat zoneGeo $40.;
informat dureeMinReco $8.;
informat affectationResultat $40.;
informat fiscaliteText $50.;
informat deviseComptable $3.;
informat dateCloture $32.;
informat dateOuverture $32.;
informat promoteur $50.;
informat adressePromoteur $80.;
informat steGestion $70.;
informat adresseSteGestion $80.;
informat steAdmDelegue $80.;
informat steFinAdmDelegue $80.;
informat depositaire $80.;
informat adresseDepositaire $80.;
informat commissaireCpte $50.;
informat souscription $100.;
informat rachat $50.;
informat fraisGestion $70.;
informat datePerformance $12.;
informat perf12mois BEST32.;
informat perf36mois $5.;
informat perf60mois $5.;
informat fraisGestionCommentaire $1024.;
informat objectifGestion $1024.;
informat souscripteurs $1024.;
informat orientationPlacement $1024.;
informat dateClotureExercice $1024.;
informat periodeVL $1024.;
informat sitePublication $1024.;
informat conditionSouscripRachat $1024.;
informat dateValidation $12.;
informat minimumRisqueAction $5.;
informat dateAgrementAMF $12.;
informat Tfe $20.;
informat perfAm1 $8.;
informat perfAm2 $8.;
informat perfAm3 $8.;
informat perfAm4 $8.;
informat perfAm5 $8.;
informat perfAm6 $8.;
informat perfAm7 $8.;
informat perfAm8 $8.;
informat UnitedureeMinReco $7.;
informat perfYTD $5.;
informat volat_12mois $5.;
informat frequence_text $30.;
informat prospectus_url $50.;
informat jour_valorisation $8.;
informat risque $6.;
informat VIDE $4.;
/* 56 Columns */
format CodeIsin $12.;
format Libelle $50.;
format nature $12.;
format categAMF $32.;
format categEPF $50.;
format zoneGeo $40.;
format dureeMinReco $8.;
format affectationResultat $40.;
format fiscaliteText $50.;
format deviseComptable $3.;
format dateCloture $32.;
format dateOuverture $32.;
format promoteur $50.;
format adressePromoteur $80.;
format steGestion $70.;
format adresseSteGestion $80.;
format steAdmDelegue $80.;
format steFinAdmDelegue $80.;
format depositaire $80.;
format adresseDepositaire $80.;
format commissaireCpte $50.;
format souscription $100.;
format rachat $5.;
format fraisGestion $70.;
format datePerformance $12.;
format perf12mois BEST32.;
format perf36mois $5.;
format perf60mois $5.;
format fraisGestionCommentaire $1024.;
format objectifGestion $1024.;
format souscripteurs $1024.;
format orientationPlacement $1024.;
format dateClotureExercice $1024.;
format periodeVL $1024.;
format sitePublication $1024.;
format conditionSouscripRachat $1024.;
format dateValidation $12.;
format minimumRisqueAction $5.;
format dateAgrementAMF $12.;
format Tfe $20.;
format perfAm1 $8.;
format perfAm2 $8.;
format perfAm3 $8.;
format perfAm4 $8.;
format perfAm5 $8.;
format perfAm6 $8.;
format perfAm7 $8.;
format perfAm8 $8.;
format UnitedureeMinReco $7.;
format perfYTD $5.;
format volat_12mois $5.;
format frequence_text $30.;
format prospectus_url $50.;
format jour_valorisation $8.;
format risque $6.;
format VIDE $4.;
/* 56 Columns */
input
CodeIsin $
Libelle $
nature $
categAMF $
categEPF $
zoneGeo $
dureeMinReco $
affectationResultat $
fiscaliteText $
deviseComptable $
dateCloture $
dateOuverture $
promoteur $
adressePromoteur $
steGestion $
adresseSteGestion $
steAdmDelegue $
steFinAdmDelegue $
depositaire $
adresseDepositaire $
commissaireCpte $
souscription $
rachat $
fraisGestion $
datePerformance $
perf12mois
perf36mois $
perf60mois $
fraisGestionCommentaire $
objectifGestion $
souscripteurs $
orientationPlacement $
dateClotureExercice $
periodeVL $
sitePublication $
conditionSouscripRachat $
dateValidation $
minimumRisqueAction $
dateAgrementAMF $
Tfe $
perfAm1 $
perfAm2 $
perfAm3 $
perfAm4 $
perfAm5 $
perfAm6 $
perfAm7 $
perfAm8 $
UnitedureeMinReco $
perfYTD $
volat_12mois $
frequence_text $
prospectus_url $
jour_valorisation $
risque $
VIDE $;
run;
Here are the results; I think they look good.
There is still one problem in the data in record number 513. There is an extra delimiter before "risque". The extra delimiter is causing a mis-alignment in the data. You can just delete this one by hand.
Even though the topic is already marked as "solved," I hope you will see this latest program and that it will be useful to you.
Regards,
Jim
You are welcome, and I am glad I was able to solve it. No need to apologize for the bad data. The person who created such a mess should apologize! I hope my latest program will be useful to you for some time to come.
Jim
Hello Jim,
I tried to replace dlm=';'||'0D0A'x by dlm=';'||'0A'x in my code but it doesn't work. 😞
I made a request for data correction.
Thank you very much for your help!
Marie
Sorry to take so long to reply. I don't know how, but I missed your question.
The delimiter should be a semi-colon only. Do not add a '0A'x to the delimiter. The code should be as shown below. I have included both the pre-edit and he main import programs in one set of SAS code. This is how they should be run.
Jim
%LET Cmnt = *;
%LET LRECL = 65535;
%LET Limit = 999999999;
FILENAME Elig_In 'R:\users\jbarbou3\Marie_Eligibles_c.csv_20200910 (2).csv';
FILENAME Elig_Out 'R:\users\jbarbou3\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';
DATA _NULL_;
RETAIN Carriage_Return '0D'x;
RETAIN Line_Feed '0A'x;
RETAIN Double_Quote '"';
RETAIN Semi_Colon ';';
RETAIN Space ' ';
RETAIN In_Quotes 0;
RETAIN Record_Cnt 1;
%IF %BQUOTE(&Cmnt) = %BQUOTE() %THEN
%DO;
IF _N_ > &Limit THEN
DO;
PUTLOG "NOTE- ";
PUTLOG "NOTE: Stopping " _N_=;
STOP;
END;
%END;
INFILE Elig_In
RECFM = N
;
FILE Elig_Out
LRECL = &LRECL
RECFM = N
;
ARRAY Chars [5] $1 _TEMPORARY_;
INPUT Char $CHAR1.;
&Cmnt IF _N_ = 1 THEN
PUTLOG "NOTE: Starting Run";
LINK Load_Array;
LINK Check_Chars;
******;
RETURN;
******;
**********;
Load_Array:
**********;
Chars[5] = Chars[4];
Chars[4] = Chars[3];
Chars[3] = Chars[2];
Chars[2] = Chars[1];
Chars[1] = UPCASE(Char);
******;
RETURN;
******;
***********;
Check_Chars:
***********;
IF Chars[5] = 'V' AND
Chars[4] = 'I' AND
Chars[3] = 'D' AND
Chars[2] = 'E' AND
Chars[1] = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** VIDE and Line feed detected *** " _N_= Record_Cnt= Chars[5]= Chars[4]= Chars[3]= Chars[2]= Chars[1]= Char=;
In_Quotes = 0;
PUT Carriage_Return $1.;
PUT Line_Feed $1.;
Record_Cnt + 1;
END;
ELSE
IF Char = Double_Quote THEN
DO;
&Cmnt PUTLOG "NOTE- *** Double quote detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- *** In_Quotes set to false *** " _N_= Record_Cnt= Char=;
In_Quotes = 0;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- *** In_Quotes set to true *** " _N_= Record_Cnt= Char=;
In_Quotes = 1;
END;
PUT Double_Quote $1.;
END;
ELSE
IF Char = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone line feed detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ### Chg embedded line feed to space ### " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- %%% Change line feed to semicolon %%% " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
IF Char = Carriage_Return THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone carriage return detected *** " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ### Change embedded CR to space ### " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- @@@ Chg carriage return to semicolon @@@ " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
IF Char = Semi_Colon THEN
DO;
&Cmnt PUTLOG "NOTE- Stand alone semi colon detected " _N_= Record_Cnt= Char=;
IF In_Quotes THEN
DO;
&Cmnt PUTLOG "NOTE- ^^^ Chg embedded semi colon to space^^^ " _N_= Record_Cnt= Char=;
PUT Space $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- Retaining semi colon " _N_= Record_Cnt= Char=;
PUT Semi_Colon $1.;
END;
END;
ELSE
DO;
*&Cmnt PUTLOG "NOTE- Other character detected " _N_= Record_Cnt= Char=;
PUT Char $1.;
END;
******;
RETURN;
******;
RUN;
FILENAME Elig 'XXX\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';
data EUROPERF_IMP1_SAS;
INFILE Elig
LRECL = &LRECL
TERMSTR = CRLF
DLM = ';'
FIRSTOBS = 2
RECFM = V
MISSOVER
DSD
;
/* 56 Columns */
informat CodeIsin $12.;
informat Libelle $50.;
informat nature $12.;
informat categAMF $32.;
informat categEPF $50.;
informat zoneGeo $40.;
informat dureeMinReco $8.;
informat affectationResultat $40.;
informat fiscaliteText $50.;
informat deviseComptable $3.;
informat dateCloture $32.;
informat dateOuverture $32.;
informat promoteur $50.;
informat adressePromoteur $80.;
informat steGestion $70.;
informat adresseSteGestion $80.;
informat steAdmDelegue $80.;
informat steFinAdmDelegue $80.;
informat depositaire $80.;
informat adresseDepositaire $80.;
informat commissaireCpte $50.;
informat souscription $100.;
informat rachat $50.;
informat fraisGestion $70.;
informat datePerformance $12.;
informat perf12mois BEST32.;
informat perf36mois $5.;
informat perf60mois $5.;
informat fraisGestionCommentaire $1024.;
informat objectifGestion $1024.;
informat souscripteurs $1024.;
informat orientationPlacement $1024.;
informat dateClotureExercice $1024.;
informat periodeVL $1024.;
informat sitePublication $1024.;
informat conditionSouscripRachat $1024.;
informat dateValidation $12.;
informat minimumRisqueAction $5.;
informat dateAgrementAMF $12.;
informat Tfe $20.;
informat perfAm1 $8.;
informat perfAm2 $8.;
informat perfAm3 $8.;
informat perfAm4 $8.;
informat perfAm5 $8.;
informat perfAm6 $8.;
informat perfAm7 $8.;
informat perfAm8 $8.;
informat UnitedureeMinReco $7.;
informat perfYTD $5.;
informat volat_12mois $5.;
informat frequence_text $30.;
informat prospectus_url $50.;
informat jour_valorisation $8.;
informat risque $6.;
informat VIDE $4.;
/* 56 Columns */
format CodeIsin $12.;
format Libelle $50.;
format nature $12.;
format categAMF $32.;
format categEPF $50.;
format zoneGeo $40.;
format dureeMinReco $8.;
format affectationResultat $40.;
format fiscaliteText $50.;
format deviseComptable $3.;
format dateCloture $32.;
format dateOuverture $32.;
format promoteur $50.;
format adressePromoteur $80.;
format steGestion $70.;
format adresseSteGestion $80.;
format steAdmDelegue $80.;
format steFinAdmDelegue $80.;
format depositaire $80.;
format adresseDepositaire $80.;
format commissaireCpte $50.;
format souscription $100.;
format rachat $5.;
format fraisGestion $70.;
format datePerformance $12.;
format perf12mois BEST32.;
format perf36mois $5.;
format perf60mois $5.;
format fraisGestionCommentaire $1024.;
format objectifGestion $1024.;
format souscripteurs $1024.;
format orientationPlacement $1024.;
format dateClotureExercice $1024.;
format periodeVL $1024.;
format sitePublication $1024.;
format conditionSouscripRachat $1024.;
format dateValidation $12.;
format minimumRisqueAction $5.;
format dateAgrementAMF $12.;
format Tfe $20.;
format perfAm1 $8.;
format perfAm2 $8.;
format perfAm3 $8.;
format perfAm4 $8.;
format perfAm5 $8.;
format perfAm6 $8.;
format perfAm7 $8.;
format perfAm8 $8.;
format UnitedureeMinReco $7.;
format perfYTD $5.;
format volat_12mois $5.;
format frequence_text $30.;
format prospectus_url $50.;
format jour_valorisation $8.;
format risque $6.;
format VIDE $4.;
/* 56 Columns */
input
CodeIsin $
Libelle $
nature $
categAMF $
categEPF $
zoneGeo $
dureeMinReco $
affectationResultat $
fiscaliteText $
deviseComptable $
dateCloture $
dateOuverture $
promoteur $
adressePromoteur $
steGestion $
adresseSteGestion $
steAdmDelegue $
steFinAdmDelegue $
depositaire $
adresseDepositaire $
commissaireCpte $
souscription $
rachat $
fraisGestion $
datePerformance $
perf12mois
perf36mois $
perf60mois $
fraisGestionCommentaire $
objectifGestion $
souscripteurs $
orientationPlacement $
dateClotureExercice $
periodeVL $
sitePublication $
conditionSouscripRachat $
dateValidation $
minimumRisqueAction $
dateAgrementAMF $
Tfe $
perfAm1 $
perfAm2 $
perfAm3 $
perfAm4 $
perfAm5 $
perfAm6 $
perfAm7 $
perfAm8 $
UnitedureeMinReco $
perfYTD $
volat_12mois $
frequence_text $
prospectus_url $
jour_valorisation $
risque $
VIDE $;
run;
Generally speaking, I don't think you should replace CR with LF. You are welcome to try it, but I don't think that will work with your data. At least in a Windows environment, the CR and LF are meant to work together to indicate the end of a record. The little pre-edit program that I wrote for you should be able to prepare the data for you. You should not have to do any work with Notepad++, at least not on the data I have seen so far.
However, Notepad++ is a good tool, and at some point, the little pre-edit program that I wrote may encounter data that it doesn't know how to handle in which case you may need to do some manual editing before running the pre-edit program.
It's very easy to see the CR and LF characters in Notepad++.
Thereafter, you should be able to see a CR for all carriage return characters and a LF for all line feed characters.
Jim
Yes, your data should have a semi-colon as the data delimiter and a CRLF as the observation delimiter. It is however good practice to enclose text in double quotes, like the following:
"Fixe 0.00 %";"Fixe 0.01 %"
The semi-colon is the delimiter, but Fixe 0.00 % and Fixe 0.01 % are enclosed in double quotes.
There should be no LF characters by themselves alone in the data. The LF character should only be used with a CR; they should always be used together (CRLF); they should always indicate "end of observation."
Jim
Hello Jim,
My problem is that I don’t know how to control the futur file.
Normaly, after correction, I’ll have only the « CRLF » in the end of observation.
For making a test, I would like to put in Notepad ++/Find (« Rechercher » : ‘0A’x) : just to see that I don’t have the soul « LF », but it doesn’t work (the example is attached):
Maybe you know how to check if there are soul « LF ».
Thank you!
For Notepad++, you will want to select Mode étendu.
When you want to search for a line feed (LF), use \n.
When you want to search for a carriage return (CR), use \r.
When you want to search for both together, use \r\n.
Jim
No, you do not need to use Notepad++. I wrote you a small Data step that you need to run before your main Data step. The small Data step will take care of the bad Line Feed characters for you.
Jim
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.