Hello Experts,
I can't import the attached file.
I don't know what is wrong with my code, before it worked well.
Now the data is completely wrong from the line 756 (if don't edit the file).
If I edit the file (even just delete the lines to make the file more light) the data is COMPLITELY wrong.
The original file is very huge, I can't download it, even with compress, I attached the edited file.
data EUROPERF_IMP1;
dlm=';'||'0D0A'x;
infile "XXXXXX\Eligibles_c.csv_20200810.csv" dlm=dlm DSD lrecl=32867 /*termstr=CRLF*/
RECFM=n;
informat CodeIsin $12.;
informat Libelle $50.;
informat nature $12.;
informat categAMF $32.;
informat categEPF $50.;
informat zoneGeo $40.;
informat dureeMinReco $8.;
informat affectationResultat $40.;
informat fiscaliteText $50.;
informat deviseComptable $3.;
informat dateCloture $32.;
informat dateOuverture $32.;
informat promoteur $50.;
informat adressePromoteur $80.;
informat steGestion $70.;
informat adresseSteGestion $80.;
informat steAdmDelegue $80.;
informat steFinAdmDelegue $80.;
informat depositaire $80.;
informat adresseDepositaire $80.;
informat commissaireCpte $50.;
informat souscription $100.;
informat rachat $50.;
informat fraisGestion $70.;
informat datePerformance $12.;
informat perf12mois best32.;
informat perf36mois $5.;
informat perf60mois $5.;
informat fraisGestionCommentaire $256.;
informat objectifGestion $256.;
informat souscripteurs $256.;
informat orientationPlacement $256.;
informat dateClotureExercice $256.;
informat periodeVL $256.;
informat sitePublication $256.;
informat conditionSouscripRachat $256.;
informat dateValidation $12.;
informat minimumRisqueAction $5.;
informat dateAgrementAMF $12.;
informat Tfe $20.;
informat perfAm1 $8.;
informat perfAm2 $8.;
informat perfAm3 $8.;
informat perfAm4 $8.;
informat perfAm5 $8.;
informat perfAm6 $8.;
informat perfAm7 $8.;
informat perfAm8 $8.;
informat UnitedureeMinReco $7.;
informat perfYTD $5.;
informat volat_12mois $5.;
informat frequence_text $30.;
informat prospectus_url $50.;
informat jour_valorisation $8.;
informat risque $6.;
informat VIDE $4.;
format CodeIsin $12.;
format Libelle $50.;
format nature $12.;
format categAMF $32.;
format categEPF $50.;
format zoneGeo $40.;
format dureeMinReco $8.;
format affectationResultat $40.;
format fiscaliteText $50.;
format deviseComptable $3.;
format dateCloture $32.;
format dateOuverture $32.;
format promoteur $50.;
format adressePromoteur $80.;
format steGestion $70.;
format adresseSteGestion $80.;
format steAdmDelegue $80.;
format steFinAdmDelegue $80.;
format depositaire $80.;
format adresseDepositaire $80.;
format commissaireCpte $50.;
format souscription $100.;
format rachat $5.;
format fraisGestion $70.;
format datePerformance $12.;
format perf12mois best32.;
format perf36mois $5.;
format perf60mois $5.;
format fraisGestionCommentaire $256.;
format objectifGestion $256.;
format souscripteurs $256.;
format orientationPlacement $256.;
format dateClotureExercice $256.;
format periodeVL $256.;
format sitePublication $256.;
format conditionSouscripRachat $256.;
format dateValidation $12.;
format minimumRisqueAction $5.;
format dateAgrementAMF $12.;
format Tfe $20.;
format perfAm1 $8.;
format perfAm2 $8.;
format perfAm3 $8.;
format perfAm4 $8.;
format perfAm5 $8.;
format perfAm6 $8.;
format perfAm7 $8.;
format perfAm8 $8.;
format UnitedureeMinReco $7.;
format perfYTD $5.;
format volat_12mois $5.;
format frequence_text $30.;
format prospectus_url $50.;
format jour_valorisation $8.;
format risque $6.;
format VIDE $4.;
input
CodeIsin $
Libelle $
nature $
categAMF $
categEPF $
zoneGeo $
dureeMinReco $
affectationResultat $
fiscaliteText $
deviseComptable $
dateCloture $
dateOuverture $
promoteur $
adressePromoteur $
steGestion $
adresseSteGestion $
steAdmDelegue $
steFinAdmDelegue $
depositaire $
adresseDepositaire $
commissaireCpte $
souscription $
rachat $
fraisGestion $
datePerformance $
perf12mois
perf36mois $
perf60mois $
fraisGestionCommentaire $
objectifGestion $
souscripteurs $
orientationPlacement $
dateClotureExercice $
periodeVL $
sitePublication $
conditionSouscripRachat $
dateValidation $
minimumRisqueAction $
dateAgrementAMF $
Tfe $
perfAm1 $
perfAm2 $
perfAm3 $
perfAm4 $
perfAm5 $
perfAm6 $
perfAm7 $
perfAm8 $
UnitedureeMinReco $
perfYTD $
volat_12mois $
frequence_text $
prospectus_url $
jour_valorisation $
risque $
VIDE $;
run;
Thank you for your help!
Best regards,
Marie
I have written a new SAS program. This new SAS program must be run before the main import program. This new SAS program requires that each record end with "VIDELF" or the program will not run correctly.
The new SAS program does several things:
Here is the code:
%LET Cmnt = *;
%LET LRECL = 32767;
%LET Limit = 50000;
FILENAME Elig_In 'XXX\Marie_Eligibles_c.csv_20200910 (2).csv';
FILENAME Elig_Out 'XXX\Marie_Eligibles_c.csv_20200910 (2)_SAS_Pre_Edit.csv';
DATA _NULL_;
RETAIN Carriage_Return '0D'x;
RETAIN Line_Feed '0A'x;
RETAIN Semi_Colon ';';
%IF %BQUOTE(&Cmnt) = %BQUOTE() %THEN
%DO;
IF _N_ > &Limit THEN
DO;
PUTLOG "NOTE- ";
PUTLOG "NOTE: Stopping " _N_=;
STOP;
END;
%END;
INFILE Elig_In
RECFM = N
;
FILE Elig_Out
LRECL = &LRECL
RECFM = N
;
ARRAY Chars [5] $1 _TEMPORARY_;
INPUT Char $1.;
LINK Load_Array;
LINK Check_Chars;
******;
RETURN;
******;
**********;
Load_Array:
**********;
Chars[5] = Chars[4];
Chars[4] = Chars[3];
Chars[3] = Chars[2];
Chars[2] = Chars[1];
Chars[1] = UPCASE(Char);
******;
RETURN;
******;
***********;
Check_Chars:
***********;
IF Chars[5] = 'V' AND
Chars[4] = 'I' AND
Chars[3] = 'D' AND
Chars[2] = 'E' AND
Chars[1] = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** VIDE and Line feed detected *** " _N_= Chars[5]= Chars[4]= Chars[3]= Chars[2]= Chars[1]= Char=;
PUT Carriage_Return $1.;
PUT Line_Feed $1.;
END;
ELSE
DO;
IF Char = Line_Feed THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone line feed detected *** " _N_= Char=;
PUT Semi_Colon $1.;
END;
ELSE
IF Char = Carriage_Return THEN
DO;
&Cmnt PUTLOG "NOTE- *** Stand alone carriage return detected *** " _N_= Char=;
PUT Semi_Colon $1.;
END;
ELSE
DO;
&Cmnt PUTLOG "NOTE- Other character detected " _N_= Char=;
PUT Char $1.;
END;
END;
******;
RETURN;
******;
RUN;
After you run the above new SAS program, you can run the import program with no other steps in between.
This new SAS program will not take care of record 9 which has LF's that are not separators and are not terminators. It may be possible to remove these, but that will take extra programming, and I need to get back to my regular job. Really, if the data is this bad, the person who creates the data should fix the data. If the data cannot be fixed, then the company in charge or reading the data needs to hire a consultant or contractor who has the specialized programming skill needed to work with this kind of data. I would be happy to do it, but I cannot work for free always. Usually, I try to give advice or tips so that the person posting to the SAS Community can then continue on their own, but here I am 100% writing the entire program which is not exactly the spirit of the SAS Community as I understand it. The SAS Community is here to assist but not to be free labor. I hope you understand.
If you have any problems with the program, please let me know, and I will he happy to assist. If the programs serve you, please mark one of my posts as the solution so that I may at least have a little recognition for my work. I would appreciate that.
Jim
Marie,
My first reaction is that this code doesn't look right to me:
dlm=';'||'0D0A'x;
This code is saying that each column should be separated by three characters, taken together:
When I look at the data, I only see the semi-colon between each field. I would change the delimiter setting to only the below and try re-running.
dlm=';'
Jim
Yes, I now understand more, and you are correct. Please see my latest response, the very last one, below.
Jim
I changed the program slightly as follows:
data EUROPERF_IMP1;
infile "R:\users\jbarbou3\Eligibles_c.csv_20200810.csv"
MISSOVER
/* termstr=CRLF*/
lrecl=32867
DSD
dlm=';'
RECFM=n
;
I ran the program, and the data looks like this:
Do you see the diagonal pattern in the data? Whenever I have seen a diagonal pattern like that in the past, that typically means that there is a mis-match between the number of columns defined in the INFILE vs. the number of delimiters (in this case a semi-colon) physically present in the data. I tried adding a MISSOVER parameter, but that did not help.
Has the data changed or been redefined in any way recently?
Jim
Your file has multiple end of line characters inserted into the middle of the lines of data so not all lines have the same number of values.
data check;
infile "&path/&fname" termstr=crlf;
row+1;
input;
words=countw(_infile_,';','mq');
run;
proc freq ; tables words; run;
The FREQ Procedure Cumulative Cumulative words Frequency Percent Frequency Percent ---------------------------------------------------------- 16 1 0.82 1 0.82 20 1 0.82 2 1.64 29 1 0.82 3 2.46 30 1 0.82 4 3.28 31 5 4.10 9 7.38 32 1 0.82 10 8.20 33 22 18.03 32 26.23 34 25 20.49 57 46.72 35 11 9.02 68 55.74 36 3 2.46 71 58.20 43 1 0.82 72 59.02 50 2 1.64 74 60.66 52 1 0.82 75 61.48 53 2 1.64 77 63.11 54 4 3.28 81 66.39 55 11 9.02 92 75.41 56 30 24.59 122 100.00
There do not appear to be quotes around them or any other way to safely distinguish the spurios linebreaks from the real end of lines.
Can you have the file recreated without those characters?
I would agree with @Tom that the best thing would be to have the data re-created with the following in mind:
However, if it would be difficult to re-create the data, the data can be made to work. I made a few changes to the program and the data, and I got the following:
Notice that the data is now aligned properly.
Here is what I did:
First, the data sometimes contained Line Feed characters ('0A'x) as delimiters and sometimes the data contained semi-colons as delimiters. I changed all the solo Line Feed characters ('0A'x) to a semi-colon. Note that however each line is terminated by a combination Carriage Return ('0D'x) Line Feed ('0A'x). The combination must be preserved. In order to preserve the combinations, I first changed all the combinations ('0D0A'x) to '^^^'. Then I changed all the solo Line Feeds to a semi-colon. Then I changed all the '^^^' back to a combination Carriage Return Line Feed ('0D0A'x). It's actually easier that my description may make it sound. I did this in Notepad++ but any good editor should suffice.
Second, I changed the INFILE statement as follows:
infile "XXXXXXXXXX\Eligibles_c.csv_20200810.csv"
termstr=CRLF
lrecl=32767
FIRSTOBS=2
MISSOVER
RECFM=V
dlm=';'
DSD
;
The important changes are:
I did change the LRECL slightly, but this is not too important as long as the LRECL is longer than the longest record. If you start getting truncation, you should increase this parameter.
So, in conclusion, I fully agree with @Tom that the best thing would be to go back to whoever created the data in the first place and have them re-create the data correctly. However, if you cannot reasonably have the data re-created, then you can change the solo Line Feed characters to semi-colons, make the program changes I described above, and the program will work correctly.
I hope this is helpful,
Jim
I used Notepad++ not regular Notepad. There are probably other editors that would work as well, but I believe that Notepad++ is free.
Jim
@SASdevAnneMarie wrote:
Thank you!
Is it possible to make this changes automatically? I’ll receive this type of file for Input for my program.:(
You need to talk to the source for this file and see if they can adjust there process to produce a file that can be parsed. Find out what software they are using to create the file and perhaps there are simple changes in how they create the file that could eliminate this issue.
Ask them to remove any carriage return ('0D'x) or linefeed ('0A'x) values from the data fields. They can either just strip them out or replace them with some other character(s). Make sure that fields are separated by the same delimiter. And that any value that contains the delimiter is enclosed in quotes. Any values that contain quotes are also enclosed in quotes and the existing quotes are doubled up.
You could try to recreate the manual step of replacing linefeeds with semi-colons.
filename copy temp;
data _null_;
infile "&path/&fname" termstr=crlf ;
input;
_infile_=translate(_infile_,';','0A'x);
file copy;
put _infile_;
run;
But when I do that I notice that some of the lines are close to the limit of 32K bytes that can be handled by the _INFILE_ automatic variable.
The minimum record length was 717. The maximum record length was 32489.
So you might need a more complicated process instead that reads the file character by character instead.
@Tom,
I noticed that as well, that the some of the records in the data are fairly close to the 32,767 limit. I wrote a quick program that can replace the Line Feed characters with semi-colons and handle records longer than 32,767.
%LET LRECL = 64000;
DATA _NULL_;
INFILE "XXX\Eligibles_c.csv_20200810_Original.csv"
LENGTH = Rec_Leng
LRECL = &LRECL
TERMSTR = CRLF
RECFM = V
;
INPUT @;
IF Rec_Leng >= &LRECL THEN
DO;
CALL SYMPUTX('SYSCC', 8, 'G');
PUTLOG "ERROR- ";
PUTLOG "ERROR: Record length >= &LRECL..";
PUTLOG "ERROR- Truncation probable.";
PUTLOG "ERROR- This run is bad.";
PUTLOG "ERROR- Increase LRECL and re-run.";
ABORT CANCEL;
END;
IF Rec_Leng <= 32767 THEN
DO;
INPUT @1 Work1 $VARYING32767. Rec_Leng;
Work1 = TRANWRD(Work1, '0D0A'x, '^^^^');
Work1 = TRANWRD(Work1, '0A'x, ';');
Work1 = TRANWRD(Work1, '^^^^', '0D0A'x);
FILE "XXX\Eligibles_c.csv_20200810_SAS_Edit.csv"
LRECL = &LRECL
RECFM = V;
PUT @1 Work1;
END;
ELSE
DO;
Work_Leng = Rec_Leng - 32767;
INPUT @1 Work1 $CHAR32767.
@32768 Work2 $VARYING32767. Work_Leng;
Work1 = TRANWRD(Work1, '0D0A'x, '^^^^');
Work1 = TRANWRD(Work1, '0A'x, ';');
Work1 = TRANWRD(Work1, '^^^^', '0D0A'x);
Work2 = TRANWRD(Work2, '0D0A'x, '^^^^');
Work2 = TRANWRD(Work2, '0A'x, ';');
Work2 = TRANWRD(Work2, '^^^^', '0D0A'x);
FILE "R:\users\jbarbou3\Eligibles_c.csv_20200810_SAS_Edit.csv"
LRECL = &LRECL
RECFM = V;
PUT @1 Work1
@32768 Work2
;
END;
RUN;
I've run some tests, and it does the job. It's there if @SASdevAnneMarie encounters records that are longer than 32,767.
Jim
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.