Hello,
I need to import the csv file, my second variable (start and end in yellow) has a line break :
I use the option: dlm='","'||',"';
infile "XXX\tweet_archives_fr.csv" delimiter=dlm
missover lrecl=250 firstobs=2 ;
With this options my second variable is truncated.
Do you know please the option to read correctly this variable ?
Thank you!
Best regards,
Marie
So it looks like something in your process took your original CSV (with the line breaks in it) and added quotes around every line. Those can be removed by either using DEQUOTE() function or reading the lines using DSD option.
But some of the embedded line breaks make lines without any commas and that is making figuring out which line breaks are the real end of line difficult.
Fortunately the first value on each row is a long number, so you can use that fact to do a better job of finding where the real lines start.
So here is a data step to convert your original file into a readable file.
filename fixed temp;
data _null_;
infile original dsd truncover lrecl=32767;
file fixed lrecl=1000000;
input line :$char32767. ;
if notdigit(scan(line,1,','))
and length(scan(line,1,','))>12 then put '|' @;
else put;
len=lengthn(line);
put line $varying32767. len @;
run;
Now you can read the file. You could even use PROC IMPORT to read the file, although the column headers do not make good variable names and PROC IMPORT will make its normal mistakes like treating the first column as a number instead of string.
proc import datafile=fixed dbms=csv out=want replace ;
run;
Hard to tell from a photograph of your text, but in general embedded end of line characters in a field is a problem for SAS. There are some work arounds.
The easiest is if the characters in your field are not exactly the same as those used to mark the real end of lines. So if the characters are either CR or LF and the end of line if CR and LF then you can just add TERMSTR=CRLF to your INFILE statement.
Otherwise you will either need to make a more complex solution.
For example if the fields that have end of line characters in them are quoted you can use that fact to find them and change them into some other character. For safety I would do this by making a copy of the file.
filename fixed temp;
data _null_;
infile "XXX\tweet_archives_fr.csv";
file fixed;
input;
q+countc(_infile_,'"');
q=mod(q,2);
if _n_>1 and not q then put;
put _infile_ @;
run;
....
infile fixed dsd dlm=',' truncover firstobs=2;
The data is attached.
Thank you
I don't really see a pattern in that file. It does appear to have way too many quotes.
Here is summary and dump of the first 20 records.
Is it possible to get the data from before someone tried to create a CSV file out of it?
144 data _null_; 145 infile "&path/&fname" obs=20; 146 input; 147 q=countc(_infile_,'"'); 148 len=lengthn(_infile_); 149 put (_n_ q len) (=) ; 150 run; NOTE: The infile "C:\Downloads/tweet_archives - Marie.csv" is: Filename=C:\Downloads\tweet_archives - Marie.csv, RECFM=V,LRECL=32767,File Size (bytes)=12275, Last Modified=10Jul2020:11:26:23, Create Time=10Jul2020:11:26:22 _N_=1 q=158 len=1132 _N_=2 q=8 len=195 _N_=3 q=0 len=61 _N_=4 q=152 len=336 _N_=5 q=8 len=247 _N_=6 q=152 len=364 _N_=7 q=158 len=616 _N_=8 q=8 len=168 _N_=9 q=0 len=29 _N_=10 q=0 len=50 _N_=11 q=0 len=39 _N_=12 q=152 len=387 _N_=13 q=166 len=651 _N_=14 q=166 len=565 _N_=15 q=158 len=680 _N_=16 q=8 len=253 _N_=17 q=152 len=335 _N_=18 q=166 len=573 _N_=19 q=8 len=216 _N_=20 q=152 len=333 NOTE: 20 records were read from the infile "C:\Downloads/tweet_archives - Marie.csv". The minimum record length was 29. The maximum record length was 1132. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 151 data _null_; 152 infile "&path/&fname" obs=20; 153 input; 154 list; 155 run; NOTE: The infile "C:\Downloads/tweet_archives - Marie.csv" is: Filename=C:\Downloads\tweet_archives - Marie.csv, RECFM=V,LRECL=32767,File Size (bytes)=12275, Last Modified=10Jul2020:11:26:23, Create Time=10Jul2020:11:26:22 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 1 "Identifiant du Tweet,""Permalien du Tweet"",""Texte du Tweet"",""heure"",""impressions"",""engag 98 ements"",""taux d’engagement"",""Retweets"",""réponses"",""J'aime"",""clics sur le profil de l 195 'utilisateur"",""clics sur l'URL"",""clics sur le hashtag"",""ouvertures des détails"",""clics s 292 ur le permalien"",""ouvertures d'application"",""installations de l'application"",""abonnements " 389 ",""envoyer le Tweet par email"",""composer le numéro"",""vues du média"",""engagements avec le 486 média"",""impressions sponsorisé"",""engagements sponsorisé"",""taux d’engagement sponsoris 583 é"",""Retweets sponsorisé"",""réponses sponsorisé"",""J'aime sponsorisé"",""clics sur le pro 680 fil de l'utilisateur sponsorisé"",""clics sur l'URL sponsorisé"",""clics sur le hashtag sponsor 777 isé"",""ouvertures des détails sponsorisé"",""clics sur le permalien sponsorisé"",""ouverture 874 s d'application sponsorisé"",""installations de l'application sponsorisé"",""abonnements spons 971 orisé"",""envoyer le Tweet par email sponsorisé"",""composer le numéro sponsorisé"",""vues du 1068 média sponsorisé"",""engagements avec le média sponsorisé""" 1132 2 "1266244129173893121,""https://twitter.com/ACPPLMP/status/1266244129173893121"",""💰🌂Option 98 Capital Protégé Décembre 2024 III : 🔛 J-8 avant la fin de la commercialisation de cette OCP 195 " 195 3 #UCprotégée #investissement #assurancevie #diversification 61 4 "https://t.co/dbBiwVNaYw https://t.co/lCbVXrlG3o"",""2020-05-29 05:45 +0000"",""608.0"",""12.0"", 98 ""0.019736842105263157"",""2.0"",""0.0"",""2.0"",""0.0"",""1.0"",""0.0"",""6.0"",""0.0"",""0"","" 195 0"",""0"",""0"",""0"",""1"",""1"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""- 292 "",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 336 5 "1265912925774131200,""https://twitter.com/ACPPLMP/status/1265912925774131200"",""💡@phcrevel d 98 u @twittcercle nous éclaire dans ce dossier consacré à « L’épargne des Français en temps 195 de crise » réalisé avec le support d'@amphitea_ " 247 6 "👉 #Epargne #AssuranceVie https://t.co/Q8oNOJtxFP https://t.co/yP6VAdOdya"",""2020-05-28 07:48 98 +0000"",""820.0"",""19.0"",""0.023170731707317073"",""2.0"",""0.0"",""4.0"",""0.0"",""6.0"",""0. 195 0"",""3.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""4"",""4"",""-"",""-"",""-"",""-"",""-"",""-"" 292 ,""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 364 7 "1265573207765114880,""https://twitter.com/ACPPLMP/status/1265573207765114880"",""...Ceci, dans l 98 'intérêt des clients et pour répondre aux demandes de nos partenaires, dans des conditions de RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 195 travail spécifiques et surtout respectueuses de la sécurité pour tous.👩â€.âš•ï¸. #Covid_19 ZONE 7767666277CA66667767267277776772767766776776726626627CA67767CA27677276772F99AE88E99EB822467665332 NUMR 421619C030393969155305403524F54025305345553530450C10339352943900F5204F53E0F1920D2A5F8F033F694F190 292 #télétravail #santé #déconfinement"",""2020-05-27 09:19 +0000"",""667.0"",""6.0"",""0.0089955 389 02248875561"",""1.0"",""0.0"",""3.0"",""0.0"",""0.0"",""0.0"",""2.0"",""0.0"",""0"",""0"",""0""," 486 "0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"","" 583 -"",""-"",""-"",""-"",""-"",""-""" 616 8 "1265573206028685313,""https://twitter.com/ACPPLMP/status/1265573206028685313"",""...à tous les 98 collaborateurs venus sur site depuis le début du confinement.ðŸ™.ðŸ™." 168 ZONE 66666667676777276677277727676266776726626CA6772672666666666672F998F9982 NUMR 3FCC12F2145523065E5303520394504505930C5043925404503FE69E5D5E4E0F9F0F9F2 9 Grâce à eux les opérations 29 10 CHAR ☑ï¸. essentielles à la poursuite de l'activité 50 ZONE E99EB826776676666672CA26627677776762662626676767CA NUMR 281F8F05335E495CC530300C100F52359450450C7134969439 11 CHAR ☑ï¸. réalisables uniquement sur site 39 ZONE E99EB827CA66676666727667766667277727676 NUMR 281F8F02391C9312C5305E9155D5E4035203945 12 "ont pu être traitées. #Covid_19 #télétravail #santé #déconfinement https://t.co/YfhUNYTK9 98 l"",""2020-05-27 09:19 +0000"",""1167.0"",""23.0"",""0.01970865467009426"",""3.0"",""1.0"",""11.0 195 "",""1.0"",""0.0"",""0.0"",""6.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""1"",""1"",""-"",""-"", 292 ""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 387 13 "1265226495749959681,""https://twitter.com/ACPPLMP/status/1265226495749959681"",""""""Il faudra c 98 ependant regarder si les épargnants ont plutôt préféré gonfler leur fonds euros, par sécuri 195 té, ou plutôt investi sur des unités de compte ou des mandats de gestion pour profiter des opp 292 ortunités liées à la fluctuations des marchés.""""#Unitésdecompte 💰"",""2020-05-26 10:21 389 +0000"",""638.0"",""7.0"",""0.0109717868338558"",""1.0"",""0.0"",""2.0"",""0.0"",""0.0"",""2.0"", 486 ""2.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""- RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 583 "",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 651 14 "1265226494357442561,""https://twitter.com/ACPPLMP/status/1265226494357442561"",""""""La collecte 98 nette sur les contrats d’assurance vie devrait être abondante pour le mois d’avril 2020.""" 195 " @phcrevel @gestionfortune #assurancevie #épargne 🪙💰https://t.co/jY4nW5CS1V"",""2020-05-2 292 6 10:21 +0000"",""725.0"",""11.0"",""0.015172413793103448"",""2.0"",""1.0"",""2.0"",""0.0"",""1.0 389 "",""0.0"",""5.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-" 486 ",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 565 15 "1264794578751758337,""https://twitter.com/ACPPLMP/status/1264794578751758337"",""🖌 #ACPPLAMON 98 DIALE #Story 1ï¸.⃣ Depuis plus de 30 ans, nous accompagnons nos partenaires dans la distributio ZONE 44444225767723EB8E8A24677672767726623326672266772666667666667266727677666676726667266266777667766 NUMR 491C50334F2901F8F233045059300C5304503001E3C0EF530133FD017EFE30EF3001245E19253041E30C104934292549F 195 n de solutions patrimoniales adaptées aux besoins de leurs clients. (Re) découvrez notre histoi 292 re 👉https://t.co/sOePG4dTBs #PrendreLaMainSurDemain #AssuranceVie #Epargne https://t.co/rVuYGL 389 ZSZv"",""2020-05-25 05:45 +0000"",""938.0"",""39.0"",""0.04157782515991471"",""5.0"",""0.0"",""6. 486 0"",""0.0"",""5.0"",""1.0"",""21.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""1"",""1"",""-"",""-" 583 ",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"" 680 " 680 16 CHAR "1263007345791172608,""https://twitter.com/ACPPLMP/status/1263007345791172608"",""ðŸ.†L'Alpha Lea ZONE 2333333333333333333322267777322776776726662445544527767772333333333333333333322222F98842467662466 NUMR 21263007345791172608C2284403AFF4794452E3FDF1300CD0F341453F126300734579117260822C220FF6C71C0810C51 98 gue Table 2020 positionne ACPP LA MONDIALE Gestion d'actifs dans le ðŸ”.10 des 330 sociétés de ZONE 67625666623333276767666662445524424444444424677666262667667266672662F999332667233327666CA7CA72662 NUMR 7550412C50202000F3949FEE5013000C10DFE491C5075349FE047134963041E30C500F4D100453033003F393943930450 195 gestion intervenant sur les marchés d’actions en France" 253 17 "https://t.co/1HibRbsocs https://t.co/VR8kGoiUv8"",""2020-05-20 07:23 +0000"",""1017.0"",""7.0"", 98 ""0.00688298918387414"",""1.0"",""0.0"",""2.0"",""0.0"",""1.0"",""0.0"",""3.0"",""0.0"",""0"",""0 195 "",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-" 292 ",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 335 18 "1262620248580263936,""https://twitter.com/ACPPLMP/status/1262620248580263936"",""""""L'un des ef RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+-- 98 fets induits du #Covid-19 a été l'intensification spectaculaire de l'utilisation du digital dan 195 s les entreprises."""" #TransfoNum #Digital 👨â€.💻👩â€.💻 https://t.co/mIdJFmPf5m"",""20 ZONE 72667266776776767222222257667664762246667662F99AE88F99BF99AE88F99B2677773227266264644656362222233 NUMR 30C5305E425029353E222203421E36FE5D03497941C00F1820D0F2B0F1920D0F2B084403AFF4E3FFD94A6D065D22C2220 292 20-05-19 05:45 +0000"",""1255.0"",""12.0"",""0.009561752988047808"",""2.0"",""0.0"",""5.0"",""0.0 389 "",""2.0"",""0.0"",""3.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"","" 486 -"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 573 19 "1262292543426965505,""https://twitter.com/ACPPLMP/status/1262292543426965505"",""Tout concourt à 98 la digitalisation d’une profession qui continue à s’appuyer sur les compétences de ses co 195 nseillers.👨â€.💻" 216 ZONE 6766666772F99AE88F99B2 NUMR E359CC523E0F1820D0F2B2 20 " #coconstruction #digitalbydesign #formation"",""2020-05-18 08:02 +0000"",""1180.0"",""4.0"",""0 98 .003389830508474576"",""1.0"",""0.0"",""1.0"",""0.0"",""0.0"",""0.0"",""2.0"",""0.0"",""0"",""0"" 195 ,""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"", 292 ""-"",""-"",""-"",""-"",""-"",""-"",""-""" 333 NOTE: 20 records were read from the infile "C:\Downloads/tweet_archives - Marie.csv". The minimum record length was 29. The maximum record length was 1132. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
A CSV file is not an XLSX file. If they can send you an XLSX file it would be easier to deal with.
You could try seeing if Excel could do a better job of loading the file, but it does not look like a valid CSV file at all, so I doubt that Excel will have any better luck with it. But if you could get it loaded into a XLSX file then you could read that file with the XLSX engine.
So it looks like something in your process took your original CSV (with the line breaks in it) and added quotes around every line. Those can be removed by either using DEQUOTE() function or reading the lines using DSD option.
But some of the embedded line breaks make lines without any commas and that is making figuring out which line breaks are the real end of line difficult.
Fortunately the first value on each row is a long number, so you can use that fact to do a better job of finding where the real lines start.
So here is a data step to convert your original file into a readable file.
filename fixed temp;
data _null_;
infile original dsd truncover lrecl=32767;
file fixed lrecl=1000000;
input line :$char32767. ;
if notdigit(scan(line,1,','))
and length(scan(line,1,','))>12 then put '|' @;
else put;
len=lengthn(line);
put line $varying32767. len @;
run;
Now you can read the file. You could even use PROC IMPORT to read the file, although the column headers do not make good variable names and PROC IMPORT will make its normal mistakes like treating the first column as a number instead of string.
proc import datafile=fixed dbms=csv out=want replace ;
run;
@SASdevAnneMarie wrote:
Hello Tom,
I have a little question: how SAS interpret the symbol 'I' while proc import.
If I understood correctly the data are not truncated because of 'I', but I can't understand the choice of 'I'.
Thank you very much!
It is not there for SAS to understand, just for humans. I choose that character to replace the end of line characters just in case there was some need to know where they used to be. I picked the pipe character as it is unlikely to be in your actual data. You could use a space or not write anything at all to replace the end of line characters that are being removed.
The issue is you have corrupted file. It has both extra end of line characters embedded in the middle of some of the lines and extra quotes (for some unexplained reason). The purpose of the data step is to create a version of the file that has both issues corrected. Once that is done then the corrected file can be read. One way to read the file is to use PROC IMPORT. But you could also just write your own data step to read the file. You don't have to make a temporary file. You could instead create a permanent file somewhere. One simple way would be to change the FILENAME statement to point to an actual filename that you could write to instead of using the TEMP engine to generate a temporary file.
The data step itself if pretty simple. It reads one line (removing the extra quotes because it is using the DSD option). Then it decides whether it is the start of a new line of data based on the first value on that line. When it is a new record it writes the end of line for the previous record. Otherwise it writes a pipe character to replace the end of line that is going to be removed. Then it writes the line (removing any trailing blanks).
If you have specific questions about the data step please ask.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.