BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
SASdevAnneMarie
Barite | Level 11

Hello,

 

I need to import the csv file, my second variable (start and end in yellow) has a line break :

MarieT_0-1594390661295.png

I use the option:  dlm='","'||',"';
infile "XXX\tweet_archives_fr.csv" delimiter=dlm
missover lrecl=250 firstobs=2 ;

 

With this options my second variable is truncated.

Do you know please the option to read correctly this variable ?

Thank you!

 

Best regards,

Marie

  

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So it looks like something in your process took your original CSV (with the line breaks in it) and added quotes around every line.  Those can be removed by either using DEQUOTE() function or reading the lines using DSD option.

But some of the embedded line breaks make lines without any commas and that is making figuring out which line breaks are the real end of line difficult.

Fortunately the first value on each row is a long number, so you can use that fact to do a better job of finding where the real lines start.

So here is a data step to convert your original file into a readable file.

filename fixed temp;
data _null_;
   infile original dsd truncover lrecl=32767;
   file fixed lrecl=1000000;
   input line :$char32767. ;
   if notdigit(scan(line,1,','))
      and length(scan(line,1,','))>12 then put '|' @;
   else put;
   len=lengthn(line);
   put line $varying32767. len @;
run;

Now you can read the file.  You could even use PROC IMPORT to read the file, although the column headers do not make good variable names and PROC IMPORT will make its normal mistakes like treating the first column as a number instead of string.

proc import datafile=fixed dbms=csv out=want replace ;
run;

image.png

View solution in original post

17 REPLIES 17
Tom
Super User Tom
Super User

Hard to tell from a photograph of your text, but in general embedded end of line characters in a field is a problem for SAS.  There are some work arounds. 

The easiest is if the characters in your field are not exactly the same as those used to mark the real end of lines.   So if the characters are either CR or LF and the end of line if CR and LF then you can just add TERMSTR=CRLF to your INFILE statement.

 

Otherwise you will either need to make a more complex solution.

For example if the fields that have end of line characters in them are quoted you can use that fact to find them and change them into some other character.  For safety I would do this by making a copy of the file.

filename fixed temp;
data _null_;
   infile "XXX\tweet_archives_fr.csv";
   file fixed;
   input;
   q+countc(_infile_,'"');
   q=mod(q,2);
   if _n_>1 and not q then put;
   put _infile_ @;
run;
....
infile fixed dsd dlm=',' truncover firstobs=2;
SASdevAnneMarie
Barite | Level 11

The data is attached.

Thank you

Tom
Super User Tom
Super User

I don't really see a pattern in that file.  It does appear to have way too many quotes. 

Here is summary and dump of the first 20 records.

 

Is it possible to get the data from before someone tried to create a CSV file out of it?

144   data _null_;
145     infile "&path/&fname" obs=20;
146     input;
147     q=countc(_infile_,'"');
148     len=lengthn(_infile_);
149     put (_n_ q len) (=) ;
150   run;

NOTE: The infile "C:\Downloads/tweet_archives - Marie.csv" is:
      Filename=C:\Downloads\tweet_archives - Marie.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=12275,
      Last Modified=10Jul2020:11:26:23,
      Create Time=10Jul2020:11:26:22

_N_=1 q=158 len=1132
_N_=2 q=8 len=195
_N_=3 q=0 len=61
_N_=4 q=152 len=336
_N_=5 q=8 len=247
_N_=6 q=152 len=364
_N_=7 q=158 len=616
_N_=8 q=8 len=168
_N_=9 q=0 len=29
_N_=10 q=0 len=50
_N_=11 q=0 len=39
_N_=12 q=152 len=387
_N_=13 q=166 len=651
_N_=14 q=166 len=565
_N_=15 q=158 len=680
_N_=16 q=8 len=253
_N_=17 q=152 len=335
_N_=18 q=166 len=573
_N_=19 q=8 len=216
_N_=20 q=152 len=333
NOTE: 20 records were read from the infile "C:\Downloads/tweet_archives - Marie.csv".
      The minimum record length was 29.
      The maximum record length was 1132.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


151   data _null_;
152     infile "&path/&fname" obs=20;
153     input;
154     list;
155   run;

NOTE: The infile "C:\Downloads/tweet_archives - Marie.csv" is:
      Filename=C:\Downloads\tweet_archives - Marie.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=12275,
      Last Modified=10Jul2020:11:26:23,
      Create Time=10Jul2020:11:26:22

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
1         "Identifiant du Tweet,""Permalien du Tweet"",""Texte du Tweet"",""heure"",""impressions"",""engag
      98  ements"",""taux d’engagement"",""Retweets"",""réponses"",""J'aime"",""clics sur le profil de l
     195  'utilisateur"",""clics sur l'URL"",""clics sur le hashtag"",""ouvertures des détails"",""clics s
     292  ur le permalien"",""ouvertures d'application"",""installations de l'application"",""abonnements "
     389  ",""envoyer le Tweet par email"",""composer le numéro"",""vues du média"",""engagements avec le
     486   média"",""impressions sponsorisé"",""engagements sponsorisé"",""taux d’engagement sponsoris
     583  é"",""Retweets sponsorisé"",""réponses sponsorisé"",""J'aime sponsorisé"",""clics sur le pro
     680  fil de l'utilisateur sponsorisé"",""clics sur l'URL sponsorisé"",""clics sur le hashtag sponsor
     777  isé"",""ouvertures des détails sponsorisé"",""clics sur le permalien sponsorisé"",""ouverture
     874  s d'application sponsorisé"",""installations de l'application sponsorisé"",""abonnements  spons
     971  orisé"",""envoyer le Tweet par email sponsorisé"",""composer le numéro sponsorisé"",""vues du
    1068   média sponsorisé"",""engagements avec le média sponsorisé""" 1132
2         "1266244129173893121,""https://twitter.com/ACPPLMP/status/1266244129173893121"",""💰🌂Option
      98  Capital Protégé Décembre 2024 III : 🔛 J-8 avant la fin de la commercialisation de cette OCP
     195  " 195
3          #UCprotégée #investissement #assurancevie #diversification 61
4         "https://t.co/dbBiwVNaYw https://t.co/lCbVXrlG3o"",""2020-05-29 05:45 +0000"",""608.0"",""12.0"",
      98  ""0.019736842105263157"",""2.0"",""0.0"",""2.0"",""0.0"",""1.0"",""0.0"",""6.0"",""0.0"",""0"",""
     195  0"",""0"",""0"",""0"",""1"",""1"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-
     292  "",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 336
5         "1265912925774131200,""https://twitter.com/ACPPLMP/status/1265912925774131200"",""💡@phcrevel d
      98  u @twittcercle nous éclaire dans ce dossier  consacré à « L’épargne des Français en temps
     195   de crise » réalisé avec le support d'@amphitea_ " 247
6         "👉 #Epargne #AssuranceVie https://t.co/Q8oNOJtxFP https://t.co/yP6VAdOdya"",""2020-05-28 07:48
      98   +0000"",""820.0"",""19.0"",""0.023170731707317073"",""2.0"",""0.0"",""4.0"",""0.0"",""6.0"",""0.
     195  0"",""3.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""4"",""4"",""-"",""-"",""-"",""-"",""-"",""-""
     292  ,""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 364
7         "1265573207765114880,""https://twitter.com/ACPPLMP/status/1265573207765114880"",""...Ceci, dans l
      98  'intérêt des clients et pour répondre aux demandes de nos partenaires, dans des conditions de


RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--

     195  travail spécifiques et surtout respectueuses de la sécurité pour tous.👩â€.âš•ï¸. #Covid_19
    ZONE  7767666277CA66667767267277776772767766776776726626627CA67767CA27677276772F99AE88E99EB822467665332
    NUMR  421619C030393969155305403524F54025305345553530450C10339352943900F5204F53E0F1920D2A5F8F033F694F190
     292  #télétravail #santé #déconfinement"",""2020-05-27 09:19 +0000"",""667.0"",""6.0"",""0.0089955
     389  02248875561"",""1.0"",""0.0"",""3.0"",""0.0"",""0.0"",""0.0"",""2.0"",""0.0"",""0"",""0"",""0"","
     486  "0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""
     583  -"",""-"",""-"",""-"",""-"",""-""" 616
8         "1265573206028685313,""https://twitter.com/ACPPLMP/status/1265573206028685313"",""...à tous les

      98  collaborateurs venus sur site depuis le début du confinement.ðŸ™.ðŸ™." 168
    ZONE  66666667676777276677277727676266776726626CA6772672666666666672F998F9982
    NUMR  3FCC12F2145523065E5303520394504505930C5043925404503FE69E5D5E4E0F9F0F9F2
9         Grâce à eux les opérations 29

10  CHAR  ☑ï¸. essentielles à la poursuite de l'activité 50
    ZONE  E99EB826776676666672CA26627677776762662626676767CA
    NUMR  281F8F05335E495CC530300C100F52359450450C7134969439

11  CHAR  ☑ï¸. réalisables uniquement sur site 39
    ZONE  E99EB827CA66676666727667766667277727676
    NUMR  281F8F02391C9312C5305E9155D5E4035203945
12        "ont pu être traitées.  #Covid_19 #télétravail #santé #déconfinement https://t.co/YfhUNYTK9
      98  l"",""2020-05-27 09:19 +0000"",""1167.0"",""23.0"",""0.01970865467009426"",""3.0"",""1.0"",""11.0
     195  "",""1.0"",""0.0"",""0.0"",""6.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""1"",""1"",""-"",""-"",
     292  ""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 387
13        "1265226495749959681,""https://twitter.com/ACPPLMP/status/1265226495749959681"",""""""Il faudra c
      98  ependant regarder si les épargnants ont plutôt préféré gonfler leur fonds euros, par sécuri
     195  té, ou plutôt investi sur des unités de compte ou des mandats de gestion pour profiter des opp
     292  ortunités liées à la fluctuations des marchés.""""#Unitésdecompte 💰"",""2020-05-26 10:21
     389  +0000"",""638.0"",""7.0"",""0.0109717868338558"",""1.0"",""0.0"",""2.0"",""0.0"",""0.0"",""2.0"",
     486  ""2.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
     583  "",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 651
14        "1265226494357442561,""https://twitter.com/ACPPLMP/status/1265226494357442561"",""""""La collecte
      98   nette sur les contrats d’assurance vie devrait être abondante pour le mois d’avril 2020."""
     195  " @phcrevel @gestionfortune #assurancevie #épargne 🪙💰https://t.co/jY4nW5CS1V"",""2020-05-2
     292  6 10:21 +0000"",""725.0"",""11.0"",""0.015172413793103448"",""2.0"",""1.0"",""2.0"",""0.0"",""1.0
     389  "",""0.0"",""5.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"
     486  ",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 565
15        "1264794578751758337,""https://twitter.com/ACPPLMP/status/1264794578751758337"",""🖌 #ACPPLAMON

      98  DIALE #Story 1ï¸.⃣ Depuis plus de 30 ans, nous accompagnons nos partenaires dans la distributio
    ZONE  44444225767723EB8E8A24677672767726623326672266772666667666667266727677666676726667266266777667766
    NUMR  491C50334F2901F8F233045059300C5304503001E3C0EF530133FD017EFE30EF3001245E19253041E30C104934292549F
     195  n de solutions patrimoniales adaptées aux besoins de leurs clients. (Re) découvrez notre histoi
     292  re 👉https://t.co/sOePG4dTBs #PrendreLaMainSurDemain #AssuranceVie #Epargne https://t.co/rVuYGL
     389  ZSZv"",""2020-05-25 05:45 +0000"",""938.0"",""39.0"",""0.04157782515991471"",""5.0"",""0.0"",""6.
     486  0"",""0.0"",""5.0"",""1.0"",""21.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""1"",""1"",""-"",""-"
     583  ",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""
     680  " 680

16  CHAR  "1263007345791172608,""https://twitter.com/ACPPLMP/status/1263007345791172608"",""ðŸ.†L'Alpha Lea
    ZONE  2333333333333333333322267777322776776726662445544527767772333333333333333333322222F98842467662466
    NUMR  21263007345791172608C2284403AFF4794452E3FDF1300CD0F341453F126300734579117260822C220FF6C71C0810C51

      98  gue Table 2020 positionne ACPP LA MONDIALE Gestion d'actifs dans le ðŸ”.10 des 330 sociétés de
    ZONE  67625666623333276767666662445524424444444424677666262667667266672662F999332667233327666CA7CA72662
    NUMR  7550412C50202000F3949FEE5013000C10DFE491C5075349FE047134963041E30C500F4D100453033003F393943930450
     195  gestion intervenant sur les marchés d’actions en France" 253
17        "https://t.co/1HibRbsocs https://t.co/VR8kGoiUv8"",""2020-05-20 07:23 +0000"",""1017.0"",""7.0"",
      98  ""0.00688298918387414"",""1.0"",""0.0"",""2.0"",""0.0"",""1.0"",""0.0"",""3.0"",""0.0"",""0"",""0
     195  "",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"
     292  ",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 335
18        "1262620248580263936,""https://twitter.com/ACPPLMP/status/1262620248580263936"",""""""L'un des ef
RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
      98  fets induits du #Covid-19 a été l'intensification spectaculaire de l'utilisation du digital dan

     195  s les entreprises."""" #TransfoNum #Digital 👨â€.💻👩â€.💻 https://t.co/mIdJFmPf5m"",""20
    ZONE  72667266776776767222222257667664762246667662F99AE88F99BF99AE88F99B2677773227266264644656362222233
    NUMR  30C5305E425029353E222203421E36FE5D03497941C00F1820D0F2B0F1920D0F2B084403AFF4E3FFD94A6D065D22C2220
     292  20-05-19 05:45 +0000"",""1255.0"",""12.0"",""0.009561752988047808"",""2.0"",""0.0"",""5.0"",""0.0
     389  "",""2.0"",""0.0"",""3.0"",""0.0"",""0"",""0"",""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""
     486  -"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-""" 573
19        "1262292543426965505,""https://twitter.com/ACPPLMP/status/1262292543426965505"",""Tout concourt Ã
      98    la digitalisation d’une profession qui continue à s’appuyer sur les compétences de ses co

     195  nseillers.👨â€.💻" 216
    ZONE  6766666772F99AE88F99B2
    NUMR  E359CC523E0F1820D0F2B2
20        " #coconstruction #digitalbydesign #formation"",""2020-05-18 08:02 +0000"",""1180.0"",""4.0"",""0
      98  .003389830508474576"",""1.0"",""0.0"",""1.0"",""0.0"",""0.0"",""0.0"",""2.0"",""0.0"",""0"",""0""
     195  ,""0"",""0"",""0"",""0"",""0"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",""-"",
     292  ""-"",""-"",""-"",""-"",""-"",""-"",""-""" 333
NOTE: 20 records were read from the infile "C:\Downloads/tweet_archives - Marie.csv".
      The minimum record length was 29.
      The maximum record length was 1132.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
SASdevAnneMarie
Barite | Level 11
Thank you Tom! I can't have the data, I must use this file 😞
SASdevAnneMarie
Barite | Level 11
Hello!
Do you know some options to import this csv file with Libname XLSX?

Thank you!
Tom
Super User Tom
Super User

A CSV file is not an XLSX file.  If they can send you an XLSX file it would be easier to deal with.

You could try seeing if Excel could do a better job of loading the file, but it does not look like a valid CSV file at all, so I doubt that Excel will have any better luck with it.  But if you could get it loaded into a XLSX file then you could read that file with the XLSX engine.

SASdevAnneMarie
Barite | Level 11
Thank you Tom!
Unfortunately I can’t have the data in xlsx format. 😞
Tom
Super User Tom
Super User

So it looks like something in your process took your original CSV (with the line breaks in it) and added quotes around every line.  Those can be removed by either using DEQUOTE() function or reading the lines using DSD option.

But some of the embedded line breaks make lines without any commas and that is making figuring out which line breaks are the real end of line difficult.

Fortunately the first value on each row is a long number, so you can use that fact to do a better job of finding where the real lines start.

So here is a data step to convert your original file into a readable file.

filename fixed temp;
data _null_;
   infile original dsd truncover lrecl=32767;
   file fixed lrecl=1000000;
   input line :$char32767. ;
   if notdigit(scan(line,1,','))
      and length(scan(line,1,','))>12 then put '|' @;
   else put;
   len=lengthn(line);
   put line $varying32767. len @;
run;

Now you can read the file.  You could even use PROC IMPORT to read the file, although the column headers do not make good variable names and PROC IMPORT will make its normal mistakes like treating the first column as a number instead of string.

proc import datafile=fixed dbms=csv out=want replace ;
run;

image.png

SASdevAnneMarie
Barite | Level 11
Thank you very much Tom!
Yes, some application add the quote to my CSV file.

SASdevAnneMarie
Barite | Level 11
Hello Tom,
I have a little question: how SAS interpret the symbol 'I' while proc import.
If I understood correctly the data are not truncated because of 'I', but I can't understand the choice of 'I'.

Thank you very much!
Tom
Super User Tom
Super User

@SASdevAnneMarie wrote:
Hello Tom,
I have a little question: how SAS interpret the symbol 'I' while proc import.
If I understood correctly the data are not truncated because of 'I', but I can't understand the choice of 'I'.

Thank you very much!

It is not there for SAS to understand, just for humans.  I choose that character to replace the end of line characters just in case there was some need to know where they used to be. I picked the pipe character as it is unlikely to be in your actual data. You could use a space or not write anything at all to replace the end of line characters that are being removed.

SASdevAnneMarie
Barite | Level 11
Thank you very much!
I understood 🙂
SASdevAnneMarie
Barite | Level 11
Sorry, I still can't understand 🙂 why using filename fixed temp (after put the data to this file), proc import works correctly. :), but it doesn't works with csv.

Thank you!
Tom
Super User Tom
Super User

The issue is you have corrupted file.  It has both extra end of line characters embedded in the middle of some of the lines and extra quotes (for some unexplained reason).  The purpose of the data step is to create a version of the file that has both issues corrected.  Once that is done then the corrected file can be read.  One way to read the file is to use PROC IMPORT.  But you could also just write your own data step to read the file.  You don't have to make a temporary file.  You could instead create a permanent file somewhere.  One simple way would be to change the FILENAME statement to point to an actual filename that you could write to instead of using the TEMP engine to generate a temporary file.

 

The data step itself if pretty simple.  It reads one line (removing the extra quotes because it is using the DSD option).  Then it decides whether it is the start of a new line of data based on the first value on that line.  When it is a new record it writes the end of line for the previous record.  Otherwise it writes a pipe character to replace the end of line that is going to be removed.  Then it writes the line (removing any trailing blanks).

 

If you have specific questions about the data step please ask.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 17 replies
  • 5441 views
  • 2 likes
  • 2 in conversation