Solved: Re: Input a CSV file with text fields that contain line breaks + carri...

simon_98 · Posted 07-05-2022 10:03 AM

Hello, I want to import a CSV file using the infile and input procedure. The problem is that the column 'review' is a text surrounded by quotas like ".........." and it contains carriage returns + line breaks and contains comma (',') in the text. Is the a way to import this type of csv file?

Example of the data:

n,recommendationid,language,review

0,77060058,english,"Really great game!

The story is the most ambitious and inventive campaign ive ever played, it's a must play for anyone who loves Guns, Horses, cowboys, or video games. The Story is emotional and rich and is worth every second. The online portion of this game has gotten better since the original release in 2018;

However, there is still a lot to be desired. Rockstar is slowly updating it, but this game is worth it for the story mode alone.",1601870541,1601870541,True,0,0,0,0,True,False,False,76561198262327743,33,9,2737,2737,2737,1601868447

1,77059990,english,"Red Dead Redemption 2. Is there anything I can say that will suffice in describing this masterpiece??? RDR2 is SUCH a good game. It’s definitely my favorite game. Its price is high for good reason. The storage space it takes is worth it. RDR2 is a game that has such an immersive story line. It’s a long, but **bleep** near perfect game. This is the first game that has made me bawl like a baby. If you get this game, or if you’ve already played it, you’ll understand why I cried... That said, I highly recommend this game. Again, **bleep** near perfect.",1601870400,1601870400,True,0,0,0,0,True,False,False,76561198303413791,48,7,5029,0,5029,1597432775

2,77058854,english,"how to eat monkey,

ye s

",1601868148,1601868148,True,0,0,0,0,True,False,False,76561198425061844,6,2,442,442,412,1601869925

Tom · Posted 07-06-2022 10:32 AM

That is because the end of line markers in that file are just linefeed, which is the default for Unix files. Also that file does NOT use semicolon as the delimiters, just normal old commas.

So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.

filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";

%replace_crlf(old,new);

proc import datafile=new out=want replace dbms=csv ;
run;

proc means;
run;

But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC. You normally do not need to do arithmetic with ID values. AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.

So just write your own data step to read the file. You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.

data WORK.WANT    ;
  infile NEW DSD TRUNCOVER firstobs=2 ;
  length
     n 8
     recommendationid 8
     language $7
     review $1248
     timestamp_created 8
     timestamp_updated 8
     voted_up $5
     votes_up 8
     votes_funny 8
     weighted_vote_score 8
     comment_count 8
     steam_purchase $4
     received_for_free $5
     written_during_early_access $5
     author_steamid $20
     author_num_games_owned 8
     author_num_reviews 8
     author_playtime_forever 8
     author_playtime_last_two_weeks 8
     author_playtime_at_review 8
     author_last_played 8
  ;
  format timestamp_created timestamp_updated author_last_played datetime19.
         author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
  ;
  input n -- author_last_played;
run;

proc summary;
  var _numeric_;
  output out=summary  ;
run;

proc print data=summary;
 where _stat_ =: 'M';
 id _stat_;
run;

Results

                                                                                                                votes_    weighted_
_STAT_   _TYPE_   _FREQ_      n      recommendationid     timestamp_created     timestamp_updated   votes_up     funny   vote_score

 MIN        0      11580    0.0000      59294641.00      04DEC2009:18:07:58    23MAY2010:15:42:05       0.00     0.000     0.00000
 MAX        0      11580   99.0000      77060058.00      05OCT2010:04:02:21    05OCT2010:04:02:21    3012.00   339.000     0.96853
 MEAN       0      11580   49.4309      72591600.59      15JUL2010:23:53:39    28JUL2010:00:32:56       2.08     0.353     0.17544

                       author_                         author_                             author_
         comment_    num_games_    author_num_       playtime_    author_playtime_       playtime_
_STAT_     count        owned        reviews           forever     last_two_weeks        at_review     author_last_played

 MIN       0.0000         0.00          1.00           0:00:05           0:00:00           0:00:05     05DEC2009:06:06:35
 MAX      68.0000      9708.00       1886.00          91:16:14           5:29:45          89:16:07     07OCT2010:05:59:45
 MEAN      0.0959       122.64         10.08           2:01:41           0:06:58           1:10:39     29AUG2010:03:06:36

View solution in original post

Ksharp · Posted 07-06-2022 08:25 AM

Can you post some real data? It is depend on the pattern that start a new line .

data temp;
infile 'c:\temp\have.csv' firstobs=2 length=len ;
input have $varying8000. len;
if missing(have) then delete;
if prxmatch('/^\d{1,},\d{4}/',have) then group+1;
run;
data temp2;
 do until(last.group);
  set temp;
  by group;
  length row $ 30000;
  row=cats(row,have);
 end;
keep group row;
run;
data want;
 set temp2;
n=scan(row,1,',','mq');
recommendationid=scan(row,2,',','mq');
language=scan(row,3,',','mq');
review=scan(row,4,',','mq');
run;

simon_98 · Posted 07-06-2022 10:05 AM

Of course, I am sorry to not have included it in the first place.

I attached the original csv file.

SASJedi · Posted 07-06-2022 09:02 AM

Another approach (which will get this in a single pass):

filename readme "c:\temp\have.txt";
data recommendataions;
   length n recommendationid 8 language $20 Recommend $1000;
   drop _: r;
   infile readme dsd _infile_=raw truncover;
   input n recommendationid language:$20. R:$1000.;
   if countc(raw,'"')=2 then do;
      Recommend=r;
      output;
   end;
   else do;
      _first=not (missing(n) and missing(language)); 
      if _first then do until (_n>0);
         input ;
         _n=find(raw,'"');
         if _n=0 then do;
            R=catx(' ',r,raw);
         end;
         else do;
            R=catx(' ',R,substr(raw,1,_n-1));
            if find(r,'"') then Recommend=scan(r,1,'"');
            else Recommend=R;
            output;
            leave;
         end;
      end;
   end;
run;

Result:

Obs	n	recommendationid	language	Recommend
1	0	77060058	english	Really great game! The story is the most ambitious and inventive campaign ive ever played, it's a must play for anyone who loves Guns, Horses, cowboys, or video games. The Story is emotional and rich and is worth every second. The online portion of this game has gotten better since the original release in 2018; However, there is still a lot to be desired. Rockstar is slowly updating it, but this game is worth it for the story mode alone.
2	1	77059990	english	Red Dead Redemption 2. Is there anything I can say that will suffice in describing this masterpiece??? RDR2 is SUCH a good game. It’s definitely my favorite game. Its price is high for good reason. The storage space it takes is worth it. RDR2 is a game that has such an immersive story line. It’s a long, but bleep near perfect game. This is the first game that has made me bawl like a baby. If you get this game, or if you’ve already played it, you’ll understand why I cried... That said, I highly recommend this game. Again, bleep near perfect.
3	2	77058854	english	how to eat monkey ye s

Check out my Jedi SAS Tricks for SAS Users

Tom · Posted 07-06-2022 09:17 AM

It depends on what exact characters and character combinations are in the quoted strings.

Comma and carriage control are characters, but "line break" is not a character, it is a concept. SAS expects end of line characters to indicate the end of each line. The INFILE statement lets you use the TERMSTR= option choose between three different possible end of line character strings. It could be a single linefeed character, '0A'x, a single carriage return character, '0D'x, or a carriage return + line feed combination.

So if the characters that your file is using as the end of line marge is CR+LF and the characters embedded in the strings are only single LF or single CR then just adding TERMSTR=CRLF to your INFILE statement will let you read the file as it is.

Note it is hard to tell exactly what you have from text pasted into this forum as the forum editor could be modifying it for its purposes.

To see what is in the file look at the actual character codes. The easiest way is to tell SAS to read it as fixed length blocks and dump it to the log using the LIST statement. So to read the first 500 bytes of the file you could use something like this.

data _null_;
  infile 'myfile.csv' lrecl=100 recfm=f obs=100;
  input;
  list;
run;

If you see that the actual end of line characters and the "line breaks" in the middle of the string are using the same actual characters then you will probably want to pre-process the file to fix that.

You could try this %replace_crlf() macro

simon_98 · Posted 07-06-2022 10:14 AM

When I try to use TERMSTR=CRLF, it does not give me any output at all.

Attached would be the original csv file.

Tom · Posted 07-06-2022 10:32 AM

That is because the end of line markers in that file are just linefeed, which is the default for Unix files. Also that file does NOT use semicolon as the delimiters, just normal old commas.

So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.

filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";

%replace_crlf(old,new);

proc import datafile=new out=want replace dbms=csv ;
run;

proc means;
run;

But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC. You normally do not need to do arithmetic with ID values. AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.

So just write your own data step to read the file. You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.

data WORK.WANT    ;
  infile NEW DSD TRUNCOVER firstobs=2 ;
  length
     n 8
     recommendationid 8
     language $7
     review $1248
     timestamp_created 8
     timestamp_updated 8
     voted_up $5
     votes_up 8
     votes_funny 8
     weighted_vote_score 8
     comment_count 8
     steam_purchase $4
     received_for_free $5
     written_during_early_access $5
     author_steamid $20
     author_num_games_owned 8
     author_num_reviews 8
     author_playtime_forever 8
     author_playtime_last_two_weeks 8
     author_playtime_at_review 8
     author_last_played 8
  ;
  format timestamp_created timestamp_updated author_last_played datetime19.
         author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
  ;
  input n -- author_last_played;
run;

proc summary;
  var _numeric_;
  output out=summary  ;
run;

proc print data=summary;
 where _stat_ =: 'M';
 id _stat_;
run;

Results

                                                                                                                votes_    weighted_
_STAT_   _TYPE_   _FREQ_      n      recommendationid     timestamp_created     timestamp_updated   votes_up     funny   vote_score

 MIN        0      11580    0.0000      59294641.00      04DEC2009:18:07:58    23MAY2010:15:42:05       0.00     0.000     0.00000
 MAX        0      11580   99.0000      77060058.00      05OCT2010:04:02:21    05OCT2010:04:02:21    3012.00   339.000     0.96853
 MEAN       0      11580   49.4309      72591600.59      15JUL2010:23:53:39    28JUL2010:00:32:56       2.08     0.353     0.17544

                       author_                         author_                             author_
         comment_    num_games_    author_num_       playtime_    author_playtime_       playtime_
_STAT_     count        owned        reviews           forever     last_two_weeks        at_review     author_last_played

 MIN       0.0000         0.00          1.00           0:00:05           0:00:00           0:00:05     05DEC2009:06:06:35
 MAX      68.0000      9708.00       1886.00          91:16:14           5:29:45          89:16:07     07OCT2010:05:59:45
 MEAN      0.0959       122.64         10.08           2:01:41           0:06:58           1:10:39     29AUG2010:03:06:36

simon_98 · Posted 07-06-2022 04:11 PM

Thank You so much, it helped me a lot and it is working now!!

Ksharp · Posted 07-07-2022 07:37 AM

@Tom

Why in your result , there are some '\n' chatacters but in CSV don't have ?

Tom · Posted 07-07-2022 09:23 AM

That is the default replacement character that the %replace_crfl() macro used for the embedded linefeed characters.

You can specify something else when you call the macro. Read the source code for the macro to get more help.

%macro replace_crlf
/*----------------------------------------------------------------------------
Replace carriage return or linefeed characters that are inside quotes
----------------------------------------------------------------------------*/
(infile   /* Fileref or quoted physical name of input file */
,outfile  /* Fileref or quoted physical name of output file */
,cr='\r'  /* Replacement string for carriage return */
,lf='\n'  /* Replacement string for linefeed */
);

jimbobob · Posted 09-12-2022 04:46 PM

@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro.

Tom · Posted 09-12-2022 07:05 PM

@jimbobob wrote:

@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro.

The easiest way is to make a dataset with the names of the files and use that to generate a separate macro call for each file.

There are plenty of postings on the site (and even some examples in the SAS manuals, see DREAD() function) for how to make the list.

data _null_;
  set list ;
  call execute(cats('%nrstr(replace_crlf)(',quote(trim(filename)),',',quote(cats(filename,'.fixed')),')'));
run;

Ksharp · Posted 09-13-2022 08:14 AM

@Tom ,

It seems that your macro ONLY can handle the '0D0A' in the quote,but what if they are not in quote(like the following picture) ?

I run your macro and get ERROR. but mine don't.

Tom · Posted 09-13-2022 08:51 AM

Yes.

If the line breaks are not in quotes deciding whether or not there are extra line breaks is an impossible task.

You can make it work if you make other assumptions about the file. Things like, the delimiter will never appear as part of the data. The extra line breaks will appear in only specific field(s).

Ksharp · Posted 09-13-2022 08:18 AM

Ha, @Tom

I run your code by Unicode SAS and get no ERROR,but get wrong result !

Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Re: Input a CSV file with text fields that contain line breaks + carriage returns and ',' signs

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!