BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
simon_98
Fluorite | Level 6

Hello, I want to import a CSV file using the infile and input procedure. The problem is that the column 'review' is a text surrounded by quotas like ".........." and it contains carriage returns + line breaks and contains comma (',') in the text. Is the a way to import this type of csv file?

 

 

Example of the data:

 

n,recommendationid,language,review
 
0,77060058,english,"Really great game!
 
The story is the most ambitious and inventive campaign ive ever played, it's a must play for anyone who loves Guns, Horses, cowboys, or video games. The Story is emotional and rich and is worth every second. The online portion of this game has gotten better since the original release in 2018;
 
However, there is still a lot to be desired. Rockstar is slowly updating it, but this game is worth it for the story mode alone.",1601870541,1601870541,True,0,0,0,0,True,False,False,76561198262327743,33,9,2737,2737,2737,1601868447
 
 
1,77059990,english,"Red Dead Redemption 2. Is there anything I can say that will suffice in describing this masterpiece??? RDR2 is SUCH a good game. It’s definitely my favorite game. Its price is high for good reason. The storage space it takes is worth it. RDR2 is a game that has such an immersive story line. It’s a long, but **bleep** near perfect game. This is the first game that has made me bawl like a baby. If you get this game, or if you’ve already played it, you’ll understand why I cried... That said, I highly recommend this game. Again, **bleep** near perfect.",1601870400,1601870400,True,0,0,0,0,True,False,False,76561198303413791,48,7,5029,0,5029,1597432775
 
2,77058854,english,"how to eat monkey,

ye s
",1601868148,1601868148,True,0,0,0,0,True,False,False,76561198425061844,6,2,442,442,412,1601869925
1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

That is because the end of line markers in that file are just linefeed, which is the default for Unix files.  Also that file does NOT use semicolon as the delimiters, just normal old commas.

 

So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.

filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";

%replace_crlf(old,new);

proc import datafile=new out=want replace dbms=csv ;
run;

proc means;
run;

Tom_0-1657117947302.png

But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC.  You normally do not need to do arithmetic with ID values.  AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.

 

So just write your own data step to read the file.  You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.

data WORK.WANT    ;
  infile NEW DSD TRUNCOVER firstobs=2 ;
  length
     n 8
     recommendationid 8
     language $7
     review $1248
     timestamp_created 8
     timestamp_updated 8
     voted_up $5
     votes_up 8
     votes_funny 8
     weighted_vote_score 8
     comment_count 8
     steam_purchase $4
     received_for_free $5
     written_during_early_access $5
     author_steamid $20
     author_num_games_owned 8
     author_num_reviews 8
     author_playtime_forever 8
     author_playtime_last_two_weeks 8
     author_playtime_at_review 8
     author_last_played 8
  ;
  format timestamp_created timestamp_updated author_last_played datetime19.
         author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
  ;
  input n -- author_last_played;
run;

proc summary;
  var _numeric_;
  output out=summary  ;
run;

proc print data=summary;
 where _stat_ =: 'M';
 id _stat_;
run;

Results

                                                                                                                votes_    weighted_
_STAT_   _TYPE_   _FREQ_      n      recommendationid     timestamp_created     timestamp_updated   votes_up     funny   vote_score

 MIN        0      11580    0.0000      59294641.00      04DEC2009:18:07:58    23MAY2010:15:42:05       0.00     0.000     0.00000
 MAX        0      11580   99.0000      77060058.00      05OCT2010:04:02:21    05OCT2010:04:02:21    3012.00   339.000     0.96853
 MEAN       0      11580   49.4309      72591600.59      15JUL2010:23:53:39    28JUL2010:00:32:56       2.08     0.353     0.17544

                       author_                         author_                             author_
         comment_    num_games_    author_num_       playtime_    author_playtime_       playtime_
_STAT_     count        owned        reviews           forever     last_two_weeks        at_review     author_last_played

 MIN       0.0000         0.00          1.00           0:00:05           0:00:00           0:00:05     05DEC2009:06:06:35
 MAX      68.0000      9708.00       1886.00          91:16:14           5:29:45          89:16:07     07OCT2010:05:59:45
 MEAN      0.0959       122.64         10.08           2:01:41           0:06:58           1:10:39     29AUG2010:03:06:36

View solution in original post

22 REPLIES 22
Ksharp
Super User

Can you post some real data? It is depend on the pattern that start a new line .

 

data temp;
infile 'c:\temp\have.csv' firstobs=2 length=len ;
input have $varying8000. len;
if missing(have) then delete;
if prxmatch('/^\d{1,},\d{4}/',have) then group+1;
run;
data temp2;
 do until(last.group);
  set temp;
  by group;
  length row $ 30000;
  row=cats(row,have);
 end;
keep group row;
run;
data want;
 set temp2;
n=scan(row,1,',','mq');
recommendationid=scan(row,2,',','mq');
language=scan(row,3,',','mq');
review=scan(row,4,',','mq');
run;
simon_98
Fluorite | Level 6

Of course, I am sorry to not have included it in the first place.

I attached the original csv file.

SASJedi
SAS Super FREQ

Another approach (which will get this in a single pass):

filename readme "c:\temp\have.txt";
data recommendataions;
   length n recommendationid 8 language $20 Recommend $1000;
   drop _: r;
   infile readme dsd _infile_=raw truncover;
   input n recommendationid language:$20. R:$1000.;
   if countc(raw,'"')=2 then do;
      Recommend=r;
      output;
   end;
   else do;
      _first=not (missing(n) and missing(language)); 
      if _first then do until (_n>0);
         input ;
         _n=find(raw,'"');
         if _n=0 then do;
            R=catx(' ',r,raw);
         end;
         else do;
            R=catx(' ',R,substr(raw,1,_n-1));
            if find(r,'"') then Recommend=scan(r,1,'"');
            else Recommend=R;
            output;
            leave;
         end;
      end;
   end;
run;

Result:

Obs n recommendationid language Recommend
1 0 77060058 english Really great game! The story is the most ambitious and inventive campaign ive ever played, it's a must play for anyone who loves Guns, Horses, cowboys, or video games. The Story is emotional and rich and is worth every second. The online portion of this game has gotten better since the original release in 2018; However, there is still a lot to be desired. Rockstar is slowly updating it, but this game is worth it for the story mode alone.
2 1 77059990 english Red Dead Redemption 2. Is there anything I can say that will suffice in describing this masterpiece??? RDR2 is SUCH a good game. It’s definitely my favorite game. Its price is high for good reason. The storage space it takes is worth it. RDR2 is a game that has such an immersive story line. It’s a long, but **bleep** near perfect game. This is the first game that has made me bawl like a baby. If you get this game, or if you’ve already played it, you’ll understand why I cried... That said, I highly recommend this game. Again, **bleep** near perfect.
3 2 77058854 english how to eat monkey ye s
         
Check out my Jedi SAS Tricks for SAS Users
Tom
Super User Tom
Super User

It depends on what exact characters and character combinations are in the quoted strings.

Comma and carriage control are characters, but "line break" is not a character, it is a concept.  SAS expects end of line characters to indicate the end of each line. The INFILE statement lets you use the TERMSTR= option choose between three different possible end of line character strings.  It could be a single linefeed character, '0A'x, a single carriage return character, '0D'x, or a carriage return + line feed combination.  

 

So if the characters that your file is using as the end of line marge is CR+LF and the characters embedded in the strings are only single LF or single CR then just adding TERMSTR=CRLF to your INFILE statement will let you read the file as it is.

 

Note it is hard to tell exactly what you have from text pasted into this forum as the forum editor could be modifying it for its purposes.

 

To see what is in the file look at the actual character codes. The easiest way is to tell SAS to read it as fixed length blocks and dump it to the log using the LIST statement.  So to read the first 500 bytes of the file you could use something like this.

data _null_;
  infile 'myfile.csv' lrecl=100 recfm=f obs=100;
  input;
  list;
run;

If you see that the actual end of line characters and the "line breaks" in the middle of the string are using the same actual characters then you will probably want to pre-process the file to fix that.

You could try this %replace_crlf() macro

 

simon_98
Fluorite | Level 6

When I try to use TERMSTR=CRLF, it does not give me any output at all.

Attached would be the original csv file.

Tom
Super User Tom
Super User

That is because the end of line markers in that file are just linefeed, which is the default for Unix files.  Also that file does NOT use semicolon as the delimiters, just normal old commas.

 

So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.

filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";

%replace_crlf(old,new);

proc import datafile=new out=want replace dbms=csv ;
run;

proc means;
run;

Tom_0-1657117947302.png

But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC.  You normally do not need to do arithmetic with ID values.  AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.

 

So just write your own data step to read the file.  You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.

data WORK.WANT    ;
  infile NEW DSD TRUNCOVER firstobs=2 ;
  length
     n 8
     recommendationid 8
     language $7
     review $1248
     timestamp_created 8
     timestamp_updated 8
     voted_up $5
     votes_up 8
     votes_funny 8
     weighted_vote_score 8
     comment_count 8
     steam_purchase $4
     received_for_free $5
     written_during_early_access $5
     author_steamid $20
     author_num_games_owned 8
     author_num_reviews 8
     author_playtime_forever 8
     author_playtime_last_two_weeks 8
     author_playtime_at_review 8
     author_last_played 8
  ;
  format timestamp_created timestamp_updated author_last_played datetime19.
         author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
  ;
  input n -- author_last_played;
run;

proc summary;
  var _numeric_;
  output out=summary  ;
run;

proc print data=summary;
 where _stat_ =: 'M';
 id _stat_;
run;

Results

                                                                                                                votes_    weighted_
_STAT_   _TYPE_   _FREQ_      n      recommendationid     timestamp_created     timestamp_updated   votes_up     funny   vote_score

 MIN        0      11580    0.0000      59294641.00      04DEC2009:18:07:58    23MAY2010:15:42:05       0.00     0.000     0.00000
 MAX        0      11580   99.0000      77060058.00      05OCT2010:04:02:21    05OCT2010:04:02:21    3012.00   339.000     0.96853
 MEAN       0      11580   49.4309      72591600.59      15JUL2010:23:53:39    28JUL2010:00:32:56       2.08     0.353     0.17544

                       author_                         author_                             author_
         comment_    num_games_    author_num_       playtime_    author_playtime_       playtime_
_STAT_     count        owned        reviews           forever     last_two_weeks        at_review     author_last_played

 MIN       0.0000         0.00          1.00           0:00:05           0:00:00           0:00:05     05DEC2009:06:06:35
 MAX      68.0000      9708.00       1886.00          91:16:14           5:29:45          89:16:07     07OCT2010:05:59:45
 MEAN      0.0959       122.64         10.08           2:01:41           0:06:58           1:10:39     29AUG2010:03:06:36
simon_98
Fluorite | Level 6
Thank You so much, it helped me a lot and it is working now!!
Ksharp
Super User

@Tom

Why in your result , there are some '\n' chatacters but in CSV don't have ?

 

Ksharp_0-1657193863737.png

 

Tom
Super User Tom
Super User

That is the default replacement character that the %replace_crfl() macro used for the embedded linefeed characters.

You can specify something else when you call the macro.  Read the source code for the macro to get more help.

 

%macro replace_crlf
/*----------------------------------------------------------------------------
Replace carriage return or linefeed characters that are inside quotes
----------------------------------------------------------------------------*/
(infile   /* Fileref or quoted physical name of input file */
,outfile  /* Fileref or quoted physical name of output file */
,cr='\r'  /* Replacement string for carriage return */
,lf='\n'  /* Replacement string for linefeed */
);
jimbobob
Quartz | Level 8

@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro. 

Tom
Super User Tom
Super User

@jimbobob wrote:

@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro. 


The easiest way is to make a dataset with the names of the files and use that to generate a separate macro call for each file.

There are plenty of postings on the site (and even some examples in the SAS manuals, see DREAD() function) for how to make the list.

data _null_;
  set list ;
  call execute(cats('%nrstr(replace_crlf)(',quote(trim(filename)),',',quote(cats(filename,'.fixed')),')'));
run;
Ksharp
Super User

@Tom  ,

It seems that your macro ONLY can handle the '0D0A'  in the quote,but what if they are not in quote(like the following picture) ?

I run your macro and get ERROR. but mine don't.

 

Ksharp_0-1663071194625.pngKsharp_1-1663071253194.png

 

 

 

 

Tom
Super User Tom
Super User

Yes. 

If the line breaks are not in quotes deciding whether or not there are extra line breaks is an impossible task.

You can make it work if you make other assumptions about the file.  Things like, the delimiter will never appear as part of the data.  The extra line breaks will appear in only specific field(s).

Ksharp
Super User

Ha, @Tom 

I run your code by Unicode SAS and get no ERROR,but get wrong result !

 

Ksharp_0-1663071540298.png

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 8289 views
  • 8 likes
  • 6 in conversation