Hello, I want to import a CSV file using the infile and input procedure. The problem is that the column 'review' is a text surrounded by quotas like ".........." and it contains carriage returns + line breaks and contains comma (',') in the text. Is the a way to import this type of csv file?
Example of the data:
That is because the end of line markers in that file are just linefeed, which is the default for Unix files. Also that file does NOT use semicolon as the delimiters, just normal old commas.
So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.
filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";
%replace_crlf(old,new);
proc import datafile=new out=want replace dbms=csv ;
run;
proc means;
run;
But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC. You normally do not need to do arithmetic with ID values. AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.
So just write your own data step to read the file. You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.
data WORK.WANT ;
infile NEW DSD TRUNCOVER firstobs=2 ;
length
n 8
recommendationid 8
language $7
review $1248
timestamp_created 8
timestamp_updated 8
voted_up $5
votes_up 8
votes_funny 8
weighted_vote_score 8
comment_count 8
steam_purchase $4
received_for_free $5
written_during_early_access $5
author_steamid $20
author_num_games_owned 8
author_num_reviews 8
author_playtime_forever 8
author_playtime_last_two_weeks 8
author_playtime_at_review 8
author_last_played 8
;
format timestamp_created timestamp_updated author_last_played datetime19.
author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
;
input n -- author_last_played;
run;
proc summary;
var _numeric_;
output out=summary ;
run;
proc print data=summary;
where _stat_ =: 'M';
id _stat_;
run;
Results
votes_ weighted_ _STAT_ _TYPE_ _FREQ_ n recommendationid timestamp_created timestamp_updated votes_up funny vote_score MIN 0 11580 0.0000 59294641.00 04DEC2009:18:07:58 23MAY2010:15:42:05 0.00 0.000 0.00000 MAX 0 11580 99.0000 77060058.00 05OCT2010:04:02:21 05OCT2010:04:02:21 3012.00 339.000 0.96853 MEAN 0 11580 49.4309 72591600.59 15JUL2010:23:53:39 28JUL2010:00:32:56 2.08 0.353 0.17544 author_ author_ author_ comment_ num_games_ author_num_ playtime_ author_playtime_ playtime_ _STAT_ count owned reviews forever last_two_weeks at_review author_last_played MIN 0.0000 0.00 1.00 0:00:05 0:00:00 0:00:05 05DEC2009:06:06:35 MAX 68.0000 9708.00 1886.00 91:16:14 5:29:45 89:16:07 07OCT2010:05:59:45 MEAN 0.0959 122.64 10.08 2:01:41 0:06:58 1:10:39 29AUG2010:03:06:36
Can you post some real data? It is depend on the pattern that start a new line .
data temp;
infile 'c:\temp\have.csv' firstobs=2 length=len ;
input have $varying8000. len;
if missing(have) then delete;
if prxmatch('/^\d{1,},\d{4}/',have) then group+1;
run;
data temp2;
do until(last.group);
set temp;
by group;
length row $ 30000;
row=cats(row,have);
end;
keep group row;
run;
data want;
set temp2;
n=scan(row,1,',','mq');
recommendationid=scan(row,2,',','mq');
language=scan(row,3,',','mq');
review=scan(row,4,',','mq');
run;
Of course, I am sorry to not have included it in the first place.
I attached the original csv file.
Another approach (which will get this in a single pass):
filename readme "c:\temp\have.txt";
data recommendataions;
length n recommendationid 8 language $20 Recommend $1000;
drop _: r;
infile readme dsd _infile_=raw truncover;
input n recommendationid language:$20. R:$1000.;
if countc(raw,'"')=2 then do;
Recommend=r;
output;
end;
else do;
_first=not (missing(n) and missing(language));
if _first then do until (_n>0);
input ;
_n=find(raw,'"');
if _n=0 then do;
R=catx(' ',r,raw);
end;
else do;
R=catx(' ',R,substr(raw,1,_n-1));
if find(r,'"') then Recommend=scan(r,1,'"');
else Recommend=R;
output;
leave;
end;
end;
end;
run;
Result:
Obs | n | recommendationid | language | Recommend |
---|---|---|---|---|
1 | 0 | 77060058 | english | Really great game! The story is the most ambitious and inventive campaign ive ever played, it's a must play for anyone who loves Guns, Horses, cowboys, or video games. The Story is emotional and rich and is worth every second. The online portion of this game has gotten better since the original release in 2018; However, there is still a lot to be desired. Rockstar is slowly updating it, but this game is worth it for the story mode alone. |
2 | 1 | 77059990 | english | Red Dead Redemption 2. Is there anything I can say that will suffice in describing this masterpiece??? RDR2 is SUCH a good game. It’s definitely my favorite game. Its price is high for good reason. The storage space it takes is worth it. RDR2 is a game that has such an immersive story line. It’s a long, but **bleep** near perfect game. This is the first game that has made me bawl like a baby. If you get this game, or if you’ve already played it, you’ll understand why I cried... That said, I highly recommend this game. Again, **bleep** near perfect. |
3 | 2 | 77058854 | english | how to eat monkey ye s |
It depends on what exact characters and character combinations are in the quoted strings.
Comma and carriage control are characters, but "line break" is not a character, it is a concept. SAS expects end of line characters to indicate the end of each line. The INFILE statement lets you use the TERMSTR= option choose between three different possible end of line character strings. It could be a single linefeed character, '0A'x, a single carriage return character, '0D'x, or a carriage return + line feed combination.
So if the characters that your file is using as the end of line marge is CR+LF and the characters embedded in the strings are only single LF or single CR then just adding TERMSTR=CRLF to your INFILE statement will let you read the file as it is.
Note it is hard to tell exactly what you have from text pasted into this forum as the forum editor could be modifying it for its purposes.
To see what is in the file look at the actual character codes. The easiest way is to tell SAS to read it as fixed length blocks and dump it to the log using the LIST statement. So to read the first 500 bytes of the file you could use something like this.
data _null_;
infile 'myfile.csv' lrecl=100 recfm=f obs=100;
input;
list;
run;
If you see that the actual end of line characters and the "line breaks" in the middle of the string are using the same actual characters then you will probably want to pre-process the file to fix that.
You could try this %replace_crlf() macro
When I try to use TERMSTR=CRLF, it does not give me any output at all.
Attached would be the original csv file.
That is because the end of line markers in that file are just linefeed, which is the default for Unix files. Also that file does NOT use semicolon as the delimiters, just normal old commas.
So fix the embedded LFs and PROC IMPORT can do a reasonable jobs of guessing what is in the file.
filename old "c:\downloads\1174180_RedDeadRedemption2_edited2.csv" ;
filename new "c:\downloads\fixed.txt";
%replace_crlf(old,new);
proc import datafile=new out=want replace dbms=csv ;
run;
proc means;
run;
But I suspect that AUTHOR_STEAMID should NOT be defined a NUMERIC. You normally do not need to do arithmetic with ID values. AUTHOR_LAST_PLAYED also does not look like it should be numeric, but its values are not so large that they will cause trouble if stored as a number.
So just write your own data step to read the file. You can write something that is MUCH clearer and easier to understand than the stuff that PROC IMPORT emits.
data WORK.WANT ;
infile NEW DSD TRUNCOVER firstobs=2 ;
length
n 8
recommendationid 8
language $7
review $1248
timestamp_created 8
timestamp_updated 8
voted_up $5
votes_up 8
votes_funny 8
weighted_vote_score 8
comment_count 8
steam_purchase $4
received_for_free $5
written_during_early_access $5
author_steamid $20
author_num_games_owned 8
author_num_reviews 8
author_playtime_forever 8
author_playtime_last_two_weeks 8
author_playtime_at_review 8
author_last_played 8
;
format timestamp_created timestamp_updated author_last_played datetime19.
author_playtime_forever author_playtime_last_two_weeks author_playtime_at_review time12.
;
input n -- author_last_played;
run;
proc summary;
var _numeric_;
output out=summary ;
run;
proc print data=summary;
where _stat_ =: 'M';
id _stat_;
run;
Results
votes_ weighted_ _STAT_ _TYPE_ _FREQ_ n recommendationid timestamp_created timestamp_updated votes_up funny vote_score MIN 0 11580 0.0000 59294641.00 04DEC2009:18:07:58 23MAY2010:15:42:05 0.00 0.000 0.00000 MAX 0 11580 99.0000 77060058.00 05OCT2010:04:02:21 05OCT2010:04:02:21 3012.00 339.000 0.96853 MEAN 0 11580 49.4309 72591600.59 15JUL2010:23:53:39 28JUL2010:00:32:56 2.08 0.353 0.17544 author_ author_ author_ comment_ num_games_ author_num_ playtime_ author_playtime_ playtime_ _STAT_ count owned reviews forever last_two_weeks at_review author_last_played MIN 0.0000 0.00 1.00 0:00:05 0:00:00 0:00:05 05DEC2009:06:06:35 MAX 68.0000 9708.00 1886.00 91:16:14 5:29:45 89:16:07 07OCT2010:05:59:45 MEAN 0.0959 122.64 10.08 2:01:41 0:06:58 1:10:39 29AUG2010:03:06:36
That is the default replacement character that the %replace_crfl() macro used for the embedded linefeed characters.
You can specify something else when you call the macro. Read the source code for the macro to get more help.
%macro replace_crlf
/*----------------------------------------------------------------------------
Replace carriage return or linefeed characters that are inside quotes
----------------------------------------------------------------------------*/
(infile /* Fileref or quoted physical name of input file */
,outfile /* Fileref or quoted physical name of output file */
,cr='\r' /* Replacement string for carriage return */
,lf='\n' /* Replacement string for linefeed */
);
@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro.
@jimbobob wrote:
@Tom is there a way to run that macro on multiple files, I have this issue with lots of files we get everyday, is there a way to use this macro to loop thru all the files in the directory and apply the cleanup macro.
The easiest way is to make a dataset with the names of the files and use that to generate a separate macro call for each file.
There are plenty of postings on the site (and even some examples in the SAS manuals, see DREAD() function) for how to make the list.
data _null_;
set list ;
call execute(cats('%nrstr(replace_crlf)(',quote(trim(filename)),',',quote(cats(filename,'.fixed')),')'));
run;
@Tom ,
It seems that your macro ONLY can handle the '0D0A' in the quote,but what if they are not in quote(like the following picture) ?
I run your macro and get ERROR. but mine don't.
Yes.
If the line breaks are not in quotes deciding whether or not there are extra line breaks is an impossible task.
You can make it work if you make other assumptions about the file. Things like, the delimiter will never appear as part of the data. The extra line breaks will appear in only specific field(s).
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.