The CSV file has many columns, and 2 of these columns' values are text (short paragraphs). I am using
delimiter = ','
It seems SAS recognizes any comma inside the paragraphs as a delimiter and reports an error. Is there a good way to properly import such files?
You have a line feed (LF) in your data. Add the following to your Infile statement and see if this helps: TERMSTR=CRLF
The infile option DSD should allow you to read the data if the value in the data is inside quotes.
If the values contain quotes in the body you may have issues though.
The strings with commas need to be quoted. If so then below two options should work (for the Infile statement DSD adds the "magic" you're after).
%let path=%sysfunc(pathname(work));
data _null_;
file "&path/text.csv";
put 'col1,col2,col3';
put '1,"ab, cd", xyz';
stop;
run;
/* using proc import */
proc import
file="&path/text.csv"
out=want1
dbms=csv;
delimiter = ",";
getnames=yes;
guessingrows=max;
run;
proc print data=want1;
run;
/* using data step */
data want2;
infile "&path/text.csv" dsd truncover firstobs=2;
informat col1 best32. col2 col3 $10.;
input col1 col2 col3;
run;
proc print data=want2;
run;
Thanks! Your solution helps me realize that the problem I am getting may be different from what I initially thought. Please see attached for a one-row example. I got errors when using the following code:
data want;
infile "&my_path\test2.csv" dsd truncover firstobs = 2;
informat
NUMBER 10.
Current_Incremental_Terms Current_Covenants Current_Other $100.
EBITDA $10.
FirstLienLev TotalLev 10.
Equity_perc percent5.2
Revolver $10.
Tla Concurrent_Bonds 10.
Purpose_comment $50.;
input NUMBER Current_Incremental_Terms Current_Covenants Current_Other EBITDA FirstLienLev TotalLev Equity_perc Revolver Tla Concurrent_Bonds Purpose_comment;
run;
While the long strings are indeed quoted in the csv file, SAS still breaks the string between commas. Could you tell what's causing the error?
You have a line feed (LF) in your data. Add the following to your Infile statement and see if this helps: TERMSTR=CRLF
That works! After that, the remaining errors are due to data formats (numerical value = "N/A" when it should indeed be missing). So it should be fine now. Thank you!
You probably shouldn't mark your own post as solution. That's normally not very appreciated here in the forums.
For the N/A issue and if you just want this to become a missing use the ??. Done in below code for Concurrent_Bonds
as there was another issue in your data.
data want;
infile "~/test/test2.csv" dsd truncover firstobs = 2 termstr=crlf;
informat
NUMBER 10.
Current_Incremental_Terms Current_Covenants Current_Other $100.
EBITDA $10.
FirstLienLev TotalLev 10.
Equity_perc percent5.2
Revolver $10.
Tla Concurrent_Bonds 10.
Purpose_comment $50.;
input NUMBER Current_Incremental_Terms Current_Covenants
Current_Other EBITDA FirstLienLev
TotalLev Equity_perc Revolver Tla
Concurrent_Bonds ?? best32.
Purpose_comment
;
run;
proc print data=want;
run;
That was a mis-click. Already corrected. Thanks again!
@xyxu wrote:
That works! After that, the remaining errors are due to data formats (numerical value = "N/A" when it should indeed be missing). So it should be fine now. Thank you!
Depending on the actual desired informat you can create a custom informat using the Proc Format invalue statement to read the N/A "cleanly" as missing. By clean I mean that there will not be any "invalid data" messages.
Example:
proc format ; invalue mybest (upcase) "N/A"=. other = best8. /*<= informat here should be what you want for actual values*/ ; run; data example; informat x mybest. y best8.; input x y; datalines; 1 1 n/a n/a N/a 3 n/A 4 345.5 16 ;
The Invalue statement names the informat you are creating, the option (UPCASE) turns any text into upper case before comparing to the list of values, otherwise it is an exact comparison.
Notice in the example data I only assigned the informat to one of the variables to show that you still get the invalid data for the variable without the format but the X is not causing the messages.
Going one step further with the informats if the variables are supposed to contain only a certain range of values, such as survey question response codes of 1,2,3,4 and 5 you can validate the data as it is read. This can get tricky with continuous variables because the left side of the definitions should be explicit text values. Here's a brief example.
proc format ; invalue lickert '1','2','3','4','5'= f1. ' ','.' = . other = _error_ ; run; data example; infile datalines truncover; informat x lickert.; input x ; datalines; 1 2 . 99 5 ;
Note the two ways to mention "missing" for input. Also, if your actual data had a numeric code, such as 99 to indicate missing, you might want to assign missing when read as well. Example for the interested reader: modify the informat LIckert to do that.
Bonus points for looking up "special missing" and assigning a special missing value to indicate that a missing code was encountered and not just a blank column in the data.
Some notes on creating an INFORMAT to handle NA, N/A, null and other text to replace missing values.
You could have the N/A text converted to a special missing if you need to distinguish it from the normal missing or empty value.
To have the informat just use the normal input method you can use the keyword _SAME_ on the right of the equal sign.
To specify the nested use of another informat use square brackets, [ ], around the informat name on the right of the equal sign. There is no need to use the BEST alias for the normal numeric informat, what the documentation calls the w.d informat. If you want to use an alias it might be clearer to use F instead BEST, it is not clear what "best" would mean in terms of an informat anyway.
proc format ;
invalue na (upcase)
'N/A','NA' = .N
other = _same_
;
run;
data test;
input x na32. ;
put x= ' ' _infile_;
cards;
12
0.5
n/A
NA
.
12345
;
proc format ;
invalue na_comma (upcase)
'N/A','NA' = .N
other = [comma.]
;
run;
data test;
input x na_comma32. ;
put x= ' ' _infile_;
cards;
12
0.5
n/A
NA
.
12,345
;
You can post text without it getting reformatted by the message windows by copying the line from a text editor like NotePad, WordPad or similar. Then open a code box on the forum using the </> and paste.
So your attachment looks like:
NUMBER,Current Incremental Terms,Current Covenants,Current Other,EBITDA,FirstLienLev,TotalLev,Equity %,Revolver,Tla,Concurrent Bonds,Purpose comment 2413,"Incremental facilities allow leverage to increase to 5.5x net first-lien, 7.25x secured, and 7.75x net total through unsecured debt, which could either cushion against an EBITDA decline or allow incremental debt; company has access to $165 million F&C beyond the ratio debt, and those loans would be governed by 75 bps of MFN protection, subject to a six month sunset ",,"Privately-placed $355M SLTL; 24 months of portability, to a financial sponsor with at least $1 billion of AUM leverage is capped at 5.5x net first-lien and 7.75x net total; the other portability requirement is that the sponsor contribute 35% equity to the transaction.",,5.1,7.2,,75,,`,Fund a $113M dividend and refinance existing debt
And your actual issue is a line feed, carriage return (or both) depending on operating system and what program had the data entry job.
There is a chance, depending on which actual characters are appearing, that you could use the INFILE option TERMSTR=CRLF. If the character causing the issue in the middle of the text is a single line feed or single carriage return the option CRLF requires both a carriage return and line feed character to end the line. However if the actual data entry was done in some application that allowed both characters (use of Enter Key for example) when entering data that may not work.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.