Solved: How to import a CSV file that includes long strings, which contain com...

xyxu · Posted 06-06-2020 04:02 PM

The CSV file has many columns, and 2 of these columns' values are text (short paragraphs). I am using

delimiter = ','

It seems SAS recognizes any comma inside the paragraphs as a delimiter and reports an error. Is there a good way to properly import such files?

Patrick · Posted 06-06-2020 06:11 PM

You have a line feed (LF) in your data. Add the following to your Infile statement and see if this helps: TERMSTR=CRLF

View solution in original post

ballardw · Posted 06-06-2020 04:23 PM

The infile option DSD should allow you to read the data if the value in the data is inside quotes.

If the values contain quotes in the body you may have issues though.

xyxu · Posted 06-06-2020 04:32 PM

I used DSD option, but still get errors when SAS goes over a comma inside the string.

Patrick · Posted 06-06-2020 04:35 PM

The strings with commas need to be quoted. If so then below two options should work (for the Infile statement DSD adds the "magic" you're after).

%let path=%sysfunc(pathname(work));
data _null_;
  file "&path/text.csv";
  put 'col1,col2,col3';
  put '1,"ab, cd", xyz';
  stop;
run;

/* using proc import */
proc import 
  file="&path/text.csv"
  out=want1
  dbms=csv;
  delimiter = ",";
  getnames=yes;
  guessingrows=max;
run;

proc print data=want1;
run;

/* using data step */
data want2;
  infile "&path/text.csv" dsd truncover firstobs=2;
  informat col1 best32. col2 col3 $10.;
  input col1 col2 col3;
run;

proc print data=want2;
run;

xyxu · Posted 06-06-2020 06:05 PM

Thanks! Your solution helps me realize that the problem I am getting may be different from what I initially thought. Please see attached for a one-row example. I got errors when using the following code:

data want;
	infile "&my_path\test2.csv" dsd truncover firstobs = 2;
	informat 
		NUMBER 10.
		Current_Incremental_Terms Current_Covenants Current_Other $100.
		EBITDA $10.
		FirstLienLev TotalLev 10.
		Equity_perc percent5.2
		Revolver $10.
		Tla Concurrent_Bonds 10.
		Purpose_comment $50.;
	input NUMBER Current_Incremental_Terms	Current_Covenants Current_Other EBITDA FirstLienLev TotalLev Equity_perc Revolver Tla Concurrent_Bonds Purpose_comment;
run;

While the long strings are indeed quoted in the csv file, SAS still breaks the string between commas. Could you tell what's causing the error?

Patrick · Posted 06-06-2020 06:11 PM

You have a line feed (LF) in your data. Add the following to your Infile statement and see if this helps: TERMSTR=CRLF

xyxu · Posted 06-06-2020 06:17 PM

That works! After that, the remaining errors are due to data formats (numerical value = "N/A" when it should indeed be missing). So it should be fine now. Thank you!

Patrick · Posted 06-06-2020 06:23 PM

You probably shouldn't mark your own post as solution. That's normally not very appreciated here in the forums.

For the N/A issue and if you just want this to become a missing use the ??. Done in below code for Concurrent_Bonds as there was another issue in your data.

data want;
	infile "~/test/test2.csv" dsd truncover firstobs = 2 termstr=crlf;
	informat 
		NUMBER 10.
		Current_Incremental_Terms Current_Covenants Current_Other $100.
		EBITDA $10.
		FirstLienLev TotalLev 10.
		Equity_perc percent5.2
		Revolver $10.
		Tla Concurrent_Bonds 10.
		Purpose_comment $50.;
	input NUMBER Current_Incremental_Terms	Current_Covenants 
    Current_Other EBITDA FirstLienLev 
    TotalLev Equity_perc Revolver Tla 
    Concurrent_Bonds ?? best32.
    Purpose_comment
    ;
run;

proc print data=want;
run;

xyxu · Posted 06-06-2020 06:35 PM

That was a mis-click. Already corrected. Thanks again!

ballardw · Posted 06-06-2020 06:42 PM

@xyxu wrote:

That works! After that, the remaining errors are due to data formats (numerical value = "N/A" when it should indeed be missing). So it should be fine now. Thank you!

Depending on the actual desired informat you can create a custom informat using the Proc Format invalue statement to read the N/A "cleanly" as missing. By clean I mean that there will not be any "invalid data" messages.

Example:

proc format ;
invalue mybest (upcase)
"N/A"=.
other = best8.    /*<= informat here should be what you want for actual values*/
;
run;

data example;
   informat x mybest. y best8.;
   input x y;
datalines;
1 1
n/a n/a
N/a 3
n/A 4
345.5  16
;

The Invalue statement names the informat you are creating, the option (UPCASE) turns any text into upper case before comparing to the list of values, otherwise it is an exact comparison.

Notice in the example data I only assigned the informat to one of the variables to show that you still get the invalid data for the variable without the format but the X is not causing the messages.

Going one step further with the informats if the variables are supposed to contain only a certain range of values, such as survey question response codes of 1,2,3,4 and 5 you can validate the data as it is read. This can get tricky with continuous variables because the left side of the definitions should be explicit text values. Here's a brief example.

proc format ;
invalue lickert
'1','2','3','4','5'= f1.
' ','.'  = .
other = _error_
;
run;

data example;
   infile datalines truncover;
   informat x lickert.;
   input x ;
datalines;
1 
2
.
 
99
5
;

Note the two ways to mention "missing" for input. Also, if your actual data had a numeric code, such as 99 to indicate missing, you might want to assign missing when read as well. Example for the interested reader: modify the informat LIckert to do that.

Bonus points for looking up "special missing" and assigning a special missing value to indicate that a missing code was encountered and not just a blank column in the data.

Tom · Posted 06-07-2020 10:43 AM

Some notes on creating an INFORMAT to handle NA, N/A, null and other text to replace missing values.

You could have the N/A text converted to a special missing if you need to distinguish it from the normal missing or empty value.

To have the informat just use the normal input method you can use the keyword _SAME_ on the right of the equal sign.

To specify the nested use of another informat use square brackets, [ ], around the informat name on the right of the equal sign. There is no need to use the BEST alias for the normal numeric informat, what the documentation calls the w.d informat. If you want to use an alias it might be clearer to use F instead BEST, it is not clear what "best" would mean in terms of an informat anyway.

proc format ;
invalue na (upcase)
  'N/A','NA' = .N
  other = _same_
;
run;

data test;
  input x na32. ;
  put x= ' ' _infile_;
cards;
12
0.5
n/A
NA
.
12345
;

proc format ;
invalue na_comma (upcase)
  'N/A','NA' = .N
  other = [comma.]
;
run;

data test;
  input x na_comma32. ;
  put x= ' ' _infile_;
cards;
12
0.5
n/A
NA
.
12,345
;

ballardw · Posted 06-06-2020 06:14 PM

You can post text without it getting reformatted by the message windows by copying the line from a text editor like NotePad, WordPad or similar. Then open a code box on the forum using the </> and paste.

So your attachment looks like:

NUMBER,Current Incremental Terms,Current Covenants,Current Other,EBITDA,FirstLienLev,TotalLev,Equity %,Revolver,Tla,Concurrent Bonds,Purpose comment
2413,"Incremental facilities allow leverage to increase to 5.5x net first-lien, 7.25x secured, and 7.75x net total through unsecured debt, which could either cushion against an EBITDA decline or allow incremental debt; company has access to $165 million F&C beyond the ratio debt, and those loans would be governed by 75 bps of MFN protection, subject to a six month sunset
",,"Privately-placed $355M SLTL; 24 months of portability, to a financial sponsor with at least $1 billion of AUM  leverage is capped at 5.5x net first-lien and 7.75x net total; the other portability requirement is that the sponsor contribute 35% equity to the transaction.",,5.1,7.2,,75,,`,Fund a $113M dividend and refinance existing debt

And your actual issue is a line feed, carriage return (or both) depending on operating system and what program had the data entry job.

There is a chance, depending on which actual characters are appearing, that you could use the INFILE option TERMSTR=CRLF. If the character causing the issue in the middle of the text is a single line feed or single carriage return the option CRLF requires both a carriage return and line feed character to end the line. However if the actual data entry was done in some application that allowed both characters (use of Enter Key for example) when entering data that may not work.

How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Re: How to import a CSV file that includes long strings, which contain commas inside?

Registration is open

SAS Training: Just a Click Away