I have troubles reading CSV files in with the following situation occurs:
<xx><delim><quote><……..><quote><delim><delim><xx>. So some ware in a recordline there is a delimiter, followed with a quote, then one or more fields, then some characters closed with a quote before the next delimiter. I tried to solve it with the DSD option but that doesn’t work. The DSD skips the delimiter between the quotes. Without DSD two delimiters behind each others are transformed to one delimiter. Double delimiters can occur because it represents a missing value for that variable. The problem is the last quote directly followed by a delimiter.
Who can help me to solve this problem?
Following below are two identical examples. The first is with DSD, the second without.
* Infile with DSD. Record 3 (KEY='C') fails because the string 's Gravenhage,Mataw' leads to one variable;
data work.OPER_A_DSD;
infile datalines delimiter = "," MISSOVER DSD;
attrib KEY length = $12;
attrib CITY length = $60;
attrib NAME length = $50;
attrib EMPTY length =$1;
attrib EX length = $1;
input KEY CITY NAME EMPTY EX $;
put _all_;
datalines;
A,Utrecht,James,,X
B,'s Gravenhage,Scott,N,X
C,'s Gravenhage,Mataw',N,X
D,'s Gravenhage,Mataw',,X
E,'s Gravenhage,Mata'w,,X
;
run;
* Infile without DSD. Record 1 (KEY='A') fails because 2 cancatenated delimitters (,,) is converted to one. The value X is placed in the variable EMPTY instead of EX;
data work.OPER_A_NODSD;
infile datalines delimiter = "," MISSOVER ;
attrib KEY length = $12;
attrib CITY length = $60;
attrib NAME length = $50;
attrib EMPTY length =$1;
attrib EX length = $1;
input KEY CITY NAME EMPTY EX $;
put _all_;
datalines;
A,Utrecht,James,,X
B,'s Gravenhage,Scott,N,X
C,'s Gravenhage,Mataw',N,X
D,'s Gravenhage,Mataw',,X
E,'s Gravenhage,Mata'w,,X
;
run;
Neither of these are valid CSV files. If a CSV file's data item contains a quote, then the whole column should be enclosed in quotes (preferably double to avoid single quotes):
A,"Utrecht,James",,X
B,"'s Gravenhage,Scott",N,X
C,"'s Gravenhage,Mataw'",N,X
D,"'s Gravenhage,Mataw'",,X
E,"'s Gravenhage,Mata'w",,X
Fix the export of data would be my suggestion. Otherwise you would need to do some parsing on the file yourself.
Thanks RW9 but I do not have the possibility to change the export to a 'valid' CSV. It is delivered by an external party. What is a 'valid' CSV file. I don't think there is a formal definition regulating the form of a CSV-file. Therefore there are a lot of variations in CSV-files. Parsing is an alternative but that's the last rescue I want to use. SAS can read 'everything' so I hope there is a better solution.
Ronald
https://en.wikipedia.org/wiki/Comma-separated_values
RFC-4180
Anyways, irrespective of a standard or not, you need to have in the data some clear way of identifying when a quote or delimiter is part of a variable or is a quote or delimiter.
In this instance:
A,Utrecht,James,,X
B,'s Gravenhage,Scott,N,X
C,'s Gravenhage,Mataw',N,X
D,'s Gravenhage,Mataw',,X
E,'s Gravenhage,Mata'w,,X
's Gravenhage,Scott - is the comma part of the text or a delimiter? You may be able to look at this and make some decision, however to write a generic import to cover these types of things is pretty difficult - your assumption is not going to be universal.
Hi Tom,
I see, there is some kind of standard. Next problem is how quick I can motivate the data-deliverer to change this export. That takes a lot of time because this export-format is used by a lot of customers.
'S Gravenhage (with the quote) is the name of a city in the Netherlands. The comma is the delimiter so Scott is the value of the third column.
Ronald
There is a standard, RCF-4180. https://www.ietf.org/rfc/rfc4180.txt
The only change from that you need to tell them to make is to not send any end of line markers in the value of character variables.
One possibility would be to correct the file on the fly. e.g.:
data work.OPER_A_NODSD; input @; _infile_= translate(_infile_,'~',"'"); infile datalines delimiter = "," dsd MISSOVER ; attrib KEY length = $12; attrib CITY length = $60; attrib NAME length = $50; attrib EMPTY length =$1; attrib EX length = $1; input KEY CITY NAME EMPTY EX $; city=translate(city,"'",'~'); name=translate(name,"'",'~'); put _all_; datalines; A,Utrecht,James,,X B,'s Gravenhage,Scott,N,X C,'s Gravenhage,Mataw',N,X D,'s Gravenhage,Mataw',,X E,'s Gravenhage,Mata'w,,X ; run;
Art, CEO, AnalystFinder.com
Hey Cyntia,
Your assumption is correct. Variable EMPTY shout be missing and EX the value 'X'. The first dataset with DSD reads this record correct in because there is a character between quote and the delimiter (letter w). Without DSD it goes wrong in my example because the value 'X' appears in EMPTY.
I forgot something. The single quotes are part of the delivered data and shoud not be skipped.
Did you try the solution I proposed? I think it does what you want!
Art, CEO, AnalystFinder.com
Hi Art,
Parsing every file, what you suggest,will work but is not what I want. Only as a last rescue. Reason for that is that the solution must be implemented in the File Reader transformation in SAS/DI Studio. That is possible but than you have created for every External File that you ant to read a user writen code. Within the File Reader it is not possible to parse then _infile_ lines. That is why I hope that there is an other option in f.i. the infile statement that I can use.
Ronald
If you want to process without DSD option then add a period between the adjacent commas. Here is how you could do it in the data step that reads the data.
data one ;
infile cards dlm=',' truncover ;
input @;
if _infile_=:',' then _infile_='.'||_infile_;
do while (index(_infile_,',,'));_infile_=tranwrd(_infile_,',,',',.,'); end;
input (x1-x6) (:$20.) ;
put _infile_ @40 (x1-x6) (=) ;
cards4;
A,Utrecht,James,,X
B,'s Gravenhage,Scott,N,X
C,'s Gravenhage,Mataw',N,X
D,'s Gravenhage,Mataw',,X
E,'s Gravenhage,Mata'w,,X
;;;;
If you want to use the DSD option then you need to convert those single quotes into something else so that values that start with a single quote don't cause it hunt for the closing quote. You could translate the quotes to something else and then process all of your character variables to reverse the translation. Make sure to use a character that is not in your data file.
data one ;
infile cards dsd truncover ;
input @;
_infile_=translate(_infile_,"`","'");
input (x1-x6) (:$20.) ;
array _c _character_;
do over _c; _c=translate(_c,"'","`"); end;
put _infile_ @40 (x1-x6) (=) ;
cards4;
A,Utrecht,James,,X
B,'s Gravenhage,Scott,N,X
C,'s Gravenhage,Mataw',N,X
D,'s Gravenhage,Mataw',,X
E,'s Gravenhage,Mata'w,,X
;;;;
Hi Tom,
This is a solution but difficult to implement in the File Reader in SAS/DI Studio. Most of the replies I get suggest to parse the file with the _infile_ buffer. Not ideal but maybe there is not an other solution.
Ronald
@RonaldHof wrote:
Hi Tom,
This is a solution but difficult to implement in the File Reader in SAS/DI Studio. Most of the replies I get suggest to parse the file with the _infile_ buffer. Not ideal but maybe there is not an other solution.
Ronald
Will DI studio allow you to create a code step that you could use to create a corrected version of the file that you could then read using normal process?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.