Hi,
I have received a CSV file with non quoted strings. One of the string variables in the file may contain values with a comma.
So when I load the data through Proc Import, those records will all be shifted to an additional variable.
Unfortunately the data is not aligned correctly for me to run it with fixed widths.
Has anyone figured out a way to get around this?
Thanks in advance for your help.
Unless there is an additional rule that can help to identify those fields containing comma, I don't see there is a reliable way to import data precisely. I would go back asking your data source to modify their delivery in a way that the data can be easily interpreted, either adding quotes or using special delimiters that can't be found inside your field contents.
Haikuo
I figured that was the case.
Thanks for the quick reply.
I would try to correct the input file using a Data step program before the Proc Import if it is possible to create a business rule.
For example: "only the third field has commas and there are ten fields in total".
In this case I would count the commas in each line, if more than 9 (for ten fields) then they belong to the third field so I can insert quotes arround it.
If you are not able to find a correction business rule, I'm afraid you would have to ask them to send you the file again created correctly.
CTorres
Interesting..
I can probably come up with a business rule, my data will have 15 variables, and the 10th variable is the field that may have the additional comma.
I'm lost at how I would insert quotes around the 10th field in a datastep?
Thanks,
KD
It depends on your data.
You can import all as text and then clean up afterwards, i.e. if you have 16 variables instead of expected 15 then concatenate the 10th/11th and move everything over one field.
But for absolute accuracy get a new version. If source is Excel file, it doesn't have a default for quoted delimited which is usually the issue
The following program fix the incorrect input CSV file than contains 7 fields but the 3rd one includes commas:
%let invalid=3;
%let total=7;
data _null_;
length record $200 newrecord $200;
infile datalines ;
input;
record= _infile_;
from=findc(record,',');
do i=2 to &Invalid-1;
from=findc(record,',',from+1);
end;
to=findc(record,',','B');
do i=&invalid+1 to &total-1;
to=findc(substr(record,1,to-1),',','B');
end;
newrecord=cats(substr(record,1,from),'"',substr(record,from+1,to-from-1),'"',substr(record,to));
put newrecord;
datalines;
12,125,JOHN TAYLOR,NC,48,1,XYZ
22,7897,BROWN, PETER,MA,62,2,ZZY
333,1740,PETER WHITE,NY,77.2,1,XYZ
555,45,JUAN, MARIA, LUIS,XX,18,,BBB
;
RUN;
LOG:
653 %let invalid=3;
654 %let total=7;
655
656 data _null_;
657 length record $200 newrecord $200;
658 infile datalines ;
659 input;
660 record= _infile_;
661 from=findc(record,',');
662 do i=2 to &Invalid-1;
663 from=findc(record,',',from+1);
664 end;
665 to=findc(record,',','B');
666 do i=&invalid+1 to &total-1;
667 to=findc(substr(record,1,to-1),',','B');
668 end;
669 newrecord=cats(substr(record,1,from),'"',substr(record,from+1,to-from-1),'"',substr(record,to));
670 put newrecord;
671 datalines;
12,125,"JOHN TAYLOR",NC,48,1,XYZ
22,7897,"BROWN, PETER",MA,62,2,ZZY
333,1740,"PETER WHITE",NY,77.2,1,XYZ
555,45,"JUAN, MARIA, LUIS",XX,18,,BBB
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
676 ;
677 RUN;
CTorres
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.