12-24-2014 09:42 AM
I have received a CSV file with non quoted strings. One of the string variables in the file may contain values with a comma.
So when I load the data through Proc Import, those records will all be shifted to an additional variable.
Unfortunately the data is not aligned correctly for me to run it with fixed widths.
Has anyone figured out a way to get around this?
Thanks in advance for your help.
12-24-2014 10:51 AM
Unless there is an additional rule that can help to identify those fields containing comma, I don't see there is a reliable way to import data precisely. I would go back asking your data source to modify their delivery in a way that the data can be easily interpreted, either adding quotes or using special delimiters that can't be found inside your field contents.
12-24-2014 11:23 AM
I would try to correct the input file using a Data step program before the Proc Import if it is possible to create a business rule.
For example: "only the third field has commas and there are ten fields in total".
In this case I would count the commas in each line, if more than 9 (for ten fields) then they belong to the third field so I can insert quotes arround it.
If you are not able to find a correction business rule, I'm afraid you would have to ask them to send you the file again created correctly.
12-24-2014 11:45 AM
I can probably come up with a business rule, my data will have 15 variables, and the 10th variable is the field that may have the additional comma.
I'm lost at how I would insert quotes around the 10th field in a datastep?
12-24-2014 12:06 PM
It depends on your data.
You can import all as text and then clean up afterwards, i.e. if you have 16 variables instead of expected 15 then concatenate the 10th/11th and move everything over one field.
But for absolute accuracy get a new version. If source is Excel file, it doesn't have a default for quoted delimited which is usually the issue
12-24-2014 01:22 PM
The following program fix the incorrect input CSV file than contains 7 fields but the 3rd one includes commas:
length record $200 newrecord $200;
infile datalines ;
do i=2 to &Invalid-1;
do i=&invalid+1 to &total-1;
555,45,JUAN, MARIA, LUIS,XX,18,,BBB
653 %let invalid=3;
654 %let total=7;
656 data _null_;
657 length record $200 newrecord $200;
658 infile datalines ;
660 record= _infile_;
662 do i=2 to &Invalid-1;
666 do i=&invalid+1 to &total-1;
670 put newrecord;
555,45,"JUAN, MARIA, LUIS",XX,18,,BBB
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds