Liz_Perkin
Calcite | Level 5

Hi,

 

I'm importing data (I've set up a macro to import all files from a certain location) using infile.

There is one troublesome variable, "Life Assured Name", which very occasionally also contains a comma. For example, "Mary Jane, Amy" needs to be read into one cell, but all of the data shifts along because the comma is set as my delimiter.

In the csv file the data does not contain quotation marks, only commas.

Here is my code:

DATA dsn&cnt;
	LENGTH
		'Promotion Code'n   8
		'Promotion Name'n $ 35
		'Plan Name'n      $ 3
		'Commission Rate'n  8
		Channel           $ 1
		'Policy Number'n    8
		'Life Assured Name'n $ 45
		'Life Assured DoB'n  8
		'Effective Date of Assurance'n 8
		Premium             8
		'Initial Commission'n 8
		Clawback            8
		'Renewal Commission'n 8
		'Total Commission'n  8
		'Effective Date'n    8
		'Third Party Reference'n 8
		'Settlement Reference'n $ 26
		FNAME $200.;
	FORMAT
		'Promotion Code'n BEST5.
		'Promotion Name'n $CHAR35.
		'Plan Name'n     $CHAR3.
		'Commission Rate'n BEST3.
		Channel          $CHAR1.
		'Policy Number'n BEST7.
		'Life Assured Name'n $CHAR45.
		'Life Assured DoB'n DDMMYY10.
		'Effective Date of Assurance'n DDMMYY10.
		Premium          BEST5.
		'Initial Commission'n BEST8.
		Clawback         BEST8.
		'Renewal Commission'n BEST5.
		'Total Commission'n BEST8.
		'Effective Date'n DDMMYY10.
		'Third Party Reference'n BEST6.
		'Settlement Reference'n $CHAR26.;
	INFORMAT
		'Promotion Code'n BEST5.
		'Promotion Name'n $CHAR35.
		'Plan Name'n     $CHAR3.
		'Commission Rate'n BEST3.
		Channel          $CHAR1.
		'Policy Number'n BEST7.
		'Life Assured Name'n $CHAR45.
		'Life Assured DoB'n DDMMYY10.
		'Effective Date of Assurance'n DDMMYY10.
		Premium          BEST5.
		'Initial Commission'n BEST8.
		Clawback         BEST8.
		'Renewal Commission'n BEST5.
		'Total Commission'n BEST8.
		'Effective Date'n DDMMYY10.
		'Third Party Reference'n BEST6.
		'Settlement Reference'n $CHAR26.;
	INFILE "&dir\%qsysfunc(dread(&did,&i))"
		LRECL=32767
		FIRSTOBS=6
		ENCODING="WLATIN1"
		DLM='2c'x
		MISSOVER
		DSD
		filename=fname;
	INPUT
		'Promotion Code'n : ?? BEST5.
		'Promotion Name'n : $CHAR35.
		'Plan Name'n     : $CHAR3.
		'Commission Rate'n : ?? BEST3.
		Channel          : $CHAR1.
		'Policy Number'n : ?? BEST7.
		'Life Assured Name'n : $CHAR45.
		'Life Assured DoB'n : ?? DDMMYY10.
		'Effective Date of Assurance'n : ?? DDMMYY10.
		Premium          : ?? COMMA5.
		'Initial Commission'n : ?? COMMA7.
		Clawback         : ?? COMMA6.
		'Renewal Commission'n : BEST5.
		'Total Commission'n : ?? COMMA7.
		'Effective Date'n : ?? DDMMYY10.
		'Third Party Reference'n : BEST6.
		'Settlement Reference'n : $CHAR25.;
	Filename = Fname;
	File = &cnt;
	ordering = _n_;
run;

 

I've tried using TERMSTR= with no luck, and I'm not sure how to solve this.

 

Any help would be great.

 

Thanks, Liz

12 Replies
Kurt_Bremser
Super User

If the strings that contain delimiters are not enclosed in quotes (because then the DSD option would take care of that), return the faulty data to the sender and request that it be corrected.
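To see Kurt's point in a tiny, self-contained test (names invented): with DSD, a quoted field keeps its embedded comma and is read as one value.

```sas
/* Minimal sketch: DSD treats a quoted field as a single value,
   even when it contains the delimiter */
data demo;
	infile datalines dsd dlm=',' truncover;
	length id 8 name $40 country $20;
	input id name country;
	datalines;
1,"Turner, Alfred",England
2,Hans Olsen,Denmark
;
```

With the quotes in place, record 1 reads name = "Turner, Alfred" as one field; without them, the same line would split into an extra column.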

Liz_Perkin
Calcite | Level 5

Hey,

 

Unfortunately this isn't an option, as I'm importing many files from a system-generated process.

😞

 

Kurt_Bremser
Super User

Then that process needs to be fixed. Either change the delimiter, or have the values enclosed in quotes.

 

It might be that you can come up with a clear working rule for identifying the "non-delimiting" commas, but the resulting code would end up quite convoluted and ugly, as opposed to a simple data step where correct input data is read, because you would need to parse the input line yourself instead of having that done by the INPUT statement.

Reeza
Super User

Take the first 10 lines of your file, replace all that data with fake data, and share the structure.

 

But @Kurt_Bremser is right. If your system is creating a file with commas not masked by quotes, you have no way to ensure your data is read correctly. You may consider counting the number of commas and just dealing with the records that have more than X, but what happens if a record is missing a variable and another variable has an embedded comma? Or if more than one comma is embedded?

 

It looks like you're using Enterprise Guide, which is generating weird code, IMO.

 

If you have the documentation for your data set instead, write a manual data step and see if that works correctly. That would be my first attempt at a fix. 

 

If you don't know how, share your specifications and someone can help with that, or at least show you how it needs to be written.

 

 


@Liz_Perkin wrote:

Hey,

 

Unfortunately this isn't an option as I'm importing many files from a system generated process.

😞

 


 

ChrisNZ
Tourmaline | Level 20

Since the data does not obey basic structure guidelines (such as differentiating between delimiter and data), you are in guessing mode.

 

I would follow @Reeza's recommendation and count the commas in the _INFILE_ buffer.

From there you can choose which of two INPUT statements to use, and hope for the best regarding your assumption about what the commas mean.

 

You can of course make the analysis of the input buffer as complex as you want if counting commas does not cover all cases of how feral the data is.
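A minimal sketch of the two-INPUT-statement idea (file name and layout are invented here): load the record with a trailing @, count the delimiters, then parse with whichever INPUT statement matches.

```sas
/* Sketch: pick an INPUT statement based on how many commas the record has.
   Assumes the expected layout is id,name,amount (2 commas); one extra
   comma is assumed to mean the name field was split in two. */
data guess;
	infile 'c:\tmp\names.csv' dsd dlm=',' truncover;  /* hypothetical file */
	length id 8 name part2 $40 amount 8;
	input @;                          /* load the record, hold the line */
	if countc(_infile_, ',') = 2 then
		input id name amount;         /* normal record */
	else do;                          /* one comma too many */
		input id name part2 amount;
		name = catx(', ', name, part2);  /* rejoin the split name */
	end;
	drop part2;
run;
```

This only covers the single-extra-comma case; as noted above, the buffer analysis can be made as elaborate as the data requires.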

 

 

 

 

Patrick
Opal | Level 21

@Liz_Perkin

As @Reeza points out this is primarily a logical problem and not a SAS problem.

If you can formulate the logic required to identify if a comma is a delimiter or part of the string to be read then I'm sure the community here can assist you with translating such logic into SAS code.

Ideally: Post representative sample data so we understand what you're dealing with.

novinosrin
Tourmaline | Level 20

Can you share the csv file (if that doesn't violate a non-disclosure or data security policy or something of that kind) with the variables you want to read? Somebody here may write the code for you.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi Liz_Perkin

 

If you have a comma-separated record without quotes, I don't think it is possible to get SAS to understand that one of the delimiters should be ignored. But a possible solution could be to preprocess the files and change that delimiter to something else. It could always be changed back after the record is split into variables.

 

I made the following test. It depends on all records being complete, i.e. there is always the same number of fields and therefore the same number of delimiters, if one does not count the extra delimiter in the "Life Assured Name" field. But in that case it works.

* get some test records in a file;
Data _null_;
	file 'c:\tmp\test.txt' lrecl=80;
	put '1,22,Hans Olsen,Denmark,333,4';
	put '1111,2,Turner, Alfred,England,3333,4';
	put '1,222,Horst Mayer,Germany,3,4444';
run;

* Read the file as a delimited file; 
data test; infile 'c:\tmp\test.txt' dsd dlm=',' missover;
	length v1 v2 8 v3 v4 $40 v5 v6 8;
	input
		'V1'n : ?? BEST5.
		'V2'n : ?? BEST5.
		'V3'n : $CHAR40.
		'V4'n : $CHAR40.
		'V5'n : ?? BEST5.
		'V6'n : ?? BEST5.;
run;

* Read the file and write another file.
* If 6 delimiters and not 5, change the third to #;
data test2; 
	infile 'c:\tmp\test.txt' lrecl=80 truncover;
	file 'c:\tmp\test2.txt' lrecl=80;
	length rec $80;
	drop pos len;
	input rec $char80.;
	if count(rec,',') = 6 then do;
		call scan(rec,4,pos,len,',');
		substr(rec,pos-1,1) = '#';
	end;
	put rec;
run;

* Read the new file as a delimited file; 
data test2; infile 'c:\tmp\test2.txt' dsd dlm=',' missover;
	length v1 v2 8 v3 v4 $40 v5 v6 8;
	input
		'V1'n : ?? BEST5.
		'V2'n : ?? BEST5.
		'V3'n : $CHAR40.
		'V4'n : $CHAR40.
		'V5'n : ?? BEST5.
		'V6'n : ?? BEST5.;
run;

The first read: [screenshot test.png]

The second read: [screenshot test2.png]

Liz_Perkin
Calcite | Level 5

Hi,

 

Firstly thank you everyone for your help on this.

 

I've done a few things that seem to cover all bases.

 

Used the infile/scan/countc functions to detect and re-adjust for an extra "," (I love this solution - thank you)

A load of checking to ensure that key fields are fully populated at the end of the process hence picking up any records that have shifted across

Working with suppliers to understand/amend input data
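For the second point, a hedged sketch of such a completeness check (the dataset name dsn1 is assumed to be one of the imported dsn&cnt sets; the variable names follow the import step above): any record where a key field failed to populate is routed to a reject set for inspection.

```sas
/* Sketch: split imported records into clean vs. suspect, where a missing
   key field suggests the columns shifted because of an extra comma */
data clean suspect;
	set dsn1;                         /* hypothetical: one imported file */
	if missing('Policy Number'n) or missing('Effective Date'n)
		then output suspect;          /* likely shifted or malformed */
		else output clean;
run;
```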

 

Annoyingly it's only 2 records (out of 9,500) that this has occurred on, but as Finance calculations sit off the back end, I can't get it wrong.

 

Thanks again - and enjoy your week!

 

Liz

 

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi Liz_Perkin (continued)

 

Instead of writing an intermediate file, it would be nicer and more efficient to drop reading the file as a delimited file in an extra step and do the whole thing in one pass with SCAN functions. Notice that I changed the separator count to COUNTC() rather than COUNT() in order to cope with missing values. The following gives the same result as test2 above, plus the file name.

 

* Read the file and use scan functions to split in variables;
* If 6 delimiters and not 5, change the third to # before splitting;
data test3; 
	length v1 v2 8 v3 v4 $40 v5 v6 8 filename fname $100;
	infile 'c:\tmp\test.txt' lrecl=80 truncover filename=fname;
	drop pos len;
	input;
	if countc(_infile_,',') = 6 then do;
		call scan(_infile_,4,pos,len,',');
		substr(_infile_,pos-1,1) = '#';
	end;

	v1 = input(scan(_infile_,1,','),??BEST5.);
	v2 = input(scan(_infile_,2,','),??BEST5.);
	v3 = scan(_infile_,3,',');
	v4 = scan(_infile_,4,',');
	v5 = input(scan(_infile_,5,','),??BEST5.);
	v6 = input(scan(_infile_,6,','),??BEST5.);
	filename = fname;
run;

 

 

Tom
Super User Tom
Super User

@ErikLund_Jensen wrote:

Hi Liz_Perkin (continued)

 

Instead of writing an intermediate file, it would be nicer and more efficient to drop reading the file as a delimited file in an extra step and do the whole thing in one pass with SCAN functions. Notice that I changed the separator count to COUNTC() rather than COUNT() in order to cope with missing values. The following gives the same result as test2 above, plus the file name.

 

* Read the file and use scan functions to split in variables;
* If 6 delimiters and not 5, change the third to # before splitting;
data test3; 
	length v1 v2 8 v3 v4 $40 v5 v6 8 filename fname $100;
	infile 'c:\tmp\test.txt' lrecl=80 truncover filename=fname;
	drop pos len;
	input;
	if countc(_infile_,',') = 6 then do;
		call scan(_infile_,4,pos,len,',');
		substr(_infile_,pos-1,1) = '#';
	end;

	v1 = input(scan(_infile_,1,','),??BEST5.);
	v2 = input(scan(_infile_,2,','),??BEST5.);
	v3 = scan(_infile_,3,',');
	v4 = scan(_infile_,4,',');
	v5 = input(scan(_infile_,5,','),??BEST5.);
	v6 = input(scan(_infile_,6,','),??BEST5.);
	filename = fname;
run; 

Why do you need the SCAN() functions if you have already modified the line?

input @ ;
...
substr(_infile_,pos-1,1) = '#';
...
input v1-v6 ;
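Putting that suggestion together into a runnable sketch (same test file and layout as the earlier examples; the TRANWRD line that restores the comma is my addition, following Erik's remark that the delimiter "could always be changed back"):

```sas
/* Sketch: edit _INFILE_ in place, then let list input do the parsing */
data test4;
	infile 'c:\tmp\test.txt' dsd dlm=',' truncover;
	length v1 v2 8 v3 v4 $40 v5 v6 8;
	input @;                               /* load the buffer, hold the line */
	if countc(_infile_, ',') = 6 then do;  /* one comma too many */
		call scan(_infile_, 4, pos, len, ',');
		substr(_infile_, pos-1, 1) = '#';  /* neutralise the extra comma */
	end;
	input v1-v6;                           /* parse the (possibly edited) line */
	v3 = tranwrd(v3, '#', ',');            /* restore the comma in the name */
	drop pos len;
run;
```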
ErikLund_Jensen
Rhodochrosite | Level 12

@Tom

 

That's true, the SCAN function is not necessary and probably less efficient too. So why? In the last example, my focus was not on writing the smartest code, but only on showing that it could be done in one step. I chose the technique out of habit, because I have used this approach in many programs where a little more is done to the variables, like

Vn = propcase(prxchange('s/(\w+), (\w+)/$2 $1/', -1, scan(_infile_, n)))

 

It happens often to me that I just grab my preferred hammer out of the toolbox if the problem is simple and looks like something I have done before. So thanks for reminding me that there is always a better way if you just sit down and think before you code.
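For anyone following along, a small worked example of that name-swap pattern (the input value is invented):

```sas
/* Swap "Surname, Firstname" to "Firstname Surname" with PRXCHANGE */
data _null_;
	raw   = 'Turner, Alfred';
	fixed = propcase(prxchange('s/(\w+), (\w+)/$2 $1/', -1, raw));
	put fixed=;   /* fixed=Alfred Turner */
run;
```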

 

 


Discussion stats
  • 12 replies
  • 5527 views
  • 0 likes
  • 8 in conversation