Re: Import CSV with spaces in variable names using DATA step

viola · Posted 08-30-2018 05:59 PM

I am having trouble importing a CSV file using the DATA step where some of the column names have spaces. I have specified that a comma is the delimiter, but SAS is still breaking up the variable names into multiple columns. I am aware that using PROC IMPORT would solve this problem, but for various reasons, is not an option in my case. What am I missing?

data &&sasname&i;
	infile "&&path&i" delimiter="," dsd firstobs=2 truncover;
	input &varlist; 
run;

My CSV looks like the following:

Visit Repeat Key,Form Repeat Key,Item Group Repeat Key,Scheduled Date,Completed Date
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T18:11:17.000Z
1,1,,2018-08-02T07:00:00.000Z,2018-08-02T23:14:02.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:11:16.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:00:21.000Z
1,1,,2018-08-21T07:00:00.000Z,
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T19:59:46.000Z
1,1,,2018-08-02T07:00:00.000Z,2018-08-02T16:58:20.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:12:31.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:53:10.000Z

Reeza · Posted 08-30-2018 06:05 PM

If you're using a data step you would specify the variable list in your INPUT statement.

If you're trying to do this dynamically you could either consider NLITERAL() function or somehow add an underscore before hand.

How is &varlist being created?

@viola wrote:

I am having trouble importing a CSV file using the DATA step where some of the column names have spaces. I have specified that a comma is the delimiter, but SAS is still breaking up the variable names into multiple columns. I am aware that using PROC IMPORT would solve this problem, but for various reasons, is not an option in my case. What am I missing?
data &&sasname&i;
	infile "&&path&i" delimiter="," dsd firstobs=2 truncover;
	input &varlist; 
run;
My CSV looks like the following:

Visit Repeat Key,Form Repeat Key,Item Group Repeat Key,Scheduled Date,Completed Date
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T18:11:17.000Z
1,1,,2018-08-02T07:00:00.000Z,2018-08-02T23:14:02.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:11:16.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:00:21.000Z
1,1,,2018-08-21T07:00:00.000Z,
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T19:59:46.000Z
1,1,,2018-08-02T07:00:00.000Z,2018-08-02T16:58:20.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:12:31.000Z
1,1,,2018-08-01T07:00:00.000Z,2018-08-02T17:53:10.000Z

viola · Posted 08-30-2018 06:38 PM

The varlist is being created in a previous data _null_ step using the header row. Code does need to be dynamic.

data _null_;
infile "&&path&i" obs=1 delimiter=",";
input;
call symputx('varlist',tranwrd(compress(_infile_,'"'),',',' '));
run;

Reeza · Posted 08-30-2018 06:47 PM

You can either compress out the spaces in the variable names, which is probably a good idea, Or you can use NLITERAL like I mentioned to wrap the variable with quotes and add the N at the end to give you a valid SAS name.

If you're compressing them out, you likely want to use COUNTW to count the number of commas which will indicate the number of variables. Then use SCAN() to parse it out and apply the COMPRESS() or NLITERAL function as desired.

@viola wrote:
The varlist is being created in a previous data _null_ step using the header row. Code does need to be dynamic.

data _null_;
infile "&&path&i" obs=1 delimiter=",";
input;
call symputx('varlist',tranwrd(compress(_infile_,'"'),',',' '));
run;

viola · Posted 08-31-2018 06:28 PM

@Tom, in this scenario, this creates the varlist - but then when I go to read the file in, how do I make it so that the header row is read in again with no spaces?

Tom · Posted 08-31-2018 07:58 PM

@viola wrote:
@Tom, in this scenario, this creates the varlist - but then when I go to read the file in, how do I make it so that the header row is read in again with no spaces?

I don't understand the question. The code you posted was using FIRSTOBS=2 to skip the header row. So it doesn't matter whether the header row has spaces or not since it is being skipped.

viola · Posted 09-05-2018 01:13 PM

Please excuse my brain...trying to think through this on a Friday afternoon was not happening. You are right, of course.

Tom · Posted 08-30-2018 10:22 PM

SAS variable names must be 32 characters or less and can use only letter, digits and underscores. They cannot start with a number.

You could try just turning the spaces into underscores.

data _null_;
  infile "&&path&i" obs=1 ;
  input;
  call symputx('varlist',translate(trim(_infile_),' _',', '));
run;

If you set the option VALIDVARNAME=ANY then you can have spaces and other characters, but then you need to use name literals.

data _null_;
 infile "&&path&i" dsd obs=1 eof=save ;
 length varlist $30000 name $32 ;
 input name @@ ;
 varlist =catx(' ',varlist,nliteral(name));
 retain varlist;
return;
save:
call symputx('varlist',varlist);
run;

Kurt_Bremser · Posted 08-31-2018 03:26 AM

I strongly recommend @Tom's suggestion #1, as working without named literals is much easier in future code based on theis dataset.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 08-31-2018 09:45 AM

Why not ?

options validvarname=any;

proc import datafile=..............

viola · Posted 08-31-2018 06:26 PM

@Ksharp, not an option here. It would take me too long to explain why but I need to use a data step (see my original post).

ballardw · Posted 08-31-2018 06:54 PM

@viola wrote:
@Ksharp, not an option here. It would take me too long to explain why but I need to use a data step (see my original post).

I routinely use Proc import to build a skeleton of a data step to read files. When you use proc import for a delimited file then the LOG will contain the data step code generated by SAS to read the file. Best is make sure you use a large value for the GUESSINGROWS option to get reasonable results.

Copy that code from the log into the editor and go to town.

Search and replace variable names as needed, adjust lengths, add labels.

If you remove the generated statements such as

303   /**********************************************************************
304   *   PRODUCT:   SAS
305   *   VERSION:   9.4
306   *   CREATOR:   External File Interface
307   *   DATE:      31AUG18
308   *   DESC:      Generated SAS Datastep Code
309   *   TEMPLATE SOURCE:  (None Specified.)
310   ***********************************************************************/

and the ones referenciging _EFFIERR_ no one else may ever know that proc import was used.

This will generate variable names with _ instead of imbedded spaces or special characters or leading numeric characters if your headers have such. If you have variables that have the same column header for more than 32 characters you may get some VARx.

If you have multiple column heading with the same text that are shorter such as "TOTAL" you will get different variables such as TOTAL2 or TOTAL127 depending on a number of factors I've never spent a lot of time worrying about. I generally make more descriptive names for the such as "Total_income" Total_hours" or what ever.

SAS Innovate 2025: Call for Content

Classroom Training Available!