BookmarkSubscribeRSS Feed
rgdfwfw
Calcite | Level 5

I work in an organization where we are given occupational analysis data for each career field in our organization. The file is sent to us in a CSV file with comma delimiters. These datasets are large - up to 7,000 observations and 3400+ variables.

 

I created a SAS script that reads the datasets in as is and then saves it as a SAS database to serve as both an archive and master database. I then have a step that drops all columns with a certain prefix. This is preferrable because while variable names across different files are not always the exact same, their prefixes are. I end up keeping between 80-120 variables for our purposes. I have avoided using the INFILE step because my data are mixed and because variable names are not always consistent, and they are not always in the same order.

 

The SAS script I created works wonderful for datasets with CSV files that don't have any spaces in the first row (variable name). But I have recently run into some files with empty spaces in the first row. The error I am getting seems to lie in the GETNAMES option.

 

When GETNAMES=NO, SAS successfully reads in all (in this case) 3168 variables (the correct number). The obvious limitation is now my data has no variable names so I can't drop the data I need to drop:

 

proc import datafile="&path\xxxxx.csv" out=xxxxx.xxxxx
    dbms=csv replace;
    getnames=no;
    datarow=2;
run;

 

When GETNAMES=YES, SAS gets angry:

 

proc import datafile="&path\xxxxx.csv" out=xxxxx.xxxxx
    dbms=csv replace;
    getnames=yes;
    datarow=2;
run;

 

Despite me setting DBMS=csv, when it encounters two consecutive commas (e.g., empty space) it interprets the second comma as the name of a numeric variable. Then it drops successive "comma variables"  and then, in one instance, defines a comma as a character variable and then specifies an invalid variable width:

 

1882              informat xxxxx best32. ;
1883              informat ","N best32. ;
1884              informat ","N best32. ;
1885              informat ","N best32. ;
1886              informat xxxxx best32. ;
NOTE: SCL source line.
1887              informat ","N $36. ;
                                ____
                                499
WARNING: Variable ','n has already been defined as numeric.
ERROR 499-185: Width specified for informat  is invalid.

1888              informat xxxxx best32. ;
1889              informat VAR1850 $14. ;
1890              informat VAR1851 $17. ;

 

Output further down assignment formats:

 

5050              format xxxxx best12. ;
5051              format ","N best12. ;
5052              format ","N best12. ;
5053              format ","N best12. ;
5054              format xxxxx best12. ;
NOTE: SCL source line.
5055              format ","N $36. ;
                              ____
                              29
WARNING: Variable ','n has already been defined as numeric.
ERROR 29-185: Width specified for format  is invalid.

5056              format xxxxx best12. ;
5057              format VAR1850 $14. ;
5058              format VAR1851 $17. ;

 

So, in addition to not correctly recognizing consecutive commas as a blank space but as a variable, it also drops successive "comma variables" and then creates specifies an invalid character format width. The result is that all "comma variables" except the first one labeled as numeric are dropped, reducing my final variable number from 3,168 (correct) to 3,165 (wrong) and the informat/format width error stops any output from occuring. However, for some reason, all columns missing a variable name after the last column with a variable name correctly relabels itself as VARXXXX.

 

FYI I have removed variable names and replaced them with xxxxx. The data I am working with is rather sensitive and variable names are descriptive.

8 REPLIES 8
rgdfwfw
Calcite | Level 5
Update: In the original file I did a find and replace for the first row in excel, finding cells with no contents ("") and replacing them with a space (" "), and the second proc import statement now works.

So I suppose the newest question becomes is it possible to make the second statement work without having to do the find and replace step?
Tom
Super User Tom
Super User

If you opened the CSV file in Excel then you potentially changed the content.  Excel has a nasty habit of converting strings to numbers and dates.

Tom
Super User Tom
Super User

Why are you using PROC IMPORT?  If you know what data you are getting then just write your own data step to read the file.

Do you not know what variables are in the file?  

Do you not know which are the numeric and which are the character variables?

Do you not know the proper length to set for the character variables?

Do you not know which ones require a special INFORMAT or FORMAT (like date or time values)?  

rgdfwfw
Calcite | Level 5

Hello! Thanks for the fast reply.

 


@Tom wrote:

1. Why are you using PROC IMPORT?  If you know what data you are getting then just write your own data step to read the file.

2. Do you not know what variables are in the file?  

3. Do you not know which are the numeric and which are the character variables?

4. Do you not know the proper length to set for the character variables?

5. Do you not know which ones require a special INFORMAT or FORMAT (like date or time values)?  


1. I am using PROC IMPORT because I want to create a master database of all variables, and there are 3,500 of them. So I want to, at least initially, import all of them - I am generally satisfied with allowing SAS to assign format and informat.

 

I avoided writing a data step because variable names sometime differ. CSV file #1 may have "SSN," #2 has "SSAN," and CSV #3 may have "ParticipantID." However, item level data in these files start with an identital prefix like "WorkTask_" or "ToolsUsed_" so it is immensely easier to drop all variables with a consistent prefix than use a data step with variable names that change.

 

2. I (generally) know what variables are in the file, and I know the prefixes of the variables I want to drop. The goal is to create a SAS script able to process several of these CSV datasets. The number and exact name of all variables I want to keep are not the same across multiple CSV files.

 

3. Yes. However I am satisifed with PROC IMPORT making that determination on this step.

 

4. Maximum length for character variables is sometimes subject to change, but generally less than 50 characters.

 

5. Yes, but am okay with PROC IMPORT doing the assignment.

 

In my follow up it seems that the issue is there are no spaces between commas in the variable name row. When I do a find and replace and insert a space in "empty" variable name cells in excel my PROC IMPORT step runs perfectly. However I am unsure why SAS is having a hard time understanding that two consecutive commas, space or not, equals an "empty" cell.

 

FreelanceReinh
Jade | Level 19

Hello @rgdfwfw,

 

Have you tried changing the value of SAS system option VALIDVARNAME from ANY to V7?

options validvarname=v7;

This would prevent SAS from accepting commas as variable names. It would transform other column headers into valid variable names, though.

 

If you had to insist on VALIDVARNAME=ANY, you could insert the blanks into the first line in a preliminary read/write data step.

andizhang
Calcite | Level 5
Very helpful. this reset the system when getnames doesn't work
Tom
Super User Tom
Super User

Not sure why you have DATAROW=2 on both versions, that doesn't look right.  Isn't the first row the variable names?

If you really have 3,400+ columns then you might be exceeding the limits of PROC IMPORT.  It does not support line lengths longer than 32,767.  So if your average column name is 9 characters then the names plus the commas for 3,400 columns not work.

 

Read the data without the titles and let SAS assign variable names.

Read the title line as data and use it to generate rename statment.

 

Tom
Super User Tom
Super User

Here is code to read the names from the first row and generate and apply a RENAME command.

 

proc import datafile=csv out=newfile replace dbms=dlm ;
  delimiter=',';
  getnames=no;
  datarow=2;
run;
filename code temp;
data _null_;
  infile csv dsd obs=1 length=len column=col truncover;
  file code ;
  put 'RENAME';
  length  name $50 old $32 new $80;
  do i=1 by 1 until( col > len);
     input name @ ;
     old = cats('VAR',i);
     new=nliteral(trim(coalescec(name,old)));
     put old '=' new  ;
  end;
  put ';';
  stop;
run;

proc datasets lib=work nolist ;
  modify newfile;
%inc code ;
run;
quit;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 15442 views
  • 3 likes
  • 4 in conversation