I work in an organization where we are given occupational analysis data for each career field in our organization. The file is sent to us in a CSV file with comma delimiters. These datasets are large - up to 7,000 observations and 3400+ variables. I created a SAS script that reads the datasets in as is and then saves it as a SAS database to serve as both an archive and master database. I then have a step that drops all columns with a certain prefix. This is preferrable because while variable names across different files are not always the exact same, their prefixes are. I end up keeping between 80-120 variables for our purposes. I have avoided using the INFILE step because my data are mixed and because variable names are not always consistent, and they are not always in the same order. The SAS script I created works wonderful for datasets with CSV files that don't have any spaces in the first row (variable name). But I have recently run into some files with empty spaces in the first row. The error I am getting seems to lie in the GETNAMES option. When GETNAMES=NO, SAS successfully reads in all (in this case) 3168 variables (the correct number). The obvious limitation is now my data has no variable names so I can't drop the data I need to drop: proc import datafile="&path\xxxxx.csv" out=xxxxx.xxxxx dbms=csv replace; getnames=no; datarow=2; run; When GETNAMES=YES, SAS gets angry: proc import datafile="&path\xxxxx.csv" out=xxxxx.xxxxx dbms=csv replace; getnames=yes; datarow=2; run; Despite me setting DBMS=csv, when it encounters two consecutive commas (e.g., empty space) it interprets the second comma as the name of a numeric variable. Then it drops successive "comma variables" and then, in one instance, defines a comma as a character variable and then specifies an invalid variable width: 1882 informat xxxxx best32. ; 1883 informat ","N best32. ; 1884 informat ","N best32. ; 1885 informat ","N best32. ; 1886 informat xxxxx best32. ; NOTE: SCL source line. 1887 informat ","N $36. ; ____ 499 WARNING: Variable ','n has already been defined as numeric. ERROR 499-185: Width specified for informat is invalid. 1888 informat xxxxx best32. ; 1889 informat VAR1850 $14. ; 1890 informat VAR1851 $17. ; Output further down assignment formats: 5050 format xxxxx best12. ; 5051 format ","N best12. ; 5052 format ","N best12. ; 5053 format ","N best12. ; 5054 format xxxxx best12. ; NOTE: SCL source line. 5055 format ","N $36. ; ____ 29 WARNING: Variable ','n has already been defined as numeric. ERROR 29-185: Width specified for format is invalid. 5056 format xxxxx best12. ; 5057 format VAR1850 $14. ; 5058 format VAR1851 $17. ; So, in addition to not correctly recognizing consecutive commas as a blank space but as a variable, it also drops successive "comma variables" and then creates specifies an invalid character format width. The result is that all "comma variables" except the first one labeled as numeric are dropped, reducing my final variable number from 3,168 (correct) to 3,165 (wrong) and the informat/format width error stops any output from occuring. However, for some reason, all columns missing a variable name after the last column with a variable name correctly relabels itself as VARXXXX. FYI I have removed variable names and replaced them with xxxxx. The data I am working with is rather sensitive and variable names are descriptive.
... View more