I'm running SAS 9.3 on Enterprise Guide 5.1 on Windows 7 x64. My overall goal is to Import all CSVs in a given pre-defined folder. Retrieve a handful of fields that show up in all of the CSVs (in this context, SSN, Salary, CSV Filename and a 9-char string "BenGrpCode"). Ensure the fields are structured the same (in this context, that's mostly about scrubbing SSNs in both a 9-digit numeric form w/o leading zeros and a XXX-XX-XXXX character form into something consistent with leading zeros). Append all of the data from each of the original source CSVs into one dataset. Output the appended dataset for use elsewhere. After looking through several very helpful historic questions on this site, I've figured out most of what I'm looking to do except for one piece that should (hopefully) be embarrassingly simple. I think I'm having trouble getting good IF-THEN language for dynamically doing things to a field based on whether its numeric or character. Specifically, I need to take inconsistent SSN fields in numeric formats (w/o leading zeros) and XXX-XX-XXXX character formats and create something consistent so that I can append several datasets together. My existing code is below in its entirety and a snippet of the log with the pertinent error message is below that. Unfortunately, I don't know a good way to share good example data in this specific scenario (but tips are greatly appreciated!), but hopefully I've explained enough about the structure to still be clear. %MACRO AppendCSVs ;
* Generate the list of CSV files in specified location ;
filename DIRLIST pipe 'dir "C:\Users\filepath\Data\*.csv" /b ';
data dirlist ;
infile dirlist lrecl=200 truncover;
input file_name $100.;
run;
* Generate the list of files to loop through ;
proc sql noprint;
select distinct file_name
into :mvals separated by '|'
from dirlist;
%let mdim=&sqlobs;
quit;
* Create the ultimate appended dataset for later use ;
DATA Appended;
LENGTH Filename $100.;
LENGTH SSN $9.;
LENGTH BenGrpCode $9.;
Annual_Salary = .;
RUN;
* Start looping through each file ;
/* %do _i=1 %to 3; */
%do _i=1 %to &mdim;
* Get the file name for this loop iteration ;
%let _v = %scan(&mvals,&_i,|);
* Import the CSV naively to be robust to structure changes ;
PROC IMPORT OUT= file&_i.
DATAFILE= "C:\Users\filepath\Data\&_v."
DBMS=CSV REPLACE;
GETNAMES=YES;
GUESSINGROWS=100000;
DATAROW=2;
RUN;
* Check if SSN is character or numeric ;
DATA _null_;
SET file&_i ;
CALL symputx('vtype',vtype(Employee_SSN));
stop;
RUN;
* Scrub SSN if appropriate ;
DATA file&_i;
SET file&_i;
LENGTH SSN $9.;
IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);
RUN;
* Add a field that equals the file name and only keep the specified fields that you need ;
DATA file&_i (KEEP=Filename Employee_SSN SSN BenGrpCode Annual_Salary);
SET file&_i;
LENGTH Filename $100.;
Filename = "&_v.";
RUN;
* Append data to ultimate dataset;
DATA Appended;
SET Appended file&_i;
RUN;
%end;
* Output the ultimate appended dataset ;
PROC EXPORT
DATA=Appended
OUTFILE="C:\Users\filepath\Output\Appended_Dataset.csv"
DBMS=CSV
LABEL
REPLACE
;
RUN;
%MEND;
%AppendCSVs; While that's the entire code, I'm reasonably confident that my error is somewhere in this snippet where I attempt to do something different to a particular field depending on whether it's numeric or character: * Check if SSN is character or numeric ;
DATA _null_;
SET file&_i ;
CALL symputx('vtype',vtype(Employee_SSN));
stop;
RUN;
* Scrub SSN if appropriate ;
DATA file&_i;
SET file&_i;
LENGTH SSN $9.;
IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);
RUN; When that code runs for a dataset whose SSN is stored in as a character viable (e.g. XXX-XX-XXXX), then I get the following error: NOTE: WORK.FILE3 data set was successfully created.
NOTE: The data set WORK.FILE3 has 3951 observations and 285 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 14.70 seconds
cpu time 14.53 seconds
NOTE: There were 1 observations read from the data set WORK.FILE3.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
(C)
484: LINE and COLUMN cannot be determined.
NOTE: NOSPOOL is on. Rerunning with OPTION SPOOL might allow recovery of the LINE and COLUMN where the error has occurred.
NOTE 484-185: Format $BEST was not found or could not be loaded.
NOTE: Variable C is uninitialized.
NOTE: Invalid argument to function INPUT at line 10640 column 177. It appears that vtype variable is equaling "(C)" or something to that effect (but changing my code to test against "(C)" was not successful). Also, any general tips are appreciated. I have a hunch that there might be a cleverer way to accomplish my overall goals, particularly with the way that I'm PROC IMPORTing the entire datasets and then appending a relatively small portion of them together one-by-one. Thanks in advance!
... View more