Solved: How to Dynamically Differentiate Between Numeric & Character Field in ...

ImSpartacus · Posted 01-16-2018 08:29 AM

I'm running SAS 9.3 on Enterprise Guide 5.1 on Windows 7 x64.

My overall goal is to

Import all CSVs in a given pre-defined folder.
Retrieve a handful of fields that show up in all of the CSVs (in this context, SSN, Salary, CSV Filename and a 9-char string "BenGrpCode").
- Ensure the fields are structured the same (in this context, that's mostly about scrubbing SSNs in both a 9-digit numeric form w/o leading zeros and a XXX-XX-XXXX character form into something consistent with leading zeros).
Append all of the data from each of the original source CSVs into one dataset.
Output the appended dataset for use elsewhere.

After looking through several very helpful historic questions on this site, I've figured out most of what I'm looking to do except for one piece that should (hopefully) be embarrassingly simple.

I think I'm having trouble getting good IF-THEN language for dynamically doing things to a field based on whether its numeric or character. Specifically, I need to take inconsistent SSN fields in numeric formats (w/o leading zeros) and XXX-XX-XXXX character formats and create something consistent so that I can append several datasets together.

My existing code is below in its entirety and a snippet of the log with the pertinent error message is below that. Unfortunately, I don't know a good way to share good example data in this specific scenario (but tips are greatly appreciated!), but hopefully I've explained enough about the structure to still be clear.

%MACRO AppendCSVs ;

* Generate the list of CSV files in specified location ;

filename DIRLIST pipe 'dir "C:\Users\filepath\Data\*.csv" /b ';

data dirlist ;
   infile dirlist lrecl=200 truncover;
   input file_name $100.;
run;

* Generate the list of files to loop through ;

proc sql noprint;
   select distinct file_name 
    into :mvals separated by '|'
    from dirlist;
    %let mdim=&sqlobs;
quit;

* Create the ultimate appended dataset for later use ;

DATA Appended;

	LENGTH Filename $100.;
	LENGTH SSN $9.;
	LENGTH BenGrpCode $9.;
	Annual_Salary = .;

RUN;

* Start looping through each file ; 

/* %do _i=1 %to 3; */

%do _i=1 %to &mdim;

	* Get the file name for this loop iteration ;

	%let _v = %scan(&mvals,&_i,|);

	* Import the CSV naively to be robust to structure changes ; 

	PROC IMPORT OUT= file&_i.
        		DATAFILE= "C:\Users\filepath\Data\&_v."
        DBMS=CSV REPLACE;
     	GETNAMES=YES;
     	GUESSINGROWS=100000;
     	DATAROW=2;
	RUN;

	* Check if SSN is character or numeric ;

	DATA _null_;
		SET file&_i ;
		CALL symputx('vtype',vtype(Employee_SSN));
 		stop;
	RUN;

	* Scrub SSN if appropriate ; 

	DATA file&_i;
		SET file&_i;

		LENGTH SSN $9.;

		IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
			ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);
	RUN;

	* Add a field that equals the file name and only keep the specified fields that you need ;

	DATA file&_i (KEEP=Filename Employee_SSN SSN BenGrpCode Annual_Salary); 
		SET file&_i;

		LENGTH Filename $100.;
		Filename = "&_v.";

	RUN;

	* Append data to ultimate dataset; 

	DATA Appended;
		SET Appended file&_i;
	RUN;

%end;

* Output the ultimate appended dataset ;

PROC EXPORT
	DATA=Appended
	OUTFILE="C:\Users\filepath\Output\Appended_Dataset.csv"
	DBMS=CSV
	LABEL
	REPLACE
	;
RUN;

%MEND;

%AppendCSVs;

While that's the entire code, I'm reasonably confident that my error is somewhere in this snippet where I attempt to do something different to a particular field depending on whether it's numeric or character:

	* Check if SSN is character or numeric ;

	DATA _null_;
		SET file&_i ;
		CALL symputx('vtype',vtype(Employee_SSN));
 		stop;
	RUN;

	* Scrub SSN if appropriate ; 

	DATA file&_i;
		SET file&_i;

		LENGTH SSN $9.;

		IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
			ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);
	RUN;

When that code runs for a dataset whose SSN is stored in as a character viable (e.g. XXX-XX-XXXX), then I get the following error:

NOTE: WORK.FILE3 data set was successfully created.
NOTE: The data set WORK.FILE3 has 3951 observations and 285 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
      real time           14.70 seconds
      cpu time            14.53 seconds
      


NOTE: There were 1 observations read from the data set WORK.FILE3.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

(C)
484: LINE and COLUMN cannot be determined.
NOTE: NOSPOOL is on. Rerunning with OPTION SPOOL might allow recovery of the LINE and COLUMN where the error has occurred.
NOTE 484-185: Format $BEST was not found or could not be loaded.

NOTE: Variable C is uninitialized.
NOTE: Invalid argument to function INPUT at line 10640 column 177.

It appears that vtype variable is equaling "(C)" or something to that effect (but changing my code to test against "(C)" was not successful).

Also, any general tips are appreciated. I have a hunch that there might be a cleverer way to accomplish my overall goals, particularly with the way that I'm PROC IMPORTing the entire datasets and then appending a relatively small portion of them together one-by-one.

Thanks in advance!

Shmuel · Posted 01-16-2018 10:12 AM

Your code:

		IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
			ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);

Your error message:

484: LINE and COLUMN cannot be determined.
NOTE: NOSPOOL is on. Rerunning with OPTION SPOOL might allow recovery of the LINE and COLUMN where the error has occurred.
NOTE 484-185: Format $BEST was not found or could not be loaded.

NOTE: Variable C is uninitialized.

1) Using INPUT function assumes that the input variable is char type, but you already know that it is numeric, therefore

you got the Format $BEST message.

2) Using OPTION SPOOL will help show you the line in log where the error occurs

3) Variable C is uninitialized as the code should be:

IF ("&vtype"="C") ...

View solution in original post

RW9 · Posted 01-16-2018 09:34 AM

Sorry, not even reading most of that. Two things jumped out at me, in reverse order:

"I have a hunch that there might be a cleverer way to accomplish my overall goals"

and

The rest of the post.

The best, repeatable, simple and transparent process is to first document the data. This is far more important than coding or any of that. Write down in a document what the data structure will be, formats, lengths, informats, encoding, value lists, and any other considerations. Include delivery timepoints, contact details etc. Then have it signed off by both parties. This document will then guide you to write a simple basic data step import program to import all the files in one go, following the metadata provided in the documentation. This is the best way to go about it, it provides clarity across the board, it shows the process clearly and transparently, and above all it will always be the same unless a change is requested (in which case the document would be the first updated). As an example of simplifying your code, if all the csv's are read in using the same structure, then a wildcard in the filename will read them all - removes all the macro looping, endless dataset creation and concatenation etc.

The other approach, which you detail is this.

Take a bunch of files, of which you know very little and little to no control over. Write a mass of code which imports each ones, does some guessing based on what is read in (not necessarily correctly). Then attempts to write import datasteps for each of the files based on what its guessed. This process may work, it may even work twice, but it will break (just like Excel + proc import), and then you will spend all that time going back trying to figure it out. Also the process is not transparent, but hidden behind guessing logic.

So which one covers your hunch?

utrocketeng · Posted 01-16-2018 09:38 AM

Greetings, i am not sure if this will help, but i used a solution like this in the past:

DATA thing;
set thing2;
if Vtype('SSN'n) = 'N' then CD = 'SSN'n;
else if Vtype('SSN'n) = 'C' then CD = input('SSN'n,Best32.);
RUN;

note the input vs put when the datatype was identified as 'C'.

you may want to do some static import prior to looping through all of your files. good luck.

Shmuel · Posted 01-16-2018 10:12 AM

Your code:

		IF (&vtype="(C)") THEN SSN = COMPRESS(PUT(Employee_SSN,best32.),,'kd');
			ELSE SSN = PUT(INPUT(Employee_SSN,best32.),z9.);

Your error message:

484: LINE and COLUMN cannot be determined.
NOTE: NOSPOOL is on. Rerunning with OPTION SPOOL might allow recovery of the LINE and COLUMN where the error has occurred.
NOTE 484-185: Format $BEST was not found or could not be loaded.

NOTE: Variable C is uninitialized.

1) Using INPUT function assumes that the input variable is char type, but you already know that it is numeric, therefore

you got the Format $BEST message.

2) Using OPTION SPOOL will help show you the line in log where the error occurs

3) Variable C is uninitialized as the code should be:

IF ("&vtype"="C") ...

ImSpartacus · Posted 01-16-2018 10:34 AM

Thanks for pointing out that the macro variable needed to be inside quotes or else the data step would think it's a variable in the dataset.

Holy cow, that error ended up about as embarrassing as I expected!

I greatly appreciate your assistance.

ballardw · Posted 01-16-2018 10:31 AM

A few critical questions:

1) Do you know if the files, or groups of files, have the same layout/content preferably documented as to layout and variable type (numeric or character), content (length and/or format) and description?

2) If yes to the first question, is there a rule or pattern identifying which files have which pattern?

3) are more files in the given format going to be added to the location for reading?

If yes to 1 and 2 then write a single data step to read one of the layouts. When you read one file successfully then change the infile and out data set name. Optionally append the new data to the old if your processes make that desirable. Repeat as needed. (at this point automating is possible using your DIRLIST pipe IF you select all files of a given layout.

Then as new data is added you should easily be able to identify new files for reading.

I really hesitate to write any code to guess if a numeric appearing value should be numeric or character. What if you guess wrong and later need to do arithmetic with a character variable? Or you assume something is numeric such as an account or part number that later challenges numeric storage precision?

ImSpartacus · Posted 01-16-2018 10:57 AM

Thanks for the tips.

Concerning the source datasets, they have dozens of fields and they do vary slightly.

However, the handful of fields that I actually need are mercifully consistent across all files except for one small exception: SSN.

On some files, SSN is stored with dashes (XXX-XX-XXXX) and on some, it's stored without dashes (XXXXXXXXX).

I assumed that might be a simple enough thing to overcome, but it turned out to be slightly more complex than I expected.

ballardw · Posted 01-16-2018 03:23 PM

@ImSpartacus wrote:
Thanks for the tips.

Concerning the source datasets, they have dozens of fields and they do vary slightly.

However, the handful of fields that I actually need are mercifully consistent across all files except for one small exception: SSN.

On some files, SSN is stored with dashes (XXX-XX-XXXX) and on some, it's stored without dashes (XXXXXXXXX).

I assumed that might be a simple enough thing to overcome, but it turned out to be slightly more complex than I expected.

If the only difference then a read program would read SSN as character using the longer version. To make all the data similar it would be easy to either remove the dashes (compress function) or substring and concatenate results with dashes (catx).

data junk;
   input ssn $ 1-11;
   if length(ssn)=11 then nodash=compress(ssn,'-');
   if length(ssn)=9 then adddash=catx('-',substr(ssn,1,3),substr(ssn,4,2),substr(ssn,6));
datalines;
123-45-6789
123456789  
;
run;

How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Logic

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Re: How to Dynamically Differentiate Between Numeric & Character Field in Base SAS Data Step Log

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away