BookmarkSubscribeRSS Feed

PROC IMPORT uses n GUESSINGROWS ( default 20, can be set to whatever ) from the DATAFILE to determine if the OUT dataset columns should be numeric or character.  Many numeric account numbers ( e.g. credit cards ) are 16 digits in length which exceed the max value for an 8 byte numeric column of 9,007,199,254,740,992.  As a result the large numbers are not imported accurately.

Since the DATEFILE is being parsed to GUESSINGROWS depth in any case, the max value could be known from the rows examined.  If the max value is > 9,007,199,254,740,992 PROC IMPORT could set the data type in the column to character to ensure all digits are retained.

5 Comments
Harry
Obsidian | Level 7

Currently the user is forced to examine the max values in the numeric columns and generate the DATA STEP import code from the SAS metadata, switching the columns with max values larger than the limit to character.  Code I used is below:

                                                                proc import datafile = "&dir_fn"

                                                                                out = csv_to_sas                              replace

                                                                                dbms = dlm;

                                                                                guessingrows=max;

                                                                                delimiter = "&dlm";

                                                                run;

                                                               

                                                                /*  Examine all numeric columns and INPUT them as CHAR if necessary  */

                                                               

                                                                proc contents data = csv_to_sas

                                                                                                                   out = c                               noprint;

                                                                run;

                                                               

                                                                proc sql noprint;

                                                                                select catx(' ','max(',name,') as', name,'format=best32.')

                                                                                  into :maxes

                                                                                 separated by ','

                                                                                  from c

                                                                                 where type = 1

                                                                                 ;

                                                                                 

                                                                                select name, type

                                                                                  into :n1-, :t1-

                                                                                  from c

                                                                                 order by varnum

                                                                                ;

                                                                                 

                                                                                %let col_cnt = &sqlobs;

                                                                quit;

                                                               

                                                                proc sql;

                                                                                create table maxes as

                                                                                                select &maxes

                                                                                                  from csv_to_sas

                                                                                                ;

                                                                quit;

                                                               

                                                                proc datasets lib = work nolist;

                                                                                delete char;

                                                                quit;

                                                               

                                                                data char;

                                                                                length name $32.;

                                                                                set maxes;

                                                                                array nums _numeric_;

                                                                               

                                                                                do over nums;

                                                                                                if nums > input('9,007,199,254,740,992', comma32.) then do;

                                                                                                                name = vname(nums);

                                                                                                                output;

                                                                                                end;

                                                                                end;

                                                                run;

                                                               

                                                                %let data_step = N;

                                                                %if %sysfunc(exist(char)) %then %do;

                                                                                %if %attrn(char,nobs) > 0 %then %do;

                                                                                                %let data_step = Y;

                                                                                                %put Reimporting via data step: &dir_fn;

                                                                                %end;

                                                                %end;

                                                               

                                                                %if &data_step = Y %then %do;

 

                                                                                proc sort data = c;                            by name;             run;

                                                                                proc sort data = char;     by name;             run;

 

                                                                                data metadata;

                                                                                                merge c                                ( in = c )

                                                                                                                  char  ( in = chars );

                                                                                                   by name;

                                                                                                if chars then do;

                                                                                                                informat = '$19.';

                                                                                                                format  = informat;

                                                                                                end; else do;

                                                                                                                if informat > ' ' then do;

                                                                                                                                informat = cats(informat,put(informl,4.-l),'.',put(informd,4.-l));

                                                                                                                                format = informat;

                                                                                                                end;

                                                                                                end;

                                                                                run;

                                                                               

                                                                                /*  Write the datastep  */

                                                                               

                                                                                %let path             = %sysfunc(pathname(work));

 

                                                                                %put WORK DIR IS: &path;

 

                                                                                proc sort data = metadata;

                                                                                                by varnum;

                                                                                run;

                                                                               

                                                                                data data_step;

                                                                                                length line $256.;

 

                                                                                                file "&path/data_step.txt";

                                                                                               

                                                                                                line = 'data WORK.CSV_TO_SAS    ;';                                                                                                                                                                                                         output;                 put line;

                                                                                                line = '%let _EFIERR_ = 0; /* set the ERROR detection macro variable */';                                         output;                 put line;

                                                                                                line = "infile '&dir_fn' delimiter = '&dlm' MISSOVER DSD lrecl=32767 firstobs=2;";       output;                 put line;

                                                                                               

                                                                                                do until(eof);

                                                                                                               

                                                                                                                set metadata end = eof;

                                                                                                               

                                                                                                                if informat > ' ' then do;

                                                                                                                                line = catx(' ','informat',name, informat,';');

                                                                                                                                output;

                                                                                                                                put line;

                                                                                                                end;

                                                                                                                if format > ' ' then do;

                                                                                                                                line = catx(' ','format',name, format,';');

                                                                                                                                output;

                                                                                                                                put line;

                                                                                                                end;

                                                                                                end;

                                                                                               

                                                                                                line = 'input';                                     

                                                                                                output;                

                                                                                                put line;

                                                                                               

                                                                                                %do z = 1 %to &col_cnt;

                                                                                                                line = "&&n&z";

                                                                                                                %if &&t&z = 2 %then %do;

                                                                                                                                line = trim(line) || " $";

                                                                                                                %end;

                                                                                                               

                                                                                                                output;

                                                                                                                put line;

                                                                                                %end;

                                                                                               

                                                                                                line = ';';                               output; put line;

 

                                                                                                line = "if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection macro variable */";     output; put line;

                                                                                                line = "if _ERROR_ then call symputx('_EFIERR_',1);";                                                                                                                                                                       output; put line;

 

                                                                                                line = 'run;';        output; put line;

 

                                                                                                stop;

                                                                                run;

                                                                               

                                                                                /*  Run the data step import  */

                                                                                %include "&path/data_step.txt" /source2;

                                                               

                                                                %end;

ballardw
Super User

Account "numbers" may be many digits. But what about when you have something like part numbers that may only be 5 or 6 digits(with leading 0)? And your data set has other numeric values, like price that exceed those digits?

 

I personally think that if this is important than it is too important to leave to Proc Import because of all the other issues about different similarly structured data sets ending up with different properties of variable length and type.

Harry
Obsidian | Level 7

We're dealing with thousands of files and needed a programmatic method to write the data step code.

Tom
Super User
Super User

Use a different tool if you don't like how PROC IMPORT works.

For example you could use %CSV2DS() 

Which will guess to make integer values with a maximum length longer than 15 as character strings.

          when (missing(b.nonmiss)) then 'empty'
          when (b.maxlength in (1:15) and b.nonmiss=b.integer) then 'integer'
          when (b.nonmiss=b.integer) then 'character'
         ...

Plus it allows you to override its guesses. 

,overrides=   /* Name of SAS datasets with metadata overides (optional) */
ballardw
Super User

Just how similar are these "thousands of files"?

If there multiple files of a similar structure then use Proc import one time on one of them. Then SAVE the code from that run and modify as needed to be acceptable for other files of that structure. Changes would be in the Informat (if using the simplest change to the generated code).

Make sure character values are such and long enough to hold expected values.

Once you have one program that read a set of files correctly then you change the Infile and data set name. 

Repeat for the next file structure.

 

Or even modify the code to read multiple files into one data set if that make sense.

 

One would really hope that with "thousands of files" there is some documentation as to what the content should be with details like variable type and length to make sure you can modify the programs as needed.

 

Do similar column names in different files have the same properties even if the overall structure of the file is a bit different? In that case it might be worth writing a data set with the constraints and using that to write code to read files. I did that once upon a time where I was maintaining code that read telemetry from 70-odd stations where the contents of the files changed as instruments were tested. Generated something around 250K lines of code to read stuff with bits to read the time elements to implement the changes and code to detect the changes so the database could be updated.