Hi all, it's my first time posting here.
I have 8 datasets each with 800+ variables that I am trying to stack. I am getting "ERROR: Variable X has been defined as both character and numeric."
I know that this error comes from some of the datasets defining this variable as character and some as numeric.
However, with such a large number of variables (the error occurs on at least 400 of them) it's impractical to switch each variable using put/input.
I am wondering if anyone has a solution that fixes a large number of these errors relatively quickly.
Thank you!!
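For context, the PUT/INPUT switch referred to above looks like this for a single variable (a sketch; `have` and `x` are hypothetical names), which is why repeating it for 400+ variables is impractical by hand:

```sas
/* Convert one character variable to numeric via rename + INPUT. */
data fixed;
    set have(rename=(x=x_char));   /* x arrives as character */
    x = input(x_char, best32.);    /* recreate x as numeric */
    drop x_char;
run;
```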
Proc Import will create data types (numeric or character) and lengths based on what's in your external file. Proc Import first analyses your data to determine the appropriate attributes. Make sure you use GUESSINGROWS=MAX so Proc Import analyses all the data.
But even if doing so: If in your first .csv the maximum string length found for a column is 5 but in your next .csv it's 10 then you'll still have a truncation issue when combining the tables created by Proc Import.
Assuming all your .csv have the same structure (same column order) you could also first combine all the external data and then read it with a single Proc Import.
That's what the code below does.
/* create sample .csv */
%let path_to_csv=%sysfunc(pathname(work));
data _null_;
file "&path_to_csv/have1.csv";
put 'col1,col2,col3,col4,col5';
put 'a,bbb,5, 01jan2020,1234';
file "&path_to_csv/have2.csv";
put 'col1,col2,col3,col4,col5';
put 'xx,9,yy,01feb2020,5678';
stop;
run;
/* combine sample .csv into a single temporary file
and store this temporary file in SAS Work
*/
filename alldata temp;
data _null_;
file alldata lrecl=4000;
infile datalines truncover;
input fil2read $300.;
fil2read=cats("&path_to_csv/",fil2read);
if _n_= 1 then
infile dummy1 filevar=fil2read lrecl=4000 end=done;
else
infile dummy2 filevar=fil2read end=done lrecl=4000 firstobs=2;
do while(not done);
input;
put _infile_;
end;
datalines;
have1.csv
have2.csv
;
/* read temporary file with all the data using Proc Import */
proc import
out=want
datafile=alldata
dbms=csv
replace
;
guessingrows=max;
delimiter=',';
getnames=yes;
run;
/* delete the temporary file in Work */
filename alldata clear;
/* print result */
proc print data=want;
run;
N.B.: Make sure you set the value for LRECL to a value that's sufficient for your longest line of source data.
Thanks for your reply! I read them all using proc import as csv. Usually when I have similar issues with other data, importing it as csv resolves it.
You shouldn't use proc import for your CSVs; use a data step instead. If you have one file that reads with the correct types and lengths, use that code to read the rest of the files. You can also try adding the GUESSINGROWS=MAX option to the PROC IMPORT. It will slow things down a lot but is more likely to be accurate.
You get the code from the PROC IMPORT log, find the version that works best, adjust as needed, and use that code to read them all in a single data step at once. If all your text files are in a single folder you can use a wildcard; if they are not, you can use the FILEVAR= option to feed the file names in. If you check the library on here, there's an example of how to read all text files in a single data step.
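A minimal sketch of the FILEVAR= approach described above (the paths, fileref, and column attributes are hypothetical; replace the INPUT statement with the cleaned-up code from your PROC IMPORT log):

```sas
/* Read several same-layout CSVs in one data step using FILEVAR=. */
data want;
    infile datalines truncover;
    input fname $300.;                   /* path of the next CSV to read */
    /* switch to that file; FIRSTOBS=2 skips each file's header row */
    infile csvin filevar=fname firstobs=2 dsd truncover end=done;
    do while(not done);
        /* hypothetical layout - substitute your own INPUT statement */
        input col1 :$10. col2 :$20. col3 :best32.;
        output;
    end;
    drop fname;
    datalines;
/path/to/have1.csv
/path/to/have2.csv
;
```

Because every file is read by the same INPUT statement, all variables get consistent types and lengths, which avoids the character/numeric conflict when stacking.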
@Betsy wrote:
Thanks for your reply! I read them all using proc import as csv. Usually when I have similar issues with other data, importing it as csv resolves it.
Did you use a large value for the GUESSINGROWS option? Or the option at all?
Typically Proc Import only examines 20 rows to set variable types. For text files you have the option of setting a larger number with the Guessingrows option. It may run a little slower but you should see more consistent results.
OR
If all of the files have the same layout, import one with the GUESSINGROWS=MAX (or 32000) option. The data step used to read the data will be in the log. Copy the data step and paste it into the editor. Clean up the code by removing line numbers and such. Then change the INFILE file name and the output data set name to read each file the same way.
This approach avoids the otherwise near-certain warnings that variable lengths differ and data may be truncated, which you would almost certainly see with multiple imports.
I usually examine the INFORMAT statements generated by Proc Import to see if I think the lengths and types are appropriate.
Warning: if you see a variable with a $1. informat, it is likely that the variable had no values in the file, and you may need to check documentation or the other files for example data to set a more usable informat for when the data does appear. This is typical for sets with seldom recorded values such as "Income from major lottery win" or characteristics of the 4th or later infant delivered in a single birth (I would say this is poor data structure, but sometimes we don't get to pick our sources).
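For illustration, the log from PROC IMPORT contains a data step along these lines (all names and the path are hypothetical); the $1. informat on a sparsely populated column is the kind of thing to widen by hand before reusing the code on the other files:

```sas
/* Sketch of a PROC IMPORT-generated data step, edited by hand. */
data want;
    infile "/path/to/have1.csv" delimiter=',' missover dsd firstobs=2;
    informat id best32. ;
    informat name $20. ;
    informat lottery_win $40. ;   /* was $1. in the generated code */
    format id best12. ;
    format name $20. ;
    format lottery_win $40. ;
    input id name $ lottery_win $;
run;
```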
DO NOT USE PROC IMPORT!
For CSV files, use a data step; only then do you have full control over the process, including all variable attributes.
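A minimal sketch of such a data step, assuming a hypothetical five-column layout and path; the LENGTH and INFORMAT statements are where you pin down every attribute explicitly:

```sas
/* Read a CSV with a data step for full control over attributes. */
data want;
    infile "/path/to/have1.csv" dsd truncover firstobs=2;
    length col1 $10 col2 $20 col3 8 col4 8 col5 8;  /* fix types/lengths */
    informat col4 date9.;                           /* read col4 as a date */
    format   col4 date9.;
    input col1 $ col2 $ col3 col4 col5;
run;
```

Running the same data step over every file guarantees identical types and lengths, so the stacked set can never hit the "defined as both character and numeric" error.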