I have 5 datasets in Excel files (5 different time points w/ the same variables, for the most part). Because of Excel’s formats, I have painstakingly converted variables in both directions - character to numeric and numeric to character - using arrays in numerous SAS programs throughout this project. It has been a huge effort.
In the final phase I am trying to stack the datasets using the code below. I received 83 errors stating that “Variable XXX has been defined as both character and numeric.”
I’m assuming that this means that in different datasets the variable has different data types. For example, VAR1 is character in one dataset and numeric in another. Most of the solutions in this forum describe converting the variables that are causing the errors.
Do I need to check each dataset to see the data type of each variable and then convert them all to the desired type? Should I do this in each dataset or is there a way I can do this globally?
Thanks for the help.
DATA a1_a5_final; SET a1_final a2_final a3_final a4_final a5_final; RUN;
If the variables causing the problems are not required, you could use keep= or drop= lists on each dataset to exclude the problem variables. You can also rename=() some variables so that both versions (character and numeric) of the variables are not named the same.
This is one of the side-effects of "working" (actually: wasting time) with Excel.
The sooner you get rid of Excel as a "data" source, the better.
Since your source is completely unreliable, and it is very hard (up to impossible) to set up globally correct rules for automatic type conversions, you will be stuck with eyeballing through the datasets and manually correcting your input data steps.
Saving Excel data in a sensible transfer format (.CSV!!!) will be the first step.
Well, the Real World might involve more categories that character and numeric. Real World people appreciate the flexibility of spreadsheets to communicate values such as "<0.01" or "0.01?" or "See Note 2" to fellow real world people.
If you need data from real world people, you can either try to pay them to fit your requirements or spend some time to adapt your data importing procedure.
Whilst I agree somewhat with what your saying here, there are many systems and process developed over the years to deal with data transfer. These include databases and file formats. In this day and age where there are free versions of these, well documented and implemented in many different environments, the question does arise, why is something like Excel still being so widely used? Sure it is flexible on the user end, but that really is where the benefits end. As an unstructured document format, its is not - without some work - compatible with data processing. The file format is proprietary, although they have made some steps towards and XML implementation, it is still a zip which can contain binary components, and is not a flat text file, hence not as portable or simple to understand or use further. Whilst its unlikely we will ever see tha back of Excel, its always good to point out its deficiencies, and to suggest alternatives, otherwise we end up with a situation where there is nothing but Excel, and I have seen people trying to use it as a code editor, a database etc. which is madness, I have even seen a game engine built in it - which does attest to its flexibility and also peoples inate ability to try to break things.
As @Kurt_Bremser has stated, save yourself a headache and ditch Excel. It is not a good data format. Save to CSV, then write a datastep import program where you can specify lengths/formats etc. and if they are all the same structure of data you can read them all in one step:
data want; infile "<your_path>\*.csv" dlm=","; length a $20 b 8; input a $ b; run;
I profundly agree with what has been said about using Excel as a database, but would like to answer the question without work around. So assuming you rely on both Excel database and SAS import, you could automatically unify the variable names and types to ensure correct setting.
proc datasets lib=work kill nolist;run;quit;
** Create test data;
data A1_FINAL;
a=1;
b='a1 ';
c=.;
output;
run;
data A2_FINAL;
a=1;
b='a2 ';
c=.;
output;
run;
data A3_FINAL;
a=1;
b='a3 ';
c=.;
output;
run;
data A4_FINAL;
a=1;
b='a4 ';
c=.;
output;
run;
data A5_FINAL;
a='a5';
b=5;
c=.;
output;
run;
%macro renameWorkDs(ds);
data &ds._mod;
set &ds.;
run;
%mend;
%renameWorkDs(A1_FINAL);
%renameWorkDs(A2_FINAL);
%renameWorkDs(A3_FINAL);
%renameWorkDs(A4_FINAL);
%renameWorkDs(A5_FINAL);
** select tables in which same variables have different lengths or types;
proc sql;
create table all_t1 as
select memname,name,type,case when type eq 'num' then length*2 else length end as length
from sashelp.vcolumn
where libname eq 'WORK' and memname in ('A1_FINAL', 'A2_FINAL', 'A3_FINAL', 'A4_FINAL', 'A5_FINAL')
group by name
having count(distinct type) gt 1 or count(distinct length) gt 1
order by name, length desc , type
;
quit;
** Prepare statement to unify length and type of these variables on all tables;
data all_t2;
length code $200;
set all_t1;
by name descending length type;
retain maxlength;
if first.name then maxlength=length;
if type eq 'num' then do;
code='attrib '||strip(name)||'_tmp length=$'||strip(put(maxlength,best32.))||'; '
||strip(name)||'_tmp=strip(put('||strip(name)||',best32.)); '
||'drop '||strip(name)||';';
end;
else do;
code='attrib '||strip(name)||'_tmp length=$'||strip(put(maxlength,best32.))||'; '
||strip(name)||'_tmp=put('||strip(name)||',$'||strip(put(maxlength,best32.))||'.); '
||'drop '||strip(name)||';';
end;
run;
proc sort data=all_t2;
by memname name;
run;
** Perform unification;
data _NULL_;
set all_t2 end=last;
by memname name;
if first.memname then call execute('data '||strip(memname)||'_mod; set '||strip(memname)||'_mod; ');
call execute(strip(code));
if last then call execute('run;');
run;
DATA a1_a5_final;
SET
a1_final_mod
a2_final_mod
a3_final_mod
a4_final_mod
a5_final_mod;
RUN;
** clean work, rename variables to their original names;
data _NULL_;
set sashelp.vcolumn end=last;
where libname eq 'WORK' and memname eq 'A1_A5_FINAL' and substr(name,length(name)-3) eq '_tmp';
if _N_ eq 1 then call execute('proc datasets lib=work;
delete all_t1 all_t2 a1_final_mod a2_final_mod a3_final_mod a4_final_mod a5_final_mod
;
modify A1_A5_FINAL;
');
call execute(' rename '||strip(name)||'='||substr(name,1,length(name)-4)||';');
if last then call execute('run;quit;');
run;
- Cheers -
Erm, that only sets together for differeing lengths does it not? That can be achieved with an SQL trick, union all will automatically take the longest length:
proc sql; create table WANT as select A,B,C from A1_FINAL union all select A,B,C from A2_FINAL union all select A,B,C from A3_FINAL; quit;
The original question was different types, ie. sometimes A would be character, sometimes numeric. You would have to have set rules to convert from one to the other - character would be safest as it can handle all characters, but should the column be read into character as date, number, formatted etc. Type change is not simple.
Nope, it handles different types as well
- Cheers -
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.