Hi!
I have a dataset with hundreds (on the order of 400-500) of indicator variables (values 1,2, or 3) that I imported from Excel (.xlsx) and they have imported as character values. I need them to be numeric. If relevant, the data were transposed after importing (initial data look like mixed data type - see below for an example). They are not all clustered together, so I cannot use something like array cha(*) var1 -- var450 ...
Is there any way to convert hundreds of "truly numeric" variables that are masquerading as character variables, and are scattered among legitimate character variables, in any way that avoids me having to list all of them or convert them one-by-one?
Original data imported:
City1 City2 City3 City4 City5
Var1 words words words words words
Var2 1 2 1 2 2
Var3 2 2 1 1 1
Var4 words words words words words
Var5 words words words words words
After transposing:
Var1 Var2 Var3 Var4 Var5
City1 words 1 2 words words
City2 words 2 2 words words
City3 words 1 1 words words
City4 words 2 1 words words
City5 words 2 1 words words
I'd love some suggestions on how to fix this! I know how to do this easily in R, perl, bash, and other programming languages, but at the moment I am constrained to using SAS.
I think that the following will do what you want:
data have;
input (City Var1 x y z Var5) ($);
cards;
City1 words 1 2 words words
City2 2 2 2 words words
City3 words 1 . words words
City4 words 2 1 words words
City5 words 2 1 words words
;
proc format;
value $allnum
other=[anyalpha()];
run;
ods output onewayfreqs=work.owf;
proc freq data=have;
format var1--var5 $allnum.;
tables var1--var5/ missing;
run;
ods listing;
data owf;
length var $32 fvalue $50 ;
set owf;
by table notsorted;
if first.table and last.table;
var=scan(table,-1);
fvalue=vvaluex('F_'||var);
if fvalue eq 0;
keep var;
run;
proc sql noprint;
select catt(var,'=_',var),
catt(var,'=input(_',var,',8.);')
into :rens separated by ' ',
:vars separated by ' '
from owf
;
quit;
data want (drop=_:);
set have (rename=(&rens.));
&vars.;
run;
Art, CEO, AnalystFinder.com
Can you use this notation which would select all character variables between first and last variable?
array to_convert(*) first_var -character- last_var;
@sovrappensiero wrote:
Hi!
I have a dataset with hundreds (on the order of 400-500) of indicator variables (values 1,2, or 3) that I imported from Excel (.xlsx) and they have imported as character values. I need them to be numeric. If relevant, the data were transposed after importing (initial data look like mixed data type - see below for an example). They are not all clustered together, so I cannot use something like array cha(*) var1 -- var450 ...
Is there any way to convert hundreds of "truly numeric" variables that are masquerading as character variables, and are scattered among legitimate character variables, in any way that avoids me having to list all of them or convert them one-by-one?
Original data imported:
City1 City2 City3 City4 City5
Var1 words words words words words
Var2 1 2 1 2 2
Var3 2 2 1 1 1
Var4 words words words words words
Var5 words words words words words
After transposing:
Var1 Var2 Var3 Var4 Var5
City1 words 1 2 words words
City2 words 2 2 words words
City3 words 1 1 words words
City4 words 2 1 words words
City5 words 2 1 words words
I'd love some suggestions on how to fix this! I know how to do this easily in R, perl, bash, and other programming languages, but at the moment I am constrained to using SAS.
@sovrappensiero wrote:
Hi!
I have a dataset with hundreds (on the order of 400-500) of indicator variables (values 1,2, or 3) that I imported from Excel (.xlsx) and they have imported as character values. I need them to be numeric. If relevant, the data were transposed after importing (initial data look like mixed data type - see below for an example). They are not all clustered together, so I cannot use something like array cha(*) var1 -- var450 ...
Easiest may be to go back to the IMPORT stage.
First save the Excel file to a CSV file. (Reason: there are options available for delimited files not available for Excel).
Import the CSV.
If using Proc Import code then add the option GUESSINGROWS=MAX. If using a wizard find the options and set the "number of rows to guess" to a very large number, ideally the number of rows in the source file. Also indicate which row of data contains the first actual data row. A common cause of problems when there is one or more blank header rows. With Excel you can't tell SAS to ignore them. With a CSV file you can indicate the first row of data with the wizard or the DATAROW statement in Proc Import code.
Be default the import procedure guesses as to variable type using a very few rows of data. If the full range of values are not encountered in the first rows then the procedure guesses wrong.
Ok, so if you know how to do this in R, how would you identify which variables in R would need to be converted?
Unfortunately, given the data structure - in rows rather than columns I don't think simply converting the file to CSV would help.
Except if you take the current data, export it to CSV and then read it back in as numeric. But you'd have to specify the character types anyways and if you're going to do that anyways, you may as well clean it up in SAS.
I would be happy to be wrong though 🙂
@sovrappensiero wrote:
Thank you I will try to import it this way!
@sovrappensiero wrote:
Good question. Firstly, I think they would automatically import as numeric in R. If not, I could re-arrange the columns so that all the numeric ones are together(a real pain, but I searched for how to do this in SAS and I couldn’t find a way although by that time I was a little tired and hungry, LOL). Then I could use an apply function to convert all the variables from column 1 to whatever using the as.numeric() function. I ran into a problem with the array in SAS for two reasons:
Assuming the first data set is what you had in Excel, I don't think it would in R if your data is in mixed types as shown. Even R forces types on columns not along rows.
@sovrappensiero wrote:
1. When SAS cannot concert a legitimate character variable to numeric, instead of ignoring it and throwing a warning it converts it to missing. So I cannot use _ALL_ and apply a conversion function hoping that it would “smart convert” only things that look like numbers go numeric format.
You can test this using ANYDIGIT or ANYALPHA and then applying the conversion conditionally.
@sovrappensiero wrote:
2. Because my variables are scattered, there is no easy way to rename them. SAS requires you that, for example newvar = input(oldvar, best8.). In R you can change variables on the fly.
Unfortunately, this is definitely an issue that SAS needs to resolve. One way is to add a prefix to all automatically using the SASHELP.VCOLUMN table. Then you can keep track of which ones you need to convert and rename after the fact or delete all variables that are all missing. This is definitely not an easy, beginner level process though.
If you're doing this one time, I would probably copy and transpose the data in Excel and then re-import into SAS using GUESSINGROWS=MAX as suggested earlier.
If this needs to be repeated, that's obviously not a good solution!
@sovrappensiero wrote:
Hi!
I have a dataset with hundreds (on the order of 400-500) of indicator variables (values 1,2, or 3) that I imported from Excel (.xlsx) and they have imported as character values. I need them to be numeric. If relevant, the data were transposed after importing (initial data look like mixed data type - see below for an example). They are not all clustered together, so I cannot use something like array cha(*) var1 -- var450 ...
Is there any way to convert hundreds of "truly numeric" variables that are masquerading as character variables, and are scattered among legitimate character variables, in any way that avoids me having to list all of them or convert them one-by-one?
Original data imported:
City1 City2 City3 City4 City5
Var1 words words words words words
Var2 1 2 1 2 2
Var3 2 2 1 1 1
Var4 words words words words words
Var5 words words words words words
After transposing:
Var1 Var2 Var3 Var4 Var5
City1 words 1 2 words words
City2 words 2 2 words words
City3 words 1 1 words words
City4 words 2 1 words words
City5 words 2 1 words words
I'd love some suggestions on how to fix this! I know how to do this easily in R, perl, bash, and other programming languages, but at the moment I am constrained to using SAS.
Since the data structure wasn't included the previous comments I made won't work.
However, have you ever used the EXCEL Copy>Paste Special>Transpose feature? Highlight all the data, copy, go to a new sheet or work book and select the Paste Special and Transpose. Save the result as CSV and import.
One hopes that you do not have lots of files of this type.
Something like this may work (using the test data supplied by @Ksharp😞
data test;
input (varr City1 City2 City3 City4 City5) ($);
cards;
Var1 words words words words words
Var2 1 2 1 . 2
Var3 2 2 1 1 1
Var4 words words words words words
Var5 words words words words words
;run;
First, transpose:
proc transpose data=test out=trans;
var city:;
id varr;
run;
Then create a temporary program file to convert the numeric variables:
filename tempsas temp;
data _null_;
set test;
file tempsas;
array test(*) City:;
do _N_=1 to dim(test);
if input(test(_N_),?? best32.)=. and test(_N_) not in('.',' ') then
delete;
end;
put
'_' varr '=input(' varr ',best32.);' /
'drop ' varr ';' /
'rename _' varr '=' varr ';'
;
run;
Then use that to convert the numeric variables:
data want;
set trans;
%include tempsas/source2;
run;
One possibility is to just dump your current (converted) dataset to a delimited file and then let PROC IMPORT figure out which variables are numeric.
filename csv temp;
proc export data=have outfile=csv dbms=csv ; run;
proc import datafile=csv out=want dbms=csv; run;
Otherwise If your existing conversion process was a classic double transpose then you could try analyzing the middle (tall skinny) version of the data to find out the names of the variables whose set of values only include '1','2' and '3' and use that list in your char to num conversion step.
I think that the following will do what you want:
data have;
input (City Var1 x y z Var5) ($);
cards;
City1 words 1 2 words words
City2 2 2 2 words words
City3 words 1 . words words
City4 words 2 1 words words
City5 words 2 1 words words
;
proc format;
value $allnum
other=[anyalpha()];
run;
ods output onewayfreqs=work.owf;
proc freq data=have;
format var1--var5 $allnum.;
tables var1--var5/ missing;
run;
ods listing;
data owf;
length var $32 fvalue $50 ;
set owf;
by table notsorted;
if first.table and last.table;
var=scan(table,-1);
fvalue=vvaluex('F_'||var);
if fvalue eq 0;
keep var;
run;
proc sql noprint;
select catt(var,'=_',var),
catt(var,'=input(_',var,',8.);')
into :rens separated by ' ',
:vars separated by ' '
from owf
;
quit;
data want (drop=_:);
set have (rename=(&rens.));
&vars.;
run;
Art, CEO, AnalystFinder.com
@art297, I am working through your solution right now. It's so elegant and as I figure out what the SAS procedures are I keep thinking, "Brilliant!" Thank you so much for this insight. The one part that I do not understand is the SQL part...I can't figure out what it is doing. I took an intro SQL class so I know that it's selecting a subset from work.owf and putting them into something...but I'm a little lost!
Any explanation you can provide would surely help! Thanks again.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.