Hello
I am building a new logistic regression model.
Part of the analysis is to calculate IV (information value) for the independent variables of the new model.
I wrote a code that calculate IV .
The problem is that Instead of see the categories formats names I see the categories values (1,2,3, and so on).
I found a solution by writing IF statements to apply formats.
However, this solution is not efficient because in real life I want to test multiple models and each time to check IV's only for the variables of the model that I check.
My question:
Is there a way to see formatted values of categories that is done automatically ?
The target is to take the formats for varaibles X1,X2,X3,X4,X5 from source data set ttt
PROC FORMAT ;
VALUE IVfmt
low-<0.02='(d) Very Low'
0.02 <- 0.1 ='(c) Low'
0.1 <- 0.3 ='(b) Medium'
0.3 <- HIGH ='(a) High'
;
RUN;
proc format;
value f1ff
1='a1'
2='b1'
3='c1'
4='d1'
;
value f2ff
1='a2'
2='b2'
3='c2'
4='d2'
;
value f3ff
1='a3'
2='b3'
3='c3'
;
value f4ff
1='a4'
2='b4'
3='c4'
;
value f5ff
1='a5'
2='b5'
3='c5'
4='d5'
;
Run;
Data ttt;
Retain Id X1 X2 X3 X4 X5;
Format X1 f1ff. X2 f2ff. X3 f3ff. X4 f4ff. X5 f5ff.;
Input ID x1 x2 x3 x4 x5 Ind_Fail;
label
ID='Customer ID'
X1='Team'
X2='Indicator big/small'
X3='Light strength'
X4='Location'
X5='Country'
;
cards;
6 4 4 3 2 4 1
1 2 1 3 1 2 0
4 4 3 3 2 1 0
2 3 2 2 1 2 0
3 1 2 1 2 3 1
5 2 3 2 3 2 0
7 3 2 1 2 1 1
;
Run;
%let V=x1 x2 x3 x4 x5;
proc sort data=ttt;
by ID;
run;
proc transpose data=ttt out=Long_Raw_
(rename=(_name_=field _label_=Fieldlabel col1=category));
by ID;
var &V.;
run;
data Long_Raw;
set Long_Raw_;
category=left(category);
run;
PROC SQL;
create table LONG_RAW_b as
select a.*,b.Ind_Fail
from LONG_RAW as a
left join ttt as b
on a.ID=b.ID
;
QUIT;
PROC SQL;
create table Summary_IV1 as
select field,Fieldlabel, category,
count(*) as nr_laks,
sum(Ind_Fail ) as nr_laks_bad,
sum(calculated nr_laks,-calculated nr_laks_bad) as nr_laks_good,
calculated nr_laks_bad/calculated nr_laks as pct_laks_bad format=percent8.2
from LONG_RAW_b
group by field,Fieldlabel, category
order by field,Fieldlabel,category
;
QUIT;
PROC SQL;
create table Summary_IV2 as
select field,
sum(Ind_Fail ) as Total_laks_bad,
sum(case when Ind_Fail =0 then 1 else 0 end) as Total_laks_good
from LONG_RAW_b
group by field
;
QUIT;
PROC SQL;
create table Summary_IV3 as
select a.*,
a.nr_laks_bad/b.Total_laks_bad as Tamhil_bad,
a.nr_laks_good/b.Total_laks_good as Tamhil_good,
log(calculated Tamhil_good/calculated Tamhil_bad) as WOE ,
calculated WOE*(calculated Tamhil_good-calculated Tamhil_bad) as IV
from Summary_IV1 as a
left join Summary_IV2 as b
on a.field=b.field
;
QUIT;
PROC SQL;
create table Summary_IV4 as
select field,' TOTAL' as Fieldlabel,
sum(nr_laks) as nr_laks,
sum(nr_laks_bad) as nr_laks_bad,
sum(nr_laks_good) as nr_laks_good,
calculated nr_laks_bad/calculated nr_laks as pct_laks_bad format=percent8.2,
sum(Tamhil_bad) as Tamhil_bad,
sum(Tamhil_good) as Tamhil_good,
sum(IV) as IV,
put(calculated IV,IVfmt.) as IV_desc
from Summary_IV3
group by field
;
QUIT;
proc sort data=Summary_IV3;by field;run;
proc sort data=Summary_IV4;by field;run;
Data Final_IV;
Retain category category_new;
length category_new $50.;
set Summary_IV3 Summary_IV4;
by field;
format nr_laks nr_laks_bad comma21. Tamhil_bad Tamhil_good WOE IV comma8.2;
/**My question is about the following code to apply format to category var**/
/**IS there a way to apply formats automatically from the source data set ttt**/
IF field='X1' then category_new=put(category,f1ff.);
else IF field='X2' then category_new=put(category,f2ff.);
else IF field='X3' then category_new=put(category,f3ff.);
else IF field='X4' then category_new=put(category,f4ff.);
else IF field='X5' then category_new=put(category,f5ff.);
Run;
proc sort data=Final_IV;
by field category;
Run;
title 'Information value for varaibles in this model';
proc print data=Final_IV noobs;
run;
This is an awful lot of code, and then you don't specify exactly where in the code you are having problems.
I also don't really grasp the meaning of your statements, could you please give an example of what you are seeing and the desired solution for these 3 statements:
select field format=f1ff. ,Fieldlabel, category,
You could add a format after a variable.
Yes, but I run a dynamic code.....Each time I run a different list of variables and for these varaibles I want to calculate IV.
In the methos that you show I need to manually type the format for each varaible....It is not efficient for me.
select field format=f1ff. ,Fieldlabel, category,
I will make my question more specific:
In proc transpose code I loose the format of the categories values.(Because I get one column....)
What is the way to trasfer the values into formatted values in column Category?
Then in the Inforation value output I can see the meaning of each value of any category
proc transpose data=ttt out=Long_Raw_
(rename=(_name_=field _label_=Fieldlabel col1=category));
by ID;
var &V.;
run;
Formats apply to the whole variable. So if you are transposing the values so that the resulting numeric column has values from different original variables that used different formats you will need to first store the formatted value into a character variable.
select
field as raw_value
, put(field,f1ff.) as formatted_value
, Fieldlabel
, category
, ...
@Ronein wrote:
The raw data is external file with 500 varaibles (numeric) and each varaible has a format.
You suggest me to tranform each numeric value into its formatted value (char).
It is absolutley good solution but how can I do it for 500 varaibles in a clever way???
This is the question
That is a a very simple thing.
data want ;
set have ;
array var var1-var500 ;
length varname $32 value 8 decode $200 ;
do index=1 to dim(var);
varname=vname(var[index]);
value=var[index];
decode=vvaluex(varname);
output;
end;
keep id varname value decode ;
run;
Just replace VAR1-VAR500 in the ARRAY statement with your list of variable names. Replace ID in the KEEP statement with the list of variables that uniquely identify an observation in the original HAVE dataset.
Any where you expect a formatted value to appear you need to make sure that the variable with the value is associated with the format.
Since the only place I see a format associated with variables is in data TTT only procedures that directly use TTT can see the formats. Most of the work you are doing is with a transposed data set, or results of using that transposed data. The Transposed data does not have the formats because the VAR variables are now combined in columns and the original formats can't be applied.
If you expect one assignment, such as in TTT to maintain the formats then you need to watch which variables you keep where and use so that the formatted values are available.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.