Hoping this is even possible.
Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.
For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.
Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.
Any help hugely appreciated.
If you've got SAS/ETS licensed then look into Proc Surveyselect - Stratified Sampling
/*
If these two fields are numeric type,
You could try branks() function in SAS/IML.
*/
data have;
set sashelp.heart(obs=100);
keep weight height;
run;
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
rank=branks(x)[,3];
create rank var {rank};
append ;
close;
quit;
data temp;
merge have rank;
run;
proc rank data=temp out=want groups=20;
var rank;
ranks group;
run;
proc sort data=want;by group;run;
data want;
set want;
flag=ifn(group=0,0,1);
run;
proc freq data=want;
table flag/missing;
run;
This is just so sadly wrong that it is almost funny.
Without any other restriction you'll ALWAYS find 5% of records matching var1<x AND var2<y
Let's say 5% of records are 21 observations
order your dataset by var2
Set var1 to max of var1
count 21 observations and
Tadaaa you've got the result where var2 = value of var2 at obs 21+1
what bothers me is that you're trying to find a subgroup that reaches a significance level and that's just so wrong that it's difficult to know where to begin
Take sashelp.cars
one of the conditions you're looking for can be found at
where msrp LT $192,465 AND invoice LT$12,830
you get 21 obs that represent 5% of the 428 records in sashelp.cars
and that's just one of the results...
- Cheers -
Hello @DaveKerr and welcome to the SAS Support Communities!
In general, with two numeric variables (e.g., age and income) you will get a convex curve of eligible pairs of values (see PROC SGPLOT step in the code example below). So you'll need to define some sort of optimality criterion to determine the "optimal" point of the curve, e.g., minimize an objective function. Unless you have advanced tools (such as SAS/OR) available, you may need to categorize one of the two variables. Then you could obtain the relevant cutoff value for the other variable in each of the cumulative categories. In the case of age (assuming integer values, say, 18, 19, 20, ...), using the individual age values (as cutoffs) should be feasible. So you would determine the cutoff value for income in each of the groups age=18, age<=19, age<=20 and so on, i.e., the income X(age) such that the subset "income<=X(age)" in the respective group just comprises 5% of the entire dataset (which might exclude a few young age groups due to their size). Finally, apply the optimality criterion to those pairs of age and income cutoff values.
Below is an example. For simplicity I assume non-missing values for the relevant numeric variables (AGE and INCOME). In the presence of missing values you would need to exclude affected observations (since otherwise a condition such as age<=20 would also select observations with a missing age value).
/* Create sample data for demonstration */
data have;
call streaminit(3141592);
do i=1 to 100000;
age=rand('integer',18,65);
income=round(rand('lognormal',9.117+0.0531*age,age/240),0.01);
output;
end;
run;
/* Store 5% of the number of obs. in HAVE in macro variable N5P */
data _null_;
call symputx('n5p',ceil(0.05*n));
stop;
set have nobs=n;
run;
%put &=n5p;
/* Create view with income values of cumulative age groups */
proc sql;
create view vtemp as
select a.age, b.income
from (select distinct age from have a) left join have b
on b.age<=a.age
order by age, income;
quit;
/* Determine income cutoff value for each group (if &N5P obs. are available)
and compute objective function */
data opt;
do _n_=1 by 1 until(last.age);
set vtemp;
by age;
if _n_=&n5p then do;
c=log(income)**2+(age/20)**2; /* define your own objective function */
output;
end;
end;
run;
/* Optional: Visualize relationship between cutoff values */
proc sgplot data=opt;
series x=age y=income;
run;
/* Determine the combination of cutoff values minimizing the objective function */
proc summary data=opt;
var c;
output out=want(drop=_:) minid(c(age income))=opt_age opt_income;
run;
proc print data=want;
run;
/* Optional: Perform the selection */
data sel(drop=opt_:);
if _n_=1 then set want;
set have;
if age<=opt_age & income<=opt_income;
run;
Actually very simple.
You should be able to run the following code as SASHELP.Class should be in your install as a training data set.
proc freq data=sashelp.class; tables age*sex/list; run;
The list option makes a one row per combination of the variables. The table resulting will have a cumulative percent. Scroll down the cumulative percent column until the percent shown is at least as large as you want. Then select all the values of the two variables as your criteria. With this set that would be Age=11 and Sex=F as this set is small but that combination is 5.26% of the data. This approach does have a bias on sort order of the values but since I'm not really understanding the need and you said " any combination of two fields" this fills the requirement easily. Also easy to extend and if you need "at least 5% in one group and 15% in another group" you could just go down the list. Age=11 and Sex=F plus Age=12 and Sex=F is 21.05-5.16 or about 16%
OR you could look at the PERCENT column to see if a single row satisfies your need.
Age | Sex | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|---|
11 | F | 1 | 5.26 | 1 | 5.26 |
11 | M | 1 | 5.26 | 2 | 10.53 |
12 | F | 2 | 10.53 | 4 | 21.05 |
12 | M | 3 | 15.79 | 7 | 36.84 |
13 | F | 2 | 10.53 | 9 | 47.37 |
13 | M | 1 | 5.26 | 10 | 52.63 |
14 | F | 2 | 10.53 | 12 | 63.16 |
14 | M | 2 | 10.53 | 14 | 73.68 |
15 | F | 2 | 10.53 | 16 | 84.21 |
15 | M | 2 | 10.53 | 18 | 94.74 |
16 | M | 1 | 5.26 | 19 | 100.00 |
@DaveKerr wrote:
Hoping this is even possible.
Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.
For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.
Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.
Any help hugely appreciated.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.