BookmarkSubscribeRSS Feed
DaveKerr
Calcite | Level 5

Hoping this is even possible.

 

Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.

 

For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.

 

Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.

 

Any help hugely appreciated.

 

 

5 REPLIES 5
Ksharp
Super User
/*
If these  two fields are numeric type,
You could try branks() function in SAS/IML.
*/
data have;
 set sashelp.heart(obs=100);
 keep weight height;
run;



proc iml;
use have;
read all var _all_ into x[c=vname];
close;
rank=branks(x)[,3];
create rank var {rank};
append ;
close;
quit;
data temp;
 merge have rank;
run;
proc rank data=temp out=want groups=20;
var rank;
ranks group;
run;
proc sort data=want;by group;run;
data want;
 set want;
 flag=ifn(group=0,0,1);
run;

proc freq data=want;
 table flag/missing;
run;
Oligolas
Barite | Level 11

This is just so sadly wrong that it is almost funny.

Without any other restriction you'll ALWAYS find 5% of records matching var1<x AND var2<y

Let's say 5% of records are 21 observations

order your dataset by var2

Set var1 to max of var1

count 21 observations and 

Tadaaa you've got the result where var2 = value of var2 at obs 21+1

 

what bothers me is that you're trying to find a subgroup that reaches a significance level and that's just so wrong that it's difficult to know where to begin

 

Take sashelp.cars

one of the conditions you're looking for can be found at 

where msrp LT $192,465 AND invoice LT$12,830

you get 21 obs that represent 5% of the 428 records in sashelp.cars

 

and that's just one of the results...

 

 

________________________

- Cheers -

FreelanceReinh
Jade | Level 19

Hello @DaveKerr and welcome to the SAS Support Communities!

 

In general, with two numeric variables (e.g., age and income) you will get a convex curve of eligible pairs of values (see PROC SGPLOT step in the code example below). So you'll need to define some sort of optimality criterion to determine the "optimal" point of the curve, e.g., minimize an objective function. Unless you have advanced tools (such as SAS/OR) available, you may need to categorize one of the two variables. Then you could obtain the relevant cutoff value for the other variable in each of the cumulative categories. In the case of age (assuming integer values, say, 18, 19, 20, ...), using the individual age values (as cutoffs) should be feasible. So you would determine the cutoff value for income in each of the groups age=18, age<=19, age<=20 and so on, i.e., the income X(age) such that the subset "income<=X(age)" in the respective group just comprises 5% of the entire dataset (which might exclude a few young age groups due to their size). Finally, apply the optimality criterion to those pairs of age and income cutoff values.

 

Below is an example. For simplicity I assume non-missing values for the relevant numeric variables (AGE and INCOME). In the presence of missing values you would need to exclude affected observations (since otherwise a condition such as age<=20 would also select observations with a missing age value).

/* Create sample data for demonstration */

data have;
call streaminit(3141592);
do i=1 to 100000;
  age=rand('integer',18,65);
  income=round(rand('lognormal',9.117+0.0531*age,age/240),0.01);
output;
end;
run;

/* Store 5% of the number of obs. in HAVE in macro variable N5P */

data _null_;
call symputx('n5p',ceil(0.05*n));
stop;
set have nobs=n;
run;

%put &=n5p;

/* Create view with income values of cumulative age groups */

proc sql;
create view vtemp as
select a.age, b.income
from (select distinct age from have a) left join have b
on b.age<=a.age
order by age, income;
quit;

/* Determine income cutoff value for each group (if &N5P obs. are available)
   and compute objective function */

data opt;
do _n_=1 by 1 until(last.age);
  set vtemp;
  by age;
  if _n_=&n5p then do;
    c=log(income)**2+(age/20)**2; /* define your own objective function */
    output;
  end;
end;
run;

/* Optional: Visualize relationship between cutoff values */

proc sgplot data=opt;
series x=age y=income;
run;

/* Determine the combination of cutoff values minimizing the objective function */

proc summary data=opt;
var c;
output out=want(drop=_:) minid(c(age income))=opt_age opt_income;
run;

proc print data=want;
run;

/* Optional: Perform the selection */

data sel(drop=opt_:);
if _n_=1 then set want;
set have;
if age<=opt_age & income<=opt_income;
run;

 

ballardw
Super User

Actually very simple.

You should be able to run the following code as SASHELP.Class  should be in your install as a training data set.

proc freq data=sashelp.class;
  tables age*sex/list;
run;

The list option makes a one row per combination of the variables. The table resulting will have a cumulative percent. Scroll down the cumulative percent column until the percent shown is at least as large as you want.  Then select all the values of the two variables as your criteria. With this set that would be Age=11 and Sex=F as this set is small but that combination is 5.26% of the data.  This approach does have a bias on sort order of the values but since I'm not really understanding the need and you said " any combination of two fields" this fills the requirement easily. Also easy to extend and if you need "at least 5% in one group and 15% in another group" you could just go down the list. Age=11 and Sex=F plus Age=12 and Sex=F is 21.05-5.16 or about 16%

OR you could look at the PERCENT column to see if a single row satisfies your need.

Age Sex Frequency Percent Cumulative
Frequency
Cumulative
Percent
11 F 1 5.26 1 5.26
11 M 1 5.26 2 10.53
12 F 2 10.53 4 21.05
12 M 3 15.79 7 36.84
13 F 2 10.53 9 47.37
13 M 1 5.26 10 52.63
14 F 2 10.53 12 63.16
14 M 2 10.53 14 73.68
15 F 2 10.53 16 84.21
15 M 2 10.53 18 94.74
16 M 1 5.26 19 100.00

 


@DaveKerr wrote:

Hoping this is even possible.

 

Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.

 

For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.

 

Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.

 

Any help hugely appreciated.

 

 


 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 444 views
  • 0 likes
  • 6 in conversation