Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- How do I combine two fields to make up at least 5%25 of a table %3F

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 09-14-2023 04:26 AM
(225 views)

Hoping this is even possible.

Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.

For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.

Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.

Any help hugely appreciated.

5 REPLIES 5

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If you've got SAS/ETS licensed then look into Proc Surveyselect - Stratified Sampling

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

```
/*
If these two fields are numeric type,
You could try branks() function in SAS/IML.
*/
data have;
set sashelp.heart(obs=100);
keep weight height;
run;
proc iml;
use have;
read all var _all_ into x[c=vname];
close;
rank=branks(x)[,3];
create rank var {rank};
append ;
close;
quit;
data temp;
merge have rank;
run;
proc rank data=temp out=want groups=20;
var rank;
ranks group;
run;
proc sort data=want;by group;run;
data want;
set want;
flag=ifn(group=0,0,1);
run;
proc freq data=want;
table flag/missing;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

This is just so sadly wrong that it is almost funny.

Without any other restriction you'll ALWAYS find 5% of records matching var1<x AND var2<y

Let's say 5% of records are 21 observations

order your dataset by var2

Set var1 to max of var1

count 21 observations and

Tadaaa you've got the result where var2 = value of var2 at obs 21+1

what bothers me is that you're trying to find a subgroup that reaches a significance level and that's just so wrong that it's difficult to know where to begin

Take sashelp.cars

one of the conditions you're looking for can be found at

where msrp LT $192,465 AND invoice LT$12,830

you get 21 obs that represent 5% of the 428 records in sashelp.cars

and that's just one of the results...

________________________

- Cheers -

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hello @DaveKerr and welcome to the SAS Support Communities!

In general, with two numeric variables (e.g., age and income) you will get a convex *curve* of eligible pairs of values (see PROC SGPLOT step in the code example below). So you'll need to define some sort of optimality criterion to determine the "optimal" point of the curve, e.g., minimize an objective function. Unless you have advanced tools (such as SAS/OR) available, you may need to categorize one of the two variables. Then you could obtain the relevant cutoff value for the other variable in each of the cumulative categories. In the case of age (assuming integer values, say, 18, 19, 20, ...), using the individual age values (as cutoffs) should be feasible. So you would determine the cutoff value for income in each of the groups age=18, age<=19, age<=20 and so on, i.e., the income X(age) such that the subset "income<=X(age)" in the respective group just comprises 5% of the entire dataset (which might exclude a few young age groups due to their size). Finally, apply the optimality criterion to those pairs of age and income cutoff values.

Below is an example. For simplicity I assume non-missing values for the relevant numeric variables (AGE and INCOME). In the presence of missing values you would need to exclude affected observations (since otherwise a condition such as age<=20 would also select observations with a missing age value).

```
/* Create sample data for demonstration */
data have;
call streaminit(3141592);
do i=1 to 100000;
age=rand('integer',18,65);
income=round(rand('lognormal',9.117+0.0531*age,age/240),0.01);
output;
end;
run;
/* Store 5% of the number of obs. in HAVE in macro variable N5P */
data _null_;
call symputx('n5p',ceil(0.05*n));
stop;
set have nobs=n;
run;
%put &=n5p;
/* Create view with income values of cumulative age groups */
proc sql;
create view vtemp as
select a.age, b.income
from (select distinct age from have a) left join have b
on b.age<=a.age
order by age, income;
quit;
/* Determine income cutoff value for each group (if &N5P obs. are available)
and compute objective function */
data opt;
do _n_=1 by 1 until(last.age);
set vtemp;
by age;
if _n_=&n5p then do;
c=log(income)**2+(age/20)**2; /* define your own objective function */
output;
end;
end;
run;
/* Optional: Visualize relationship between cutoff values */
proc sgplot data=opt;
series x=age y=income;
run;
/* Determine the combination of cutoff values minimizing the objective function */
proc summary data=opt;
var c;
output out=want(drop=_:) minid(c(age income))=opt_age opt_income;
run;
proc print data=want;
run;
/* Optional: Perform the selection */
data sel(drop=opt_:);
if _n_=1 then set want;
set have;
if age<=opt_age & income<=opt_income;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Actually very simple.

You should be able to run the following code as SASHELP.Class should be in your install as a training data set.

proc freq data=sashelp.class; tables age*sex/list; run;

The list option makes a one row per combination of the variables. The table resulting will have a cumulative percent. Scroll down the cumulative percent column until the percent shown is at least as large as you want. Then select all the values of the two variables as your criteria. With this set that would be Age=11 and Sex=F as this set is small but that combination is 5.26% of the data. This approach does have a bias on sort order of the values but since I'm not really understanding the need and you said " any combination of two fields" this fills the requirement easily. Also easy to extend and if you need "at least 5% in one group and 15% in another group" you could just go down the list. Age=11 and Sex=F plus Age=12 and Sex=F is 21.05-5.16 or about 16%

OR you could look at the PERCENT column to see if a single row satisfies your need.

Age | Sex | Frequency | Percent | Cumulative Frequency |
Cumulative Percent |
---|---|---|---|---|---|

11 | F | 1 | 5.26 | 1 | 5.26 |

11 | M | 1 | 5.26 | 2 | 10.53 |

12 | F | 2 | 10.53 | 4 | 21.05 |

12 | M | 3 | 15.79 | 7 | 36.84 |

13 | F | 2 | 10.53 | 9 | 47.37 |

13 | M | 1 | 5.26 | 10 | 52.63 |

14 | F | 2 | 10.53 | 12 | 63.16 |

14 | M | 2 | 10.53 | 14 | 73.68 |

15 | F | 2 | 10.53 | 16 | 84.21 |

15 | M | 2 | 10.53 | 18 | 94.74 |

16 | M | 1 | 5.26 | 19 | 100.00 |

@DaveKerr wrote:

Hoping this is even possible.

Essentially I have a table with several numeric customer data fields (eg credit score, income, age etc) and looking to layer any combination of two fields in order to create a population which accounts for at least 5% of the table.

For example customers with an age of less than 30 and an income of less than 25000 may account for - say - 3.5% of the table. My task is to therefore find a "sweet spot" of the two fields combined that would give me 5% of the table instead of 3.5%. So, perhaps age of less then 34 and income of less than 26000.

Question therefore is, how do I (evenly, if possible) scroll through each field and find that 5% sweet spot.

Any help hugely appreciated.

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.