Hello everyone,
I need to filter out specific IDs which lead to exceeding predefined thresholds. What I have so far:
data have; infile datalines delimiter= ','; input ID Factor $ Amount ; datalines; 1,a,100 2,b,200 3,c,300 4,a,400 5,a,500 6,c,600 7,c,700 8,b,800 9,a,900 10,b,1000 ; data max_table; infile datalines delimiter= ','; input Factor $ max_percentage ; datalines; a,0.3 b,0.4 c,0.25 ; PROC SQL; CREATE TABLE have_2 AS SELECT DISTINCT base.Factor ,base.amount_factor / sum(amount) AS Factor_distribution FORMAT = PERCENTN8.2 ,base.amount_factor FORMAT = COMMA20. ,sum(amount) AS amount_total FORMAT = COMMA20. ,max.max_percentage FORMAT = PERCENTN8.2 ,base.amount_factor - (sum(amount)*max.max_percentage) AS Difference FROM (SELECT Factor ,amount ,sum(amount) AS amount_factor FROM have GROUP BY Factor ) AS base LEFT JOIN max_table AS max ON base.factor = max.factor ;QUIT;
The table have_2 shows that the amounts for Factor a and c exceed the predefined maximum distribution.
For Factor a I have a maximum of 30% and it exceeds it by 4.55% or 250, and for Factor c it exceeds the maximum of 25% by 4.09% or 225.
I would need to build a Macro which runs once to figure out whether or not the thresholds are met, if yes the IDs with the highest amounts must be removed from the dataset and written to another dataset.
My final table should then only include the IDs which meet the threshholds.
In this example for Factor a, the ID=9 should be written into another table and for Factor c ID=7 should be removed from the dataset.
At the end I would need two tables, the first one including all IDs which meet the threshold and the other table all IDs which have been removed (if any).
Let me know if anything is unclear, and thank you very much for your help.
Best
John
Do you always just remove one record: the highest?
Note:
This post exemplifies exactly why I never use tabs to format code: you did a perfect job formatting it on your computer, and it looks like a mess here.
> I would need to sort the records and then remove the highest ones until the threshold is met.
This will never happen.
The sum of percentages is 100, and the sum of thresholds is less than 100. So each time you recalculate, some percentages will be above threshold.
Like this?
proc sql;
create table TOTAL as select sum(AMOUNT) as TOTAL from HAVE;
run;
proc sort data=HAVE out=SORTED;
by FACTOR AMOUNT;
run;
data WANT OUT;
if _N_=1 then set TOTAL;
merge SORTED
MAX_TABLE;
by FACTOR;
if first.FACTOR then SUM=0;
SUM+AMOUNT;
if SUM/TOTAL < MAX_PERCENTAGE then output WANT;
else output OUT;
run;
NOTE: The data set WORK.WANT has 8 observations and 6 variables.
NOTE: The data set WORK.OUT has 2 observations and 6 variables.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.