Here's what I'm trying to do:
Age
1
2
3
4
5
6
7
Get average of age. If average is less than 5, remove topmost row in column which is "1" and recalculate average again. If still less than 5, then remove "2". Repeat until average is more than 5 and then stop and output how many numbers were removed before achieving goal. Also output the ratio of #removed to total count.
Any help will be greatly appreciated. Thank you!
OK, so now it seems (that
Is the above correct? If so, then you need to read all ages for a group once to get the initial mean. Then reread them again to trim (and count) observations to adjust the mean if neccessary. At the end of the group's second pass you can calculate the ratio:
data want (keep=month total_count removed_count ratio);
do total_count=1 by 1 until (last.month);
set have;
by descending month;
age_sum=sum(age_sum,age);
end;
removed_count=0;
removed_sum=0;
do until (last.month);
set have;
by descending month;
if (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
removed_sum=removed_sum+age;
removed_count=removed_count+1;
end;
end;
ratio=removed_count/total_count;
run;
Hi and welcome to the SAS communities 🙂
Where do you want the information to go? In a data set or to the log?
Here is an IML approach
data have;
input age @@;
datalines;
1 2 3 4 5 6 7
;
proc iml;
use have;
read all var {age};
close have;
do i=1 to nrow(age) until (avg>=5);
avg=mean(age[i:nrow(age)]);
end;
numsremoved=i-1;
ratio=numsremoved/nrow(age);
create want var {avg numsremoved ratio};
append;
close want;
quit;
Here is a data step solution which writes the information in the log:
data _null_;
array temp(100) 8 _temporary_;
do _N_=1 to nobs;
set have nobs=nobs;
temp(_N_)=age;
end;
do _N_=1 to nobs while(mean(of temp(*))<5);
call missing(temp(_N_));
end;
nums_removed=_N_-1;
ratio=nums_removed/nobs;
if ratio=1 then
put 'Average 5 was never reached!';
else
put nobs= nums_removed= ratio=;
run;
I made the array size 100 to accomodate larger datasets. If you have even more data than that, you will get an error message about invalid array index, in that case you can increase the number 100.
I used the _N_ variable for array indexing, as it is an automatic variable which is always present (so you do not have to drop it explicitly if you create a dataset).
Another approach, using a hash object:
data _null_;
declare hash inp (ordered:'yes');
declare hiter inpi('inp');
rc = inp.definedata('num','age');
rc = inp.definekey('num');
rc = inp.definedone();
num = 0;
do until (eof);
num + 1;
set have end = eof;
inp.add();
end;
removed = 0;
size = num;
newsize = size;
total = 0;
rc = inpi.first();
do while (rc = 0);
total + age;
rc = inpi.next();
end;
average = total / size;
do while (average < 5);
removed + 1;
num = removed;
rc = inp.remove();
total = 0;
newsize = 0;
rc = inpi.first();
do while (rc = 0);
newsize + 1;
total + age;
rc = inpi.next();
end;
average = total / newsize;
end;
rc = inp.output(dataset:"want");
put removed=;
percentage = removed / size;
put percentage=;
run;
The advantage of the hash object is that you do not need to make assumptions about the dataset size and it does everything with one sequential read and one sequential write. It is limited by the amount of memory available.
Hi @Luisyu,
If your dataset is much larger, e.g., like this
data have;
do _n_=1 to 100000;
age+ranuni(2718)/7000;
output;
end;
run;
you may want to make the iterative mean calculation more efficient:
proc summary data=have;
var age;
output out=stats n=n sum=s;
run;
data want(keep=n removed ratio);
set stats;
do k=n to 1 by -1;
set have;
avg=s/k;
if avg>=5 then leave;
s+(-age);
end;
removed=n-k;
ratio=removed/n;
if ratio<1 then output;
else put 'Average 5 was never reached!';
run;
(adopting @s_lassen's good idea to allow for the case that the target average cannot be reached).
If your dataset is sorted, then the task is really about determining the value of firstobs. For instance if the ages are 4,5,5,5 then you would want
data want;
set have (firstobs=2);
run;
If the maximum age <5 then you would want a message to that effect in the sas log.
And if the minimum age is 5 or more, then you don't need to examine the rest of HAVE to know that "firstobs=1" satisfies your objective (but I didn't include this operation in the code below):
data have;
input age @@;
datalines;
1 2 3 4 5 5 5 5
run;
data _null_;
array n {0:4} _temporary_ (5*0);
set have end=end_of_have;
if age<5 then n{age}+1;
sum_age+age;
if end_of_have;
total_n=_n_;
if age<5 then do;
putlog 'Mean age >= 5 not possible. Highest ' age=;
stop;
end;
age=lbound(n);
if sum_age/total_n<5 then do until(sum_age/total_n>=5);
if n{age}=0 then age=age+1;
else do;
sum_age=sum_age-age;
total_n=total_n-1;
n{age}=n{age}-1;
end;
end;
fobs=1 + _n_-total_n;
call execute (cats('data want; set have (firstobs=',fobs,');run;'));
run;
Really appreciate all the feedback. Let me further clarify what I'm looking for.
I have two columns, month and age, with month column already pre-sorted from 201812~201801.
Age column is also already pre-sorted from smallest number to biggest number.
Below is a short clip of my data set:
MONTH | Age (normalized) |
201812 | -7.60 |
201812 | -7.17 |
201812 | -7.14 |
201812 | -6.05 |
201812 | -4.85 |
201812 | -4.67 |
201812 | -4.67 |
201812 | -4.23 |
201812 | -4.15 |
201812 | -3.99 |
201812 | -3.50 |
201812 | -3.07 |
201812 | -3.01 |
201812 | -2.48 |
201812 | -2.19 |
I need a program which will read in the data set above and calculate average age by month.
However, if monthly average age is smaller or equal to -0.2, it will remove the topmost data (-7.60) and recalculate the monthly average age. If it is still < or = to -0.2, it will continue removing the next topmost data (-7.17) and recalculate until finally the monthly average age >-0.2. It will then output results like below to new data set. So for example, in Dec'2018 if there are a total of 100 counts of age and 10 of the smallest numbers (10 topmost rows in age column) in that month need to be removed to achieve monthly average age >-0.2, it will output below data with ratio = 0.1 (10 out of 100). So basically, the program will fill in all the data below.
MONTH | Total Count | Removed Count | Ratio |
201812 | 100 | 10 | 0.1 |
201811 |
|
|
|
201810 |
|
|
|
201809 |
|
|
|
201808 |
|
|
|
201807 |
|
|
|
201806 |
|
|
|
201805 |
|
|
|
201804 |
|
|
|
201803 |
|
|
|
201802 |
|
|
|
201801 |
|
|
|
Hope this is clear enough. Thank you.
OK, so now it seems (that
Is the above correct? If so, then you need to read all ages for a group once to get the initial mean. Then reread them again to trim (and count) observations to adjust the mean if neccessary. At the end of the group's second pass you can calculate the ratio:
data want (keep=month total_count removed_count ratio);
do total_count=1 by 1 until (last.month);
set have;
by descending month;
age_sum=sum(age_sum,age);
end;
removed_count=0;
removed_sum=0;
do until (last.month);
set have;
by descending month;
if (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
removed_sum=removed_sum+age;
removed_count=removed_count+1;
end;
end;
ratio=removed_count/total_count;
run;
One question. If instead of calculating mean, I want to calculate the standard deviation, how do I modify the program? Thanks.
The sample mean is sum(of sample values) divided by sample count. The only issue in this problem is to properly remove elements from the sum of sample values, and reduce sample count correspondingly.
The std can be the same. But you have to also track the sum of (squared sample values). Given the formula for the sample std is the square root of the sample variance, start with the sample variance formula:
So, in addition to accumulating sum of the values (used for means and also for the right hand term above), you need to also accumulate sum of the squared values (for the left hand term), and perform the same REMOVE tracking as for the mean. Once the removal totals are complete, just generate the sample variance above and get its square root.
The reason I put the formula for STD - especially the 2nd variation of the formula - was to formulate it in a way that fits perfectly into the program structure I presented for calculating the mean.
I will not be writing the program, which is just a way of showing you do not yet understand the logic of the program I already provided. Here's a clue: Every place you see AGE_SUM add a parallel statement for AGESQUARE_SUM. Same thing with REMOVED_SUM.
Then where the ratio is calculated, you will have all the variable needed for generating STD.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.