Solved: How to loop calculating avg of column, then remove topmost data, then ...

Luisyu · Posted 01-28-2019 02:03 AM

Here's what I'm trying to do:

Age

1

2

3

4

5

6

7

Get average of age. If average is less than 5, remove topmost row in column which is "1" and recalculate average again. If still less than 5, then remove "2". Repeat until average is more than 5 and then stop and output how many numbers were removed before achieving goal. Also output the ratio of #removed to total count.

Any help will be greatly appreciated. Thank you!

mkeintz · Posted 01-28-2019 07:37 PM

OK, so now it seems (that

data is sorted by descending date / ascending age
you want to do this process for each month
ages need not be integer values

Is the above correct? If so, then you need to read all ages for a group once to get the initial mean. Then reread them again to trim (and count) observations to adjust the mean if neccessary. At the end of the group's second pass you can calculate the ratio:

data want (keep=month total_count removed_count ratio);
  do total_count=1 by 1 until (last.month);
    set have;
    by descending month;
    age_sum=sum(age_sum,age);
  end;

  removed_count=0;
  removed_sum=0;
  do until (last.month);
    set have;
    by descending month;
    if  (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
      removed_sum=removed_sum+age;
      removed_count=removed_count+1;
    end;
  end;
  ratio=removed_count/total_count;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PeterClemmensen · Posted 01-28-2019 03:06 AM

Hi and welcome to the SAS communities 🙂

Where do you want the information to go? In a data set or to the log?

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 01-28-2019 03:17 AM

Here is an IML approach

data have;
input age @@;
datalines;
1 2 3 4 5 6 7
;

proc iml;
use have;
   read all var {age};
close have;

do i=1 to nrow(age) until (avg>=5);
   avg=mean(age[i:nrow(age)]);
end;

numsremoved=i-1;
ratio=numsremoved/nrow(age);

create want var {avg numsremoved ratio};
   append;
close want;

quit;

The DATA to DATA Step Macro
Blog: SASnrd

s_lassen · Posted 01-28-2019 03:48 AM

Here is a data step solution which writes the information in the log:

data _null_;
  array temp(100) 8 _temporary_;
  do _N_=1 to nobs;
    set have nobs=nobs;
    temp(_N_)=age;
    end;
  do _N_=1 to nobs while(mean(of temp(*))<5);
    call missing(temp(_N_));
    end;
  nums_removed=_N_-1;
  ratio=nums_removed/nobs;
  if ratio=1 then
    put 'Average 5 was never reached!';
  else
    put nobs= nums_removed= ratio=;
run;

I made the array size 100 to accomodate larger datasets. If you have even more data than that, you will get an error message about invalid array index, in that case you can increase the number 100.

I used the _N_ variable for array indexing, as it is an automatic variable which is always present (so you do not have to drop it explicitly if you create a dataset).

Kurt_Bremser · Posted 01-28-2019 04:54 AM

Another approach, using a hash object:

data _null_;
declare hash inp (ordered:'yes');
declare hiter inpi('inp');
rc = inp.definedata('num','age');
rc = inp.definekey('num');
rc = inp.definedone();
num = 0;
do until (eof);
  num + 1;
  set have end = eof;
  inp.add();
end;
removed = 0;
size = num;
newsize = size;
total = 0;
rc = inpi.first();
do while (rc = 0);
  total + age;
  rc = inpi.next();
end;
average = total / size;
do while (average < 5);
  removed + 1;
  num = removed;
  rc = inp.remove();
  total = 0;
  newsize = 0;
  rc = inpi.first();
  do while (rc = 0);
    newsize + 1;    
    total + age;
    rc = inpi.next();
  end;
  average = total / newsize;
end;
rc = inp.output(dataset:"want");
put removed=;
percentage = removed / size;
put percentage=;
run;

The advantage of the hash object is that you do not need to make assumptions about the dataset size and it does everything with one sequential read and one sequential write. It is limited by the amount of memory available.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 01-28-2019 06:03 AM

Hi @Luisyu,

If your dataset is much larger, e.g., like this

data have;
do _n_=1 to 100000;
  age+ranuni(2718)/7000;
  output;
end;
run;

you may want to make the iterative mean calculation more efficient:

proc summary data=have;
var age;
output out=stats n=n sum=s;
run;

data want(keep=n removed ratio);
set stats;
do k=n to 1 by -1;
  set have;
  avg=s/k;
  if avg>=5 then leave;
  s+(-age);
end;
removed=n-k;
ratio=removed/n;
if ratio<1 then output;
else put 'Average 5 was never reached!';
run;

(adopting @s_lassen's good idea to allow for the case that the target average cannot be reached).

mkeintz · Posted 01-28-2019 01:05 PM

If your dataset is sorted, then the task is really about determining the value of firstobs. For instance if the ages are 4,5,5,5 then you would want

data want;
  set have (firstobs=2);
run;

If the maximum age <5 then you would want a message to that effect in the sas log.

And if the minimum age is 5 or more, then you don't need to examine the rest of HAVE to know that "firstobs=1" satisfies your objective (but I didn't include this operation in the code below):

data have;
  input age @@;
datalines;
1 2 3 4 5 5 5 5
run;

data _null_;

  array n {0:4} _temporary_ (5*0);

  set have end=end_of_have;

  if age<5 then n{age}+1;
  sum_age+age;

  if end_of_have;
  total_n=_n_;

  if age<5 then do;
    putlog 'Mean age >= 5 not possible. Highest ' age=;
    stop;
  end;

  age=lbound(n);
  if sum_age/total_n<5 then do until(sum_age/total_n>=5);
    if n{age}=0 then age=age+1;
    else do;
      sum_age=sum_age-age;
      total_n=total_n-1;
      n{age}=n{age}-1;
    end;
  end;
  fobs=1 + _n_-total_n;

  call execute (cats('data want; set have (firstobs=',fobs,');run;'));
run;

A temporary array N is kept of the frequencies of each age <5, initialized to all zeroes. This will provide a way to determine how many of the lowest AGE values can be kept while generating mean>=5. You can set the lower and upper bounds of N as needed.
The subsetting if statement "IF end_of_have;" allows subsequent statement to be executed only once, after all of HAVE has been read.
The following "if age<5" provides a check on whether there are ANY ages >=5.
The "if ... then do until ..." loop allows a way to count how many age=1, then age=2, then age=3, etc. need to be skipped to produce mean age?=5.
The call execute statement constructs the next data step, with properly calculated FIRSTOBS=.
Note if there is no age>=5 then the next data step is not constructed.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Luisyu · Posted 01-28-2019 06:15 PM

Really appreciate all the feedback. Let me further clarify what I'm looking for.

I have two columns, month and age, with month column already pre-sorted from 201812~201801.

Age column is also already pre-sorted from smallest number to biggest number.

Below is a short clip of my data set:

MONTH	Age (normalized)
201812	-7.60
201812	-7.17
201812	-7.14
201812	-6.05
201812	-4.85
201812	-4.67
201812	-4.67
201812	-4.23
201812	-4.15
201812	-3.99
201812	-3.50
201812	-3.07
201812	-3.01
201812	-2.48
201812	-2.19

I need a program which will read in the data set above and calculate average age by month.

However, if monthly average age is smaller or equal to -0.2, it will remove the topmost data (-7.60) and recalculate the monthly average age. If it is still < or = to -0.2, it will continue removing the next topmost data (-7.17) and recalculate until finally the monthly average age >-0.2. It will then output results like below to new data set. So for example, in Dec'2018 if there are a total of 100 counts of age and 10 of the smallest numbers (10 topmost rows in age column) in that month need to be removed to achieve monthly average age >-0.2, it will output below data with ratio = 0.1 (10 out of 100). So basically, the program will fill in all the data below.

MONTH	Total Count	Removed Count	Ratio
201812	100	10	0.1
201811
201810
201809
201808
201807
201806
201805
201804
201803
201802
201801

Hope this is clear enough. Thank you.

mkeintz · Posted 01-28-2019 07:37 PM

OK, so now it seems (that

data is sorted by descending date / ascending age
you want to do this process for each month
ages need not be integer values

Is the above correct? If so, then you need to read all ages for a group once to get the initial mean. Then reread them again to trim (and count) observations to adjust the mean if neccessary. At the end of the group's second pass you can calculate the ratio:

data want (keep=month total_count removed_count ratio);
  do total_count=1 by 1 until (last.month);
    set have;
    by descending month;
    age_sum=sum(age_sum,age);
  end;

  removed_count=0;
  removed_sum=0;
  do until (last.month);
    set have;
    by descending month;
    if  (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
      removed_sum=removed_sum+age;
      removed_count=removed_count+1;
    end;
  end;
  ratio=removed_count/total_count;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Luisyu · Posted 01-29-2019 02:08 AM

It works perfectly. Thank you!

Luisyu · Posted 01-31-2019 12:39 AM

One question. If instead of calculating mean, I want to calculate the standard deviation, how do I modify the program? Thanks.

mkeintz · Posted 01-31-2019 03:00 PM

The sample mean is sum(of sample values) divided by sample count. The only issue in this problem is to properly remove elements from the sum of sample values, and reduce sample count correspondingly.

The std can be the same. But you have to also track the sum of (squared sample values). Given the formula for the sample std is the square root of the sample variance, start with the sample variance formula:

So, in addition to accumulating sum of the values (used for means and also for the right hand term above), you need to also accumulate sum of the squared values (for the left hand term), and perform the same REMOVE tracking as for the mean. Once the removal totals are complete, just generate the sample variance above and get its square root.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Luisyu · Posted 01-31-2019 07:02 PM

I know the formula for stdev but I'm having a hard time coding the remove tracking for the stdev into the program in addition to the mean. Can you please modify your program above to include the stdev? Appreciate it!

mkeintz · Posted 01-31-2019 07:53 PM

The reason I put the formula for STD - especially the 2nd variation of the formula - was to formulate it in a way that fits perfectly into the program structure I presented for calculating the mean.

I will not be writing the program, which is just a way of showing you do not yet understand the logic of the program I already provided. Here's a clue: Every place you see AGE_SUM add a parallel statement for AGESQUARE_SUM. Same thing with REMOVED_SUM.

Then where the ratio is calculated, you will have all the variable needed for generating STD.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Re: How to loop calculating avg of column, then remove topmost data, then recalculate avg again?

Classroom Training Available!