BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Luisyu
Calcite | Level 5

Here's what I'm trying to do:

 

Age

1

2

3

4

5

6

7

 

Get average of age. If average is less than 5, remove topmost row in column which is "1" and recalculate average again. If still less than 5, then remove "2". Repeat until average is more than 5 and then stop and output how many numbers were removed before achieving goal. Also output the ratio of #removed to total count.

 

Any help will be greatly appreciated. Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

OK, so now it seems (that

  1. data is sorted by descending date / ascending age
  2. you want to do this process for each month
  3. ages need not be integer values

Is the above correct?  If so, then you need to read all ages for a group once to get the initial mean.  Then reread them again to trim (and count) observations to adjust the mean if neccessary.  At the end of the group's second pass you can calculate the ratio:

 

data want (keep=month total_count removed_count ratio);
  do total_count=1 by 1 until (last.month);
    set have;
    by descending month;
    age_sum=sum(age_sum,age);
  end;

  removed_count=0;
  removed_sum=0;
  do until (last.month);
    set have;
    by descending month;
    if  (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
      removed_sum=removed_sum+age;
      removed_count=removed_count+1;
    end;
  end;
  ratio=removed_count/total_count;
run;

 

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
PeterClemmensen
Tourmaline | Level 20

Hi and welcome to the SAS communities 🙂

 

Where do you want the information to go? In a data set or to the log?

PeterClemmensen
Tourmaline | Level 20

Here is an IML approach

 

data have;
input age @@;
datalines;
1 2 3 4 5 6 7
;

proc iml;
use have;
   read all var {age};
close have;

do i=1 to nrow(age) until (avg>=5);
   avg=mean(age[i:nrow(age)]);
end;

numsremoved=i-1;
ratio=numsremoved/nrow(age);

create want var {avg numsremoved ratio};
   append;
close want;

quit;
s_lassen
Meteorite | Level 14

Here is a data step solution which writes the information in the log:

data _null_;
  array temp(100) 8 _temporary_;
  do _N_=1 to nobs;
    set have nobs=nobs;
    temp(_N_)=age;
    end;
  do _N_=1 to nobs while(mean(of temp(*))<5);
    call missing(temp(_N_));
    end;
  nums_removed=_N_-1;
  ratio=nums_removed/nobs;
  if ratio=1 then
    put 'Average 5 was never reached!';
  else
    put nobs= nums_removed= ratio=;
run;

I made the array size 100 to accomodate larger datasets. If you have even more data than that, you will get an error message about invalid array index, in that case you can increase the number 100.

I used the _N_ variable for array indexing, as it is an automatic variable which is always present (so you do not have to drop it explicitly if you create a dataset).

Kurt_Bremser
Super User

Another approach, using a hash object:

data _null_;
declare hash inp (ordered:'yes');
declare hiter inpi('inp');
rc = inp.definedata('num','age');
rc = inp.definekey('num');
rc = inp.definedone();
num = 0;
do until (eof);
  num + 1;
  set have end = eof;
  inp.add();
end;
removed = 0;
size = num;
newsize = size;
total = 0;
rc = inpi.first();
do while (rc = 0);
  total + age;
  rc = inpi.next();
end;
average = total / size;
do while (average < 5);
  removed + 1;
  num = removed;
  rc = inp.remove();
  total = 0;
  newsize = 0;
  rc = inpi.first();
  do while (rc = 0);
    newsize + 1;    
    total + age;
    rc = inpi.next();
  end;
  average = total / newsize;
end;
rc = inp.output(dataset:"want");
put removed=;
percentage = removed / size;
put percentage=;
run;

The advantage of the hash object is that you do not need to make assumptions about the dataset size and it does everything with one sequential read and one sequential write. It is limited by the amount of memory available.

FreelanceReinh
Jade | Level 19

Hi @Luisyu,

 

If your dataset is much larger, e.g., like this

data have;
do _n_=1 to 100000;
  age+ranuni(2718)/7000;
  output;
end;
run;

you may want to make the iterative mean calculation more efficient:

proc summary data=have;
var age;
output out=stats n=n sum=s;
run;

data want(keep=n removed ratio);
set stats;
do k=n to 1 by -1;
  set have;
  avg=s/k;
  if avg>=5 then leave;
  s+(-age);
end;
removed=n-k;
ratio=removed/n;
if ratio<1 then output;
else put 'Average 5 was never reached!';
run;

(adopting @s_lassen's good idea to allow for the case that the target average cannot be reached).

mkeintz
PROC Star

If your dataset is sorted, then the task is really about determining the value of firstobs.  For instance if the ages are 4,5,5,5 then you would want

data want;
  set have (firstobs=2);
run;

If the maximum age <5 then you would want a message to that effect in the sas log.

 

 

And if the minimum age is 5 or more, then you don't need to examine the rest of HAVE to know that "firstobs=1" satisfies your objective (but I didn't include this operation in the code below):

data have;
  input age @@;
datalines;
1 2 3 4 5 5 5 5
run;

data _null_;

  array n {0:4} _temporary_ (5*0);

  set have end=end_of_have;

  if age<5 then n{age}+1;
  sum_age+age;

  if end_of_have;
  total_n=_n_;

  if age<5 then do;
    putlog 'Mean age >= 5 not possible. Highest ' age=;
    stop;
  end;

  age=lbound(n);
  if sum_age/total_n<5 then do until(sum_age/total_n>=5);
    if n{age}=0 then age=age+1;
    else do;
      sum_age=sum_age-age;
      total_n=total_n-1;
      n{age}=n{age}-1;
    end;
  end;
  fobs=1 + _n_-total_n;

  call execute (cats('data want; set have (firstobs=',fobs,');run;'));
run;

  1. A temporary array N is kept of the frequencies of each age <5, initialized to all zeroes.  This will provide a way to determine how many of the lowest AGE values can be kept while generating mean>=5.  You can set the lower and upper bounds of N as needed.
  2.  The subsetting if statement "IF end_of_have;" allows subsequent statement to be executed only once, after all of HAVE has been read.
  3.  The following "if age<5" provides a check on whether there are ANY ages >=5.
  4.  The "if ... then do until ..." loop allows a way to count how many age=1, then age=2, then age=3, etc. need to be skipped to produce mean age?=5.
  5.  The call execute statement constructs the next data step, with properly calculated FIRSTOBS=.
  6.  Note if there is no age>=5 then the next data step is not constructed.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Luisyu
Calcite | Level 5

Really appreciate all the feedback. Let me further clarify what I'm looking for.

I have two columns, month and age, with month column already pre-sorted from 201812~201801.

Age column is also already pre-sorted from smallest number to biggest number.

Below is a short clip of my data set:

MONTH

Age (normalized)

201812

-7.60

201812

-7.17

201812

-7.14

201812

-6.05

201812

-4.85

201812

-4.67

201812

-4.67

201812

-4.23

201812

-4.15

201812

-3.99

201812

-3.50

201812

-3.07

201812

-3.01

201812

-2.48

201812

-2.19

I need a program which will read in the data set above and calculate average age by month.

However, if monthly average age is smaller or equal to -0.2, it will remove the topmost data (-7.60) and recalculate the monthly average age. If it is still < or = to  -0.2, it will continue removing the next topmost data (-7.17) and recalculate until finally the monthly average age >-0.2. It will then output results like below to new data set. So for example, in Dec'2018 if there are a total of 100 counts of age and 10 of the smallest numbers (10 topmost rows in age column) in that month need to be removed to achieve monthly average age >-0.2, it will output below data with ratio = 0.1 (10 out of 100). So basically, the program will fill in all the data below.

MONTH

Total Count

Removed Count

Ratio

201812

100

10

0.1

201811

 

 

 

201810

 

 

 

201809

 

 

 

201808

 

 

 

201807

 

 

 

201806

 

 

 

201805

 

 

 

201804

 

 

 

201803

 

 

 

201802

 

 

 

201801

 

 

 

 

Hope this is clear enough. Thank you.

 

mkeintz
PROC Star

OK, so now it seems (that

  1. data is sorted by descending date / ascending age
  2. you want to do this process for each month
  3. ages need not be integer values

Is the above correct?  If so, then you need to read all ages for a group once to get the initial mean.  Then reread them again to trim (and count) observations to adjust the mean if neccessary.  At the end of the group's second pass you can calculate the ratio:

 

data want (keep=month total_count removed_count ratio);
  do total_count=1 by 1 until (last.month);
    set have;
    by descending month;
    age_sum=sum(age_sum,age);
  end;

  removed_count=0;
  removed_sum=0;
  do until (last.month);
    set have;
    by descending month;
    if  (age_sum-removed_sum)/(total_count-removed_count) <= -0.2 then do;
      removed_sum=removed_sum+age;
      removed_count=removed_count+1;
    end;
  end;
  ratio=removed_count/total_count;
run;

 

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Luisyu
Calcite | Level 5

One question. If instead of calculating mean, I want to calculate the standard deviation, how do I  modify the program? Thanks.

mkeintz
PROC Star

The sample mean is   sum(of sample values) divided by sample count.  The only issue in this problem is to properly remove elements from the sum of sample values, and reduce sample count correspondingly.

 

The std can be the same.  But you have to also track the sum of (squared sample values).  Given the formula for the sample std is the square root of the sample variance, start with the sample variance formula:

 

 

sample_variance.PNG

 

 

So, in addition to accumulating sum of the values (used for means and also for the right hand term above), you need to also accumulate sum of the squared values (for the left hand term), and perform the same REMOVE tracking as for the mean.  Once the removal totals are complete, just generate the sample variance above and get its square root.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Luisyu
Calcite | Level 5
I know the formula for stdev but I'm having a hard time coding the remove tracking for the stdev into the program in addition to the mean. Can you please modify your program above to include the stdev? Appreciate it!
mkeintz
PROC Star

The reason I put the formula for STD - especially the 2nd variation of the formula - was to formulate it in a way that fits perfectly into the program structure I presented for calculating the mean.

 

I will not be writing the program, which is just a way of showing you do not yet understand the logic of the program I already provided.  Here's a clue:   Every place you see AGE_SUM add a parallel statement for AGESQUARE_SUM.  Same thing with REMOVED_SUM.

 

Then where the ratio is calculated, you will have all the variable needed for generating STD.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1844 views
  • 3 likes
  • 6 in conversation