DATA Step, Macro, Functions and more

If-then do Repeat until Condition Met

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

If-then do Repeat until Condition Met

Hopefully this is the right location...I've done a decent amount of digging, but can't seem to find any similar questions.

 

I'm using SAS Enterprise Guide 7.1...I'm trying to clean up a fairly large dataset that is calculating the harmonic mean and standard deviation of 24 hour variables (each beginning with Spd_). I would like to write the script in a way that reviews the 24 bins if the standard deviation is greater than 10, drops the largest value, re-calculates the standard deviation and mean, and continues to test each record until the standard deviation for each is below 10. The basic form that I've been trying to use is below, but it doesn't actually work. Is it possible to do this and if so, how? Thank you!

 

I've also attached an example of one record and what should happen with each iteration until the record no longer meets the review condition (the attachment is the excel file of the image posted below...I wasn't sure if the image would be expandable or not).

 

data work.analysis;

   set work.analysis_set;

   if StDev>10 then

      do;

      largest(1, of Spd_Smiley Happy=.;

      StDev=std(of Spd_Smiley Happy;

      Mean=harmean(of Spd_Smiley Wink;

   end;

   repeat...

 

i.e.

Spd_CalculationExample.JPG


Accepted Solutions
Solution
‎12-28-2017 03:01 PM
PROC Star
Posts: 8,167

Re: If-then do Repeat until Condition Met

Posted in reply to trphelps15

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    x=largest(1, of set(*));
    y=whichn(x,of set(*));
    call missing (set(y));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Solution
‎12-28-2017 03:01 PM
PROC Star
Posts: 8,167

Re: If-then do Repeat until Condition Met

Posted in reply to trphelps15

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    x=largest(1, of set(*));
    y=whichn(x,of set(*));
    call missing (set(y));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 6

Re: If-then do Repeat until Condition Met

Thank you, both of the answers provided were able to accomplish what I was looking for!

 

Unfortunately I'll have to go back to the drawing board as far as my cleanup process goes. After reviewing the results, I understand that I've artificially deflated my overall values and should be eliminating values at both ends of the spectrum (largest AND smallest values), as well as modifying my threshold further. These responses though should both provide me the groundwork to update as I further improve this specific dataset's scrubbing process.

 

Any advice for modifying the script to test both the largest and smallest values? I'm thinking a comparison to the mean...utilizing an absolute value of subtracting the two values and then removing the value that has the largest difference essentially.

 

Thank you both Art and Quentin!

PROC Star
Posts: 1,460

Re: If-then do Repeat until Condition Met

Posted in reply to trphelps15

I didn't test it much, but you could make an array with the differences (residuals?) of each value from the mean.  And us then find the largest(difference) rather than largest (value).  Something like:

 

data have (keep=id Spd:);
 array Spd {*} Spd0-Spd4;
 do id=1 to 5;
   do i=1 to dim(Spd);
     spd{i}=100+sqrt(25)*rannor(3);
   end;
   output;
 end;
run;

data want; 
  set have;

  array Spd{*} Spd: ;

  array Difs{5} ; *Hard-coded number of elements in the Spd array! ;

  iter=1;
  n=n(of Spd{*});
  sd=std(of Spd{*});
  mean=harmean(of Spd{*});
  output;

  do while( (sd > .2) and (n>2) );
    iter=iter+1;

    *compute the difference between each value and the mean;
    do i=1 to dim(Difs);
      if missing(Spd{i}) then difs{i}=.;
      else difs{i}=abs(Spd{i}-mean);
    end;

    *get index of variable with largest value in the array of differences;
    largest=whichn(largest(1,of Difs{*}),of Difs{*});

    *Set the largest value to missing;
    Spd{largest}=.; 

    n=n(of Spd{*});
    sd=std(of Spd{*});
    mean=harmean(of Spd{*});
    output;   

  end;

run;

options ls=max nocenter;
proc print data=want;
 var id iter Spd: n sd mean largest difs:;
run;
PROC Star
Posts: 8,167

Re: If-then do Repeat until Condition Met

Posted in reply to trphelps15

You could just add the "smallest" function to your code. e.g.:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    call missing (set(whichn(smallest(1, of set(*)),of set(*))));
    call missing (set(whichn(largest(1, of set(*)),of set(*))));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

PROC Star
Posts: 1,460

Re: If-then do Repeat until Condition Met

Posted in reply to trphelps15

I don't do this kind of stuff much, but since it's a quiet day here, thought I'd throw out a suggestion that hopefully gives you an approach to consider.

 

Below uses arrays.  After a record is read, it computes the mean and sd.  Then uses a DO WHILE loop so that if the standard deviation is greater than a threshold, the inside of the loop will execute.  Inside the loop it uses the LARGEST function to find the largest value in the array, and the WHICHN function to find the index of that variable (i.e. which position of the array it is in).  Then it sets that variable to missing, and recomputes the mean and SD.  I added a criterion of n>2 so that it would not attempt to compute the standard deviation of one value, which would throw an error.  With this approach, it's possible that the final standard deviation could be greater than your threshold.

 

data have (keep=id Spd:);
 array Spd {*} Spd0-Spd4;
 do id=1 to 5;
   do i=1 to dim(Spd);
     spd{i}=ranuni(3);
   end;
   output;
 end;
run;

data want; 
  set have;

  array Spd{*} Spd: ;

  iter=1;
  n=n(of Spd{*});
  sd=std(of Spd{*});
  mean=harmean(of Spd{*});
  output;

  do while( (sd > .2) and (n>2) );
    iter=iter+1;

    *get index of variable with largest value in the array;
    largest=whichn(largest(1,of Spd{*}),of Spd{*});

    *Set the largest value to missing;
    Spd{largest}=.; 

    n=n(of Spd{*});
    sd=std(of Spd{*});
    mean=harmean(of Spd{*});
    output;   
  end;

run;

options ls=max nocenter;
proc print data=want;
 var id iter Spd: n sd mean largest;
run;

 

Output like:

id    iter      Spd0       Spd1       Spd2       Spd3       Spd4     n       sd        mean     largest

 1      1     0.55489    0.91027    0.19947    0.77820    0.76481    5    0.27795    0.47590       .
 1      2     0.55489     .         0.19947    0.77820    0.76481    4    0.27002    0.42517       2
 1      3     0.55489     .         0.19947     .         0.76481    3    0.28577    0.36933       4
 1      4     0.55489     .         0.19947     .          .         2    0.25132    0.29345       5
 2      1     0.90783    0.62894    0.59508    0.14938    0.19970    5    0.31821    0.31107       .
 2      2      .         0.62894    0.59508    0.14938    0.19970    4    0.25378    0.26716       1
 2      3      .          .         0.59508    0.14938    0.19970    3    0.24410    0.22418       2
 2      4      .          .          .         0.14938    0.19970    2    0.03558    0.17091       3
 3      1     0.45796    0.57161    0.56018    0.20155    0.87107    5    0.24084    0.42274       .
 3      2     0.45796    0.57161    0.56018    0.20155     .         4    0.17195    0.37454       5
 4      1     0.89158    0.81784    0.06979    0.93029    0.67952    5    0.35311    0.26015       .
 4      2     0.89158    0.81784    0.06979     .         0.67952    4    0.37375    0.22045       4
 4      3      .         0.81784    0.06979     .         0.67952    3    0.39801    0.17623       1
 4      4      .          .         0.06979     .         0.67952    2    0.43115    0.12658       2
 5      1     0.06471    0.23829    0.67566    0.86168    0.42737    5    0.32176    0.20300       .
 5      2     0.06471    0.23829    0.67566     .         0.42737    4    0.26198    0.17043       4
 5      3     0.06471    0.23829     .          .         0.42737    3    0.18138    0.13643       3

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 199 views
  • 1 like
  • 3 in conversation