turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- If-then do Repeat until Condition Met

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-28-2017 10:19 AM

Hopefully this is the right location...I've done a decent amount of digging, but can't seem to find any similar questions.

I'm using SAS Enterprise Guide 7.1...I'm trying to clean up a fairly large dataset that is calculating the harmonic mean and standard deviation of 24 hour variables (each beginning with Spd_). I would like to write the script in a way that reviews the 24 bins if the standard deviation is greater than 10, drops the largest value, re-calculates the standard deviation and mean, and continues to test each record until the standard deviation for each is below 10. The basic form that I've been trying to use is below, but it doesn't actually work. Is it possible to do this and if so, how? Thank you!

I've also attached an example of one record and what should happen with each iteration until the record no longer meets the review condition (the attachment is the excel file of the image posted below...I wasn't sure if the image would be expandable or not).

data work.analysis;

set work.analysis_set;

if StDev>10 then

do;

largest(1, of Spd_=.;

StDev=std(of Spd_;

Mean=harmean(of Spd_;

end;

repeat...

i.e.

Accepted Solutions

Solution

12-28-2017
03:01 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to trphelps15

12-28-2017 11:37 AM

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); x=largest(1, of set(*)); y=whichn(x,of set(*)); call missing (set(y)); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;

Art, CEO, AnalystFinder.com

All Replies

Solution

12-28-2017
03:01 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to trphelps15

12-28-2017 11:37 AM

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); x=largest(1, of set(*)); y=whichn(x,of set(*)); call missing (set(y)); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;

Art, CEO, AnalystFinder.com

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to art297

12-28-2017 03:09 PM

Thank you, both of the answers provided were able to accomplish what I was looking for!

Unfortunately I'll have to go back to the drawing board as far as my cleanup process goes. After reviewing the results, I understand that I've artificially deflated my overall values and should be eliminating values at both ends of the spectrum (largest AND smallest values), as well as modifying my threshold further. These responses though should both provide me the groundwork to update as I further improve this specific dataset's scrubbing process.

Any advice for modifying the script to test both the largest and smallest values? I'm thinking a comparison to the mean...utilizing an absolute value of subtracting the two values and then removing the value that has the largest difference essentially.

Thank you both Art and Quentin!

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to trphelps15

12-28-2017 03:41 PM

I didn't test it much, but you could make an array with the differences (residuals?) of each value from the mean. And us then find the largest(difference) rather than largest (value). Something like:

```
data have (keep=id Spd:);
array Spd {*} Spd0-Spd4;
do id=1 to 5;
do i=1 to dim(Spd);
spd{i}=100+sqrt(25)*rannor(3);
end;
output;
end;
run;
data want;
set have;
array Spd{*} Spd: ;
array Difs{5} ; *Hard-coded number of elements in the Spd array! ;
iter=1;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
do while( (sd > .2) and (n>2) );
iter=iter+1;
*compute the difference between each value and the mean;
do i=1 to dim(Difs);
if missing(Spd{i}) then difs{i}=.;
else difs{i}=abs(Spd{i}-mean);
end;
*get index of variable with largest value in the array of differences;
largest=whichn(largest(1,of Difs{*}),of Difs{*});
*Set the largest value to missing;
Spd{largest}=.;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
end;
run;
options ls=max nocenter;
proc print data=want;
var id iter Spd: n sd mean largest difs:;
run;
```

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to trphelps15

12-28-2017 04:09 PM

You could just add the "smallest" function to your code. e.g.:

data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); call missing (set(whichn(smallest(1, of set(*)),of set(*)))); call missing (set(whichn(largest(1, of set(*)),of set(*)))); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;

Art, CEO, AnalystFinder.com

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to trphelps15

12-28-2017 11:51 AM

I don't do this kind of stuff much, but since it's a quiet day here, thought I'd throw out a suggestion that hopefully gives you an approach to consider.

Below uses arrays. After a record is read, it computes the mean and sd. Then uses a DO WHILE loop so that if the standard deviation is greater than a threshold, the inside of the loop will execute. Inside the loop it uses the LARGEST function to find the largest value in the array, and the WHICHN function to find the index of that variable (i.e. which position of the array it is in). Then it sets that variable to missing, and recomputes the mean and SD. I added a criterion of n>2 so that it would not attempt to compute the standard deviation of one value, which would throw an error. With this approach, it's possible that the final standard deviation could be greater than your threshold.

```
data have (keep=id Spd:);
array Spd {*} Spd0-Spd4;
do id=1 to 5;
do i=1 to dim(Spd);
spd{i}=ranuni(3);
end;
output;
end;
run;
data want;
set have;
array Spd{*} Spd: ;
iter=1;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
do while( (sd > .2) and (n>2) );
iter=iter+1;
*get index of variable with largest value in the array;
largest=whichn(largest(1,of Spd{*}),of Spd{*});
*Set the largest value to missing;
Spd{largest}=.;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
end;
run;
options ls=max nocenter;
proc print data=want;
var id iter Spd: n sd mean largest;
run;
```

Output like:

id iter Spd0 Spd1 Spd2 Spd3 Spd4 n sd mean largest 1 1 0.55489 0.91027 0.19947 0.77820 0.76481 5 0.27795 0.47590 . 1 2 0.55489 . 0.19947 0.77820 0.76481 4 0.27002 0.42517 2 1 3 0.55489 . 0.19947 . 0.76481 3 0.28577 0.36933 4 1 4 0.55489 . 0.19947 . . 2 0.25132 0.29345 5 2 1 0.90783 0.62894 0.59508 0.14938 0.19970 5 0.31821 0.31107 . 2 2 . 0.62894 0.59508 0.14938 0.19970 4 0.25378 0.26716 1 2 3 . . 0.59508 0.14938 0.19970 3 0.24410 0.22418 2 2 4 . . . 0.14938 0.19970 2 0.03558 0.17091 3 3 1 0.45796 0.57161 0.56018 0.20155 0.87107 5 0.24084 0.42274 . 3 2 0.45796 0.57161 0.56018 0.20155 . 4 0.17195 0.37454 5 4 1 0.89158 0.81784 0.06979 0.93029 0.67952 5 0.35311 0.26015 . 4 2 0.89158 0.81784 0.06979 . 0.67952 4 0.37375 0.22045 4 4 3 . 0.81784 0.06979 . 0.67952 3 0.39801 0.17623 1 4 4 . . 0.06979 . 0.67952 2 0.43115 0.12658 2 5 1 0.06471 0.23829 0.67566 0.86168 0.42737 5 0.32176 0.20300 . 5 2 0.06471 0.23829 0.67566 . 0.42737 4 0.26198 0.17043 4 5 3 0.06471 0.23829 . . 0.42737 3 0.18138 0.13643 3