Hopefully this is the right location...I've done a decent amount of digging, but can't seem to find any similar questions.
I'm using SAS Enterprise Guide 7.1...I'm trying to clean up a fairly large dataset that is calculating the harmonic mean and standard deviation of 24 hour variables (each beginning with Spd_). I would like to write the script in a way that reviews the 24 bins if the standard deviation is greater than 10, drops the largest value, re-calculates the standard deviation and mean, and continues to test each record until the standard deviation for each is below 10. The basic form that I've been trying to use is below, but it doesn't actually work. Is it possible to do this and if so, how? Thank you!
I've also attached an example of one record and what should happen with each iteration until the record no longer meets the review condition (the attachment is the excel file of the image posted below...I wasn't sure if the image would be expandable or not).
data work.analysis;
set work.analysis_set;
if StDev>10 then
do;
largest(1, of Spd_:)=.;
StDev=std(of Spd_:);
Mean=harmean(of Spd_;);
end;
repeat...
i.e.
You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:
data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); x=largest(1, of set(*)); y=whichn(x,of set(*)); call missing (set(y)); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;
Art, CEO, AnalystFinder.com
You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:
data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); x=largest(1, of set(*)); y=whichn(x,of set(*)); call missing (set(y)); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;
Art, CEO, AnalystFinder.com
Thank you, both of the answers provided were able to accomplish what I was looking for!
Unfortunately I'll have to go back to the drawing board as far as my cleanup process goes. After reviewing the results, I understand that I've artificially deflated my overall values and should be eliminating values at both ends of the spectrum (largest AND smallest values), as well as modifying my threshold further. These responses though should both provide me the groundwork to update as I further improve this specific dataset's scrubbing process.
Any advice for modifying the script to test both the largest and smallest values? I'm thinking a comparison to the mean...utilizing an absolute value of subtracting the two values and then removing the value that has the largest difference essentially.
Thank you both Art and Quentin!
I didn't test it much, but you could make an array with the differences (residuals?) of each value from the mean. And us then find the largest(difference) rather than largest (value). Something like:
data have (keep=id Spd:);
array Spd {*} Spd0-Spd4;
do id=1 to 5;
do i=1 to dim(Spd);
spd{i}=100+sqrt(25)*rannor(3);
end;
output;
end;
run;
data want;
set have;
array Spd{*} Spd: ;
array Difs{5} ; *Hard-coded number of elements in the Spd array! ;
iter=1;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
do while( (sd > .2) and (n>2) );
iter=iter+1;
*compute the difference between each value and the mean;
do i=1 to dim(Difs);
if missing(Spd{i}) then difs{i}=.;
else difs{i}=abs(Spd{i}-mean);
end;
*get index of variable with largest value in the array of differences;
largest=whichn(largest(1,of Difs{*}),of Difs{*});
*Set the largest value to missing;
Spd{largest}=.;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
end;
run;
options ls=max nocenter;
proc print data=want;
var id iter Spd: n sd mean largest difs:;
run;
You could just add the "smallest" function to your code. e.g.:
data analysis; set analysis_set; array set(24) Spd_:; do while (StDev>10); call missing (set(whichn(smallest(1, of set(*)),of set(*)))); call missing (set(whichn(largest(1, of set(*)),of set(*)))); StDev=std(of set(*)); Mean=harmean(of set(*)); end; run;
Art, CEO, AnalystFinder.com
I don't do this kind of stuff much, but since it's a quiet day here, thought I'd throw out a suggestion that hopefully gives you an approach to consider.
Below uses arrays. After a record is read, it computes the mean and sd. Then uses a DO WHILE loop so that if the standard deviation is greater than a threshold, the inside of the loop will execute. Inside the loop it uses the LARGEST function to find the largest value in the array, and the WHICHN function to find the index of that variable (i.e. which position of the array it is in). Then it sets that variable to missing, and recomputes the mean and SD. I added a criterion of n>2 so that it would not attempt to compute the standard deviation of one value, which would throw an error. With this approach, it's possible that the final standard deviation could be greater than your threshold.
data have (keep=id Spd:);
array Spd {*} Spd0-Spd4;
do id=1 to 5;
do i=1 to dim(Spd);
spd{i}=ranuni(3);
end;
output;
end;
run;
data want;
set have;
array Spd{*} Spd: ;
iter=1;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
do while( (sd > .2) and (n>2) );
iter=iter+1;
*get index of variable with largest value in the array;
largest=whichn(largest(1,of Spd{*}),of Spd{*});
*Set the largest value to missing;
Spd{largest}=.;
n=n(of Spd{*});
sd=std(of Spd{*});
mean=harmean(of Spd{*});
output;
end;
run;
options ls=max nocenter;
proc print data=want;
var id iter Spd: n sd mean largest;
run;
Output like:
id iter Spd0 Spd1 Spd2 Spd3 Spd4 n sd mean largest 1 1 0.55489 0.91027 0.19947 0.77820 0.76481 5 0.27795 0.47590 . 1 2 0.55489 . 0.19947 0.77820 0.76481 4 0.27002 0.42517 2 1 3 0.55489 . 0.19947 . 0.76481 3 0.28577 0.36933 4 1 4 0.55489 . 0.19947 . . 2 0.25132 0.29345 5 2 1 0.90783 0.62894 0.59508 0.14938 0.19970 5 0.31821 0.31107 . 2 2 . 0.62894 0.59508 0.14938 0.19970 4 0.25378 0.26716 1 2 3 . . 0.59508 0.14938 0.19970 3 0.24410 0.22418 2 2 4 . . . 0.14938 0.19970 2 0.03558 0.17091 3 3 1 0.45796 0.57161 0.56018 0.20155 0.87107 5 0.24084 0.42274 . 3 2 0.45796 0.57161 0.56018 0.20155 . 4 0.17195 0.37454 5 4 1 0.89158 0.81784 0.06979 0.93029 0.67952 5 0.35311 0.26015 . 4 2 0.89158 0.81784 0.06979 . 0.67952 4 0.37375 0.22045 4 4 3 . 0.81784 0.06979 . 0.67952 3 0.39801 0.17623 1 4 4 . . 0.06979 . 0.67952 2 0.43115 0.12658 2 5 1 0.06471 0.23829 0.67566 0.86168 0.42737 5 0.32176 0.20300 . 5 2 0.06471 0.23829 0.67566 . 0.42737 4 0.26198 0.17043 4 5 3 0.06471 0.23829 . . 0.42737 3 0.18138 0.13643 3
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.