BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
trphelps15
Fluorite | Level 6

Hopefully this is the right location...I've done a decent amount of digging, but can't seem to find any similar questions.

 

I'm using SAS Enterprise Guide 7.1...I'm trying to clean up a fairly large dataset that is calculating the harmonic mean and standard deviation of 24 hour variables (each beginning with Spd_). I would like to write the script in a way that reviews the 24 bins if the standard deviation is greater than 10, drops the largest value, re-calculates the standard deviation and mean, and continues to test each record until the standard deviation for each is below 10. The basic form that I've been trying to use is below, but it doesn't actually work. Is it possible to do this and if so, how? Thank you!

 

I've also attached an example of one record and what should happen with each iteration until the record no longer meets the review condition (the attachment is the excel file of the image posted below...I wasn't sure if the image would be expandable or not).

 

data work.analysis;

   set work.analysis_set;

   if StDev>10 then

      do;

      largest(1, of Spd_:)=.;

      StDev=std(of Spd_:);

      Mean=harmean(of Spd_;);

   end;

   repeat...

 

i.e.

Spd_CalculationExample.JPG

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    x=largest(1, of set(*));
    y=whichn(x,of set(*));
    call missing (set(y));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

5 REPLIES 5
art297
Opal | Level 21

You didn't indicate whether you just want to output the final result for each record, or all of the iterations, but you could modify the following accordingly. I also didn't build in any safeguards if the conditions aren't solvable, but those would be easy to add as well:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    x=largest(1, of set(*));
    y=whichn(x,of set(*));
    call missing (set(y));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

trphelps15
Fluorite | Level 6

Thank you, both of the answers provided were able to accomplish what I was looking for!

 

Unfortunately I'll have to go back to the drawing board as far as my cleanup process goes. After reviewing the results, I understand that I've artificially deflated my overall values and should be eliminating values at both ends of the spectrum (largest AND smallest values), as well as modifying my threshold further. These responses though should both provide me the groundwork to update as I further improve this specific dataset's scrubbing process.

 

Any advice for modifying the script to test both the largest and smallest values? I'm thinking a comparison to the mean...utilizing an absolute value of subtracting the two values and then removing the value that has the largest difference essentially.

 

Thank you both Art and Quentin!

Quentin
Super User

I didn't test it much, but you could make an array with the differences (residuals?) of each value from the mean.  And us then find the largest(difference) rather than largest (value).  Something like:

 

data have (keep=id Spd:);
 array Spd {*} Spd0-Spd4;
 do id=1 to 5;
   do i=1 to dim(Spd);
     spd{i}=100+sqrt(25)*rannor(3);
   end;
   output;
 end;
run;

data want; 
  set have;

  array Spd{*} Spd: ;

  array Difs{5} ; *Hard-coded number of elements in the Spd array! ;

  iter=1;
  n=n(of Spd{*});
  sd=std(of Spd{*});
  mean=harmean(of Spd{*});
  output;

  do while( (sd > .2) and (n>2) );
    iter=iter+1;

    *compute the difference between each value and the mean;
    do i=1 to dim(Difs);
      if missing(Spd{i}) then difs{i}=.;
      else difs{i}=abs(Spd{i}-mean);
    end;

    *get index of variable with largest value in the array of differences;
    largest=whichn(largest(1,of Difs{*}),of Difs{*});

    *Set the largest value to missing;
    Spd{largest}=.; 

    n=n(of Spd{*});
    sd=std(of Spd{*});
    mean=harmean(of Spd{*});
    output;   

  end;

run;

options ls=max nocenter;
proc print data=want;
 var id iter Spd: n sd mean largest difs:;
run;
BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
art297
Opal | Level 21

You could just add the "smallest" function to your code. e.g.:

data analysis;
  set analysis_set;
  array set(24) Spd_:;
  do while (StDev>10);
    call missing (set(whichn(smallest(1, of set(*)),of set(*))));
    call missing (set(whichn(largest(1, of set(*)),of set(*))));
    StDev=std(of set(*));
    Mean=harmean(of set(*));
  end;
run;

Art, CEO, AnalystFinder.com

 

Quentin
Super User

I don't do this kind of stuff much, but since it's a quiet day here, thought I'd throw out a suggestion that hopefully gives you an approach to consider.

 

Below uses arrays.  After a record is read, it computes the mean and sd.  Then uses a DO WHILE loop so that if the standard deviation is greater than a threshold, the inside of the loop will execute.  Inside the loop it uses the LARGEST function to find the largest value in the array, and the WHICHN function to find the index of that variable (i.e. which position of the array it is in).  Then it sets that variable to missing, and recomputes the mean and SD.  I added a criterion of n>2 so that it would not attempt to compute the standard deviation of one value, which would throw an error.  With this approach, it's possible that the final standard deviation could be greater than your threshold.

 

data have (keep=id Spd:);
 array Spd {*} Spd0-Spd4;
 do id=1 to 5;
   do i=1 to dim(Spd);
     spd{i}=ranuni(3);
   end;
   output;
 end;
run;

data want; 
  set have;

  array Spd{*} Spd: ;

  iter=1;
  n=n(of Spd{*});
  sd=std(of Spd{*});
  mean=harmean(of Spd{*});
  output;

  do while( (sd > .2) and (n>2) );
    iter=iter+1;

    *get index of variable with largest value in the array;
    largest=whichn(largest(1,of Spd{*}),of Spd{*});

    *Set the largest value to missing;
    Spd{largest}=.; 

    n=n(of Spd{*});
    sd=std(of Spd{*});
    mean=harmean(of Spd{*});
    output;   
  end;

run;

options ls=max nocenter;
proc print data=want;
 var id iter Spd: n sd mean largest;
run;

 

Output like:

id    iter      Spd0       Spd1       Spd2       Spd3       Spd4     n       sd        mean     largest

 1      1     0.55489    0.91027    0.19947    0.77820    0.76481    5    0.27795    0.47590       .
 1      2     0.55489     .         0.19947    0.77820    0.76481    4    0.27002    0.42517       2
 1      3     0.55489     .         0.19947     .         0.76481    3    0.28577    0.36933       4
 1      4     0.55489     .         0.19947     .          .         2    0.25132    0.29345       5
 2      1     0.90783    0.62894    0.59508    0.14938    0.19970    5    0.31821    0.31107       .
 2      2      .         0.62894    0.59508    0.14938    0.19970    4    0.25378    0.26716       1
 2      3      .          .         0.59508    0.14938    0.19970    3    0.24410    0.22418       2
 2      4      .          .          .         0.14938    0.19970    2    0.03558    0.17091       3
 3      1     0.45796    0.57161    0.56018    0.20155    0.87107    5    0.24084    0.42274       .
 3      2     0.45796    0.57161    0.56018    0.20155     .         4    0.17195    0.37454       5
 4      1     0.89158    0.81784    0.06979    0.93029    0.67952    5    0.35311    0.26015       .
 4      2     0.89158    0.81784    0.06979     .         0.67952    4    0.37375    0.22045       4
 4      3      .         0.81784    0.06979     .         0.67952    3    0.39801    0.17623       1
 4      4      .          .         0.06979     .         0.67952    2    0.43115    0.12658       2
 5      1     0.06471    0.23829    0.67566    0.86168    0.42737    5    0.32176    0.20300       .
 5      2     0.06471    0.23829    0.67566     .         0.42737    4    0.26198    0.17043       4
 5      3     0.06471    0.23829     .          .         0.42737    3    0.18138    0.13643       3

BASUG is hosting free webinars Next up: Jane Eslinger presenting PROC REPORT and the ODS EXCEL destination on Mar 27 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1133 views
  • 1 like
  • 3 in conversation