Mindy1
Calcite | Level 5

Hello,

 

I have a large dataset that I need to clean prior to analysis. The dataset contains 84 variables related to scores on an exam. The scores should range between 50 and 500, but there are instances when values fall outside of that range. I need to reduce the influence of these outlying scores by changing them to the nearest value in the dataset that is within range. Here is a small sample:

Subj   Var1
1      20
2      520
3      55
4      490
5      19

If the code works, then I should end up changing subjects 1 and 5's scores for Var1 to 55 (the lowest value within the range 50-500) and subject 2's score for Var1 to 490 (the highest value within range). There can be multiple subjects that have scores below 50 or above 500.

Thanks for your help! 

Astounding
PROC Star

It's probably simplest to use a single DATA step, along these lines:

 

data want;
   array actuals {84} var1-var84;
   array maxs {84} max1-max84;
   array mins {84} min1-min84;

   /* First pass: capture each variable's in-range minimum and maximum */
   do until (done1);
      set have end=done1;
      do _n_=1 to 84;
         if (maxs{_n_} < actuals{_n_} <= 500) then maxs{_n_} = actuals{_n_};
         if (50 <= actuals{_n_}) and (mins{_n_}=. or actuals{_n_} < mins{_n_}) then mins{_n_} = actuals{_n_};
      end;
   end;

   /* Second pass: replace out-of-range values with those minimums/maximums */
   do until (done2);
      set have end=done2;
      do _n_=1 to 84;
         if actuals{_n_} > 500 then actuals{_n_} = maxs{_n_};
         if actuals{_n_} < 50 then actuals{_n_} = mins{_n_}; /* note: a missing value would also be raised to the minimum here */
      end;
      output;
   end;

   drop max1-max84 min1-min84;  /* drop the helper variables */
run;

 

The top loop captures the 84 minimums and maximums, and the bottom loop applies them to the data values.  It's untested, but should need minor tweaks at most.

 

Good luck.

Reeza
Super User

Here's another way that uses some more basic coding. Use PROC MEANS to find the min/max within the set boundaries, then merge the results in and check each observation against the boundaries.

 

data have;
input Subj   Var1;
cards; 
1          20
2          520
3          55
4          490
5          19
;
run;

proc means data=have noprint;
	where var1 between 50 and 500;
	var var1;
	output out=min_max min(var1)=var1_min max(var1)=var1_max;
run;

data want;
	if _n_=1 then set min_max (drop= _:); /*Merge in the min/max from proc means*/
	set have;

	if 50<=var1<=500 then var1_capped=var1;
	else if var1<50 then var1_capped=var1_min;
	else if var1>500 then var1_capped=var1_max;

	drop var1_min var1_max;
run;
Astounding
PROC Star

Reeza,

 

That's viable for 1 variable ... I'd probably do it that way myself for 1 variable.  But with 84 variables you would need 84 PROC MEANS ... it can be done, but it doesn't seem right.

Reeza
Super User

I missed the 84 variables part 🙂

You're right then!

Quentin
Super User

Oops, I missed the 84 variables part also.  That said, my first thought was a SQL solution.  It could be macroized to handle 84 variables (a rough macro sketch of that idea follows the query below), or the data could be transposed to a normalized format and the SQL changed to compute the min and max within each group.

 

I'm still practicing SQL, I'm sure there are better SQL approaches.   This one just computes the min and max, and then joins it on to the HAVE data.  It's basically the SQL version of @Reeza's solution.

 

data have;
input Subj   Var1;
cards; 
1          20
2          520
3          55
4          490
5          19
;
run;


proc sql;
  create table want as
  select Subj
        ,Var1
        ,case when(.Z<Var1<min) then min 
              when(Var1>max) then max 
              else Var1 
        end as NewVar1
  from have
      ,(select min(var1) as min 
              ,max(var1) as max
        from have
        where 50<=Var1<=500
        )
  ;
quit;
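For what it's worth, here is a rough sketch of that macroization idea (untested, and not anyone's posted solution): it assumes the variables are named Var1-Var84 in a dataset named HAVE, as in the thread, and the macro name CAP_VARS and the helper names _dummy, min&i and max&i are made up for the illustration. A %DO loop writes the 84 CASE expressions and builds a one-row inline view of in-range mins and maxes, so the table is only scanned twice.

%macro cap_vars(nvar=84, low=50, high=500);
   proc sql;
      create table want as
      select h.Subj
      %do i = 1 %to &nvar;
         ,case when h.Var&i > .z and h.Var&i < &low then s.min&i  /* below range: use in-range minimum */
               when h.Var&i > &high                 then s.max&i  /* above range: use in-range maximum */
               else h.Var&i
          end as Var&i
      %end;
      from have as h,
           /* one-row inline view holding the in-range min/max of every variable */
           (select 1 as _dummy
            %do i = 1 %to &nvar;
               ,min(case when Var&i between &low and &high then Var&i end) as min&i
               ,max(case when Var&i between &low and &high then Var&i end) as max&i
            %end;
            from have) as s;
   quit;
%mend cap_vars;

%cap_vars(nvar=84)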
Nazar
Calcite | Level 5

If I understood the question properly, it has a simple solution. 

1. First find a MAX and a MIN value within the range. In this case, 490 and 55.

2. Then, check each value for var1

     if var1 < 50 then var1 = MIN

     else if var1 > 500 then var1 = MAX

 

The point is that all that matters for this solution is the two MIN/MAX values within range. It is the simplest solution.
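For one variable, a minimal sketch of those two steps might look like this (untested illustration; it assumes the sample data are in a dataset named HAVE, and it leaves missing values unchanged, which the steps above do not spell out):

/* Step 1: find the in-range MIN and MAX (55 and 490 in the sample) */
proc sql noprint;
   select min(var1), max(var1)
      into :min1 trimmed, :max1 trimmed
      from have
      where var1 between 50 and 500;
quit;

/* Step 2: check each value of var1 against the boundaries */
data want;
   set have;
   if .z < var1 < 50 then var1 = &min1;    /* below range: raise to the in-range minimum */
   else if var1 > 500 then var1 = &max1;   /* above range: lower to the in-range maximum */
run;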

 

PGStats
Opal | Level 21

I prefer the clarity of SQL, with transpose/untranspose :

 


proc transpose data=have out=list name=scoreId;
by subj;
var var:;
run;

proc sql;
create table wantList as
select 
    subj, 
    scoreId,
    case 
        when col1 is missing then .
        when col1 < 50  
            then (select min(col1) from list where scoreId=a.scoreId and col1>= 50 )
        when col1 > 500 
            then (select max(col1) from list where scoreId=a.scoreId and col1<= 500)
        else col1 end as score
from list as a
order by subj, scoreId;
quit;

proc transpose data=wantList out=want(drop=_name_);
by subj;
var score;
id scoreId;
run;
PG
Quentin
Super User

Thanks @PGStats, that's just the sort of SQL magic I was hoping someone would post.  I don't think I've ever written one of those correlated sub-queries, but I'm always impressed by what they can do.

RogerSpeas
SAS Employee

Did I understand that you were trying to set the bottom/top (low and high) values for the 84 columns independently?

 

If so, the SQL magic might require that you write 168 SQL sub-queries, and thus 168 table scans.

Astounding's code, using the DATA step, reads the table just twice.

 

But... we could super-charge the SQL code.

 

I might suggest adding some SAS magic dust, which could reduce the 168 table scans to one scan for the summary totals.

 

   data classv / view=classv;
      set sashelp.class;
      array n(*) age--weight;
      do i=1 to dim(n); drop i;
         if n{i}<=50 then n{i}=.;
         else if n{i}>=100 then n{i}=.;
      end;
   run;

 

The SQL magic would look like this...

 

   proc sql;
      create view classv as
         select case when (var1 not between 50 and 500) then . else var1 end as var1,
                ...
                case when (var84 not between 50 and 500) then . else var84 end as var84
         from class;
   quit;

 

(we won't split hairs yet, but recoding out-of-range values to missing takes 7 lines of DATA step view code vs. 88 lines of SQL view code)

 

What's my point... the magic here is to first recode the values outside of the 50-500 range to missing, without reading/writing the table, by using a view (SQL or DATA, your preference).  With the values recoded to missing, and unlike the subqueries, which required independent WHERE expressions, the SQL MIN/MAX summaries of this view return exactly the in-range values, because MIN and MAX ignore missing values.  There is no need for independent WHERE expressions; nothing is outside the range for any of the columns except missing.

 

So, by re-coding extreme values to missing, we can summarize in one pass and put the results in a table (or inline view)...

 

   proc sql;
      create table lowhigh as
         select min(age) as age_min, max(age) as age_max,
                ...
                min(var84) as var84_min, max(var84) as var84_max
         from classv;
   quit;

 

The point again is to avoid 168 subqueries/table scans.  However, the summary query would still need 168 summary expressions written for the 84 columns... so maybe another 88 lines of SQL code.

 

Now, Reeza's suggestion was PROC MEANS, later recanted when it was realized that one might be looking at writing 84 PROC MEANS steps, each with its own independent WHERE expression, and thus 84 table scans, just like the subqueries.  However, the magic of the recoded-to-missing view is that one can easily consume the view with a single pass of PROC MEANS, much like the SQL.

 

   data classv / view=classv;
      set sashelp.class;
      array n(*) age--weight;
      do i=1 to dim(n); drop i;
         if n{i}<=50 then n{i}=.;
         else if n{i}>=100 then n{i}=.;
      end;
   run;

   proc means data=classv noprint;
      var age--weight;
      output out=lowhigh min= max= / autoname;
   run;

 

If you can live with some more SAS magic, the AUTONAME option avoids writing those pesky 168 SQL summary expressions.

 

But, wait, we still have to do the bottom/top coding.  We would next do a blind many-to-one join of the lowhigh bottom/top coding values back into the detail table...

 

     proc sql;
        create view classv as
           select case when (var1<=50)   then var1_min
                       when (var1>=500)  then var1_max
                       else var1 end as var1,
                  ....
                  case when (var84<=50)  then var84_min
                       when (var84>=500) then var84_max
                       else var84 end as var84
           from class, lowhigh;
     quit;

 

We can call that, what, 84 case expressions or 250+ lines of code.  

So in SQL: the recode-to-missing view, 84 lines; the lowhigh summary table, 168 summary expressions; and finally the bottom/top coding of the 84 columns, 84 case expressions... but just two table scans.  (Albeit, you could write a macro program.)

 

The data step code for bottom/top coding...

 

    data newclass;
       set sashelp.class;
       if _n_=1 then set lowhigh;

       array actual{3} age--weight;
       array stat{2,3} age_min--weight_max;  /* row 1 = the mins, row 2 = the maxs (AUTONAME order) */

       do _i=1 to dim(actual);
          if stat{1,_i}=. then actual{_i}=75;                 /* no in-range value found for this column */
          else if actual{_i}<=50 then actual{_i}=stat{1,_i};  /* below range: raise to the minimum */
          else if actual{_i}>=100 then actual{_i}=stat{2,_i}; /* above range: lower to the maximum */
       end;

       drop _: age_min--weight_max;
   run;

 

DATA step view, 8 lines of code; the PROC MEANS summarization, 4 lines; the bottom/top coding, 10 lines (all of which could be tightened).

 

Astounding's single DATA step code, 20 lines.

 

Where's the beef/magic?

 

 

Quentin
Super User

@RogerSpeas I think you missed some of the magic in @PGStats's approach.  It transposes the data into a normalized format, and after that it does not require 84 subqueries or 84 passes.  The code shown is the complete solution (as I understand it).

 

I suppose it's one pass for the transpose step, I'm not sure how many passes the SQL optimizer will use (guessing fewer than 5, maybe only 1; perhaps it matters whether the data are indexed and/or hashable...), and another pass for the final transpose back to de-normalize the data.  Of course some might argue that the data are better left in the normalized structure...

RogerSpeas
SAS Employee

Yup... I missed the transpose.  I saw col1 but was thinking var1.

 

However, the code is recursive, as the subquery is a correlated subquery....

 

proc sql;
   create table wantList as
      select subj, scoreId,
             case when col1 is missing then .
                  when col1 < 50
                       then (select min(col1) from list where scoreId=a.scoreId and col1>=50)
                  when col1 > 500
                       then (select max(col1) from list where scoreId=a.scoreId and col1<=500)
                  else col1 end as score
      from list as a
      order by subj, scoreId;
quit;

 

Let's say the table is 1000 rows long and 85 columns wide... the transposed table would be 84,000 rows/values.

For each of those 84,000 rows, if an outlier occurs, the summary subqueries would need to scan LIST.  So if out-of-range values were rare in the WANTLIST table, let's say 20 values out of range, the LIST table would be scanned/subqueried 20 times.  Additionally, if the table is small, the reads could come from cache.

 

 

data have;
   array var(84);
   do subj=1 to 1000;
      do i=1 to 84; drop i;
         var{i}= (ranuni(10)*480) + 30;
      end;
      output;
   end;
run;

proc transpose data=have out=list(index=(scoreId)) name=scoreId;
   by subj;
   var var:;
run;

 

options msglevel=i;

proc sql _method;
   create table wantList as
...

 

 

And yes, the impact on I/O from the recursive correlated query could be reduced if an INDEX existed.  However, the info notes show the use of the INDEX was cancelled... again, likely because the table is somewhat small.

 

One could instead pre-summarize the low/high values into a table and JOIN.  As the low/high table (3 x 84) would be quite small, I would agree the JOIN would likely use a HASH (as it did below).

 

   proc sql _method;
      create table lowHigh as
         select scoreid, min(col1) as min, max(col1) as max
         from list
         where col1 between 50 and 500
         group by scoreid;

      create table wantList as
         select subj, l.scoreId,
                case when col1 is missing then .
                     when col1 < 50  then min
                     when col1 > 500 then max
                     else col1 end as score
         from list as l, lowHigh as s
         where l.scoreid = s.scoreid
         order by subj, l.scoreId;
   quit;

 

So, I would agree that transposing the table in SAS simplifies the problem.  My previous post was to illustrate that SQL-centric code could be both code-intense and/or I/O-intense.  Either using ARRAYS to deal with the columns horizontally or using TRANSPOSE to array those columns vertically would significantly reduce the code compared with the pure SQL alternative.

Here, transpose, summary query, bottom/top code query, transpose... 4 table scans.  

 

Whichever method is used, I believe it is useful to create a summary table, lowHigh, to join back with the detail data.  In my prior post, I was just trying to point out that the recoded view would simplify Reeza's PROC MEANS example, and could likewise simplify some of the pure SQL summary code.

 

 

Quentin
Super User

Thanks for the lesson, @RogerSpeas.  As I mentioned, I don't write a lot of SQL.  I guess I had hoped that with the correlated subquery the optimizer might be smart enough to sort both tables (if not indexed) and then approach the correlated sub-query with something like the double-DOW loop approach that Astounding used (the point being to process each by-group in turn, so it would only need one pass of each dataset after the sorts).  But that's probably hoping for too much.  Correlated sub-queries make me scratch my head.
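For the curious, here is a rough sketch of what that by-group double-DOW idea might look like on the transposed LIST data (untested; it assumes PGStats's column names SCOREID and COL1, needs LIST sorted by scoreId first, and WANTLIST would need re-sorting by SUBJ before transposing back):

/* sort the normalized data so each scoreId forms one BY group */
proc sort data=list out=list_sorted;
   by scoreId subj;
run;

data wantList;
   /* first loop: count the group and find its in-range min/max */
   do n = 1 by 1 until (last.scoreId);
      set list_sorted;
      by scoreId;
      if 50 <= col1 <= 500 then do;
         if min = . or col1 < min then min = col1;
         if col1 > max then max = col1;
      end;
   end;
   /* second loop: re-read the same n records and apply the caps */
   do _i = 1 to n;
      set list_sorted;
      if col1 = . then score = .;
      else if col1 < 50 then score = min;   /* stays missing if the group had no in-range values */
      else if col1 > 500 then score = max;
      else score = col1;
      output;
   end;
   drop n _i min max;
run;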

 

And I like your last solution, with joining on the pre-summarized table.  That's the kind of SQL where I don't have to scratch my head, and I think "wow, it's fun to read SQL, because all you have to think about is sets."  I spent so much time learning the intricacies of the DATA step processes for merges and BY-group interleaving and DOW loops... it took me a long time to get over the fact that with SQL, I wasn't telling SAS how to do something, I was telling SAS what I wanted done.

 

Cheers,

--Q.

