turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Using SAS ARRAYS to replace extreme values with a ...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 06:14 AM

I have a data set containing 60 variables with positively skew distributions, bounded below by 0 but containing very large extreme values which are most likely data errors.

I have used the array method to replace all the erroneous negative values with 0.

I would like to replace all the large values in each variable with the 99th percentile value which is much closer to the rest of the values.

I am using SAS University Edition with SAS Studio 3.5

I would appreciate it if you could help me with this task.

I have tried to do this using arrays in a data step and the program runs without error but when I check the output data with Proc Means, the very large values are still there.

I am not sure how to use the Percentile Function: P99 = PCTL(99, OF D[K]) where D is an array.

Thank you

Accepted Solutions

Solution

05-14-2016
06:31 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 06:19 PM - edited 05-14-2016 06:23 PM

Hi @JonDickens1607,

Sorry for the delay. I was away from my workstation. Big thank you to @Tom for stepping in!

**A)** As Tom has already explained, I used the min operator, which I could (or rather should) have written min rather than ><. My first idea was to write in the DO loop (in my first example) as follows:

`if v[d+i]>p[i] then v[d+i]=p[i];`

This would have been clearer. Nevertheless, in this context it is equivalent to the shorter assignment statement

`v[d+i]=v[d+i] min p[i];`

(now using the less cryptic notation for the min operator).

Honestly, in the past 18 years I have virtually always used the MIN *function* and virtually never the MIN *operator*. But these two are __not__ equivalent. The MIN and MAX *operators* differ from the MIN and MAX *functions* in how they handle missing values: The functions compute the minimum (or maximum, respectively) of the *non-missing* arguments, whereas the operators determine the smaller (or greater, resp.) of the two values surrounding them in terms of the usual sort order for missing, special missing and non-missing values:

._ < . < .A < .B < ... < .Z <non-missing numeric values

In your case it would be incorrect to write

`v[d+i]=min(v[d+i], p[i]);`

__ if__ v[d+i] was a missing value for one or more observations and values of i! The MIN

The RETAIN statement I used in the very first version of my post (and which I removed only a few minutes later) was redundant. (It had to do with an earlier draft version of the code.) Sorry, if it confused you. Setting D=DIM(P) was just to avoid repetitive calls of the DIM function in the code (i.e. a kind of abbreviation).

The reason for the somewhat "surprising" D+I index in the first example is: With the first SET statement the 3 (in your case: 60) numeric variables from dataset PERCTLS go into the program data vector (PDV). The second SET statment adds all variables from SASHELP.CLASS to the PDV. Hence, the variable list _numeric_ in the second ARRAY statement includes 2*3 (in your case 2*60) numeric variables: The percentiles (P75_1, P75_2, P75_3 in the example), followed by the original variables (AGE, HEIGHT, WEIGHT in the example). So, AGE, HEIGHT and WEIGHT are v[**4**], v[5] and v[6], respectively, and the corresponding percentiles P75_1, P75_2 and P75_3 are p[**1**], p[2] and p[3], resp. D=DIM(P) is the "offset" that must be used in the indices of array V to match elements of V and elements of P correctly.

This complication was one reason why I added the second example. Here, I don't use the comprehensive variable list _numeric_ for the second array definition, but a variable list comprising variables from SASHELP.HEART only. The same variable list was (of course) used in the VAR statement of the PROC SUMMARY step. Thus, the two arrays match 1:1 and there is no need for an "offset".

**B)** The VARNUM option in the PROC CONTENTS statement of the second example is important to determine the correct specification of the abbreviated variable list, because variable lists of this kind refer to the variable order in the PDV (see the documentation I linked to).

AgeCHDdiag-numeric-Weight

is (in this example) a shorthand notation for

AgeCHDdiag AgeAtStart Height Weight

namely all *numeric* variables in the PDV from AgeCHDdiag to Weight. I added MRW and Cholesterol arbitrarily. Obviously, with 60 variables as in your case, variable lists are particularly convenient.

PROC SUMMARY computes the 99% percentiles and writes them to variables P99_1, P99_2, ..., P99_6 in dataset PERCTLS (which contains only one observation). Default variables _TYPE_ and _FREQ_ of the output dataset are dropped using the DROP= dataset option and the colon abbreviation for "all variables whose names start with an underscore." The list ends with P99_6 (and not P99_888) because it corresponds 1:1 to the variable list in the VAR statement, which consists of 6 variables.

As described earlier, the two SET statements in the data step bring the percentiles (P99_1, ...) and the original variables of SASHELP.HEART (Status, DeathCause, AgeCHDdiag, ...) together side by side in the PDV. The second ARRAY statement *must* use the same variable list as was used in PROC SUMMARY. Otherwise, variable values would be -- inappropriately -- compared to percentiles of different variables!

The assignment statement in the DO loop replaces "extremely large values" ("outliers") by the corresponding 99% percentiles, as desired. Finally, the DROP statement removes the index variable I and the percentile variables (using the colon abbreviation for "all variables whose names start with 'P99_'"), assuming you don't need the percentiles (which are constant across all observations!) in dataset WANT. They are available in dataset PERCTLS without duplicates after all.

The possibility to get rid of the percentile variables using the short notation P99_: was the reason for specifying these names in the OUTPUT statement of the PROC SUMMARY step and *not* using the AUTONAME option instead (which would have created variable names such Height_P99, Weight_P99 etc.). However, thinking again about it, the variable list

AgeCHDdiag_P99--Cholesterol_P99

which could replace p99_: in the DROP statement would not have been much longer. So, it might even be better to use this in conjunction with the AUTONAME option in PROC SUMMARY:

`output out=perctls(drop=_:) p99= /autoname;`

The advantage would be that the name of each percentile variable contains the name of the corresponding original variable, which is probably helpful when dealing with 60 variables (unless these are numbered like VAR1, ..., VAR60 anyway).

(Edit: only removed some vertical white space)

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 07:54 AM - edited 05-14-2016 08:03 AM

Hello @JonDickens1607,

It was a good idea to use arrays. But the PCTL function applied to an array operates on the array values within the respective observation, i.e., it computes one percentile per *row*, whereas you need one percentile per *column.*

Here is an example showing how you could proceed:

```
/* Compute 75% percentiles */
proc summary data=sashelp.class;
var _numeric_;
output out=perctls(drop=_:) p75=p75_1-p75_3;
run;
/* Replace larger values by the 75% percentiles */
data want;
if _n_=1 then set perctls;
array p _all_;
set sashelp.class;
array v _numeric_;
d=dim(p);
do i=1 to d;
v[d+i]=v[d+i]><p[i];
end;
drop d i p75_:;
run;
proc print data=sashelp.class;
run;
proc print data=want;
run;
```

I use 75% percentiles for demonstration because sashelp.class contains only 19 observations. Just search and replace 75 by 99 and adapt the variable lists _numeric_ (2 instances) and the "3" in p75_1-p75_**3**. And please feel free to ask if anything is unclear.

Edit: Depending on how you define the variable lists, the indices might need to be adapted, too.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 08:01 AM

Thank you for your response to my question.

I will apply your suggestion to my SAS Code and then let you know the

result.

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

I will apply your suggestion to my SAS Code and then let you know the

result.

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 10:07 AM

Here is another example, very similar to the previous one, but possibly even closer to your case. Your 60 variables shall be represented by 6 variables, arbitrarily selected from SASHELP.HEART.

```
proc contents data=sashelp.heart varnum; /* VARNUM is important! */
run;
```

Select variables and form variable list (cf. documentation) based on PROC CONTENTS output:

Variablesin Creation Order# Variable Type Len Label 1 Status Char 5 2 DeathCause Char 26 Cause of Death 3 AgeCHDdiag Num 8 Age CHD Diagnosed 4 Sex Char 6 5 AgeAtStart Num 8 Age at Start 6 Height Num 8 7 Weight Num 8 8 Diastolic Num 8 9 Systolic Num 8 10 MRW Num 8 Metropolitan Relative Weight 11 Smoking Num 8 12 AgeAtDeath Num 8 Age at Death 13 Cholesterol Num 8 14 Chol_Status Char 10 Cholesterol Status ...

```
/* Compute 99% percentiles of selected variables */
proc summary data=sashelp.heart;
var AgeCHDdiag-numeric-Weight MRW Cholesterol;
output out=perctls(drop=_:) p99=p99_1-p99_888; /* It doesn't hurt that 888 is way too large: */
run; /* Only p99_1-p99_6 are in fact created. */
/* Replace larger values by the 99% percentiles */
data want;
if _n_=1 then set perctls;
array p _all_;
set sashelp.heart;
array v AgeCHDdiag-numeric-Weight MRW Cholesterol;
do i=1 to dim(v);
v[i]=v[i]><p[i];
end;
drop i p99_:;
run;
```

Due to the different way of specifying the variable list for array v, both arrays have now the same dimension (6), which simplifies the indices in the DO loop.

Checks of dataset WANT could include a calculation of maximum values of the modified variables (these should now match the percentiles in dataset PERCTLS) and a PROC COMPARE with the original dataset (showing the modified values):

```
proc means data=want max;
var AgeCHDdiag-numeric-Weight MRW Cholesterol;
run;
proc print data=perctls;
run;
proc compare data=sashelp.heart c=want;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 03:14 PM

Thank you for your kind assistance.

A. Please explain the syntax:

solution 1: V[D+I] = V[D+I] >< P[I]

Using Array P and Array V and Retaining D = DIM(P).

solution 2: V[I] =V[I] >

A. Please explain the syntax:

solution 1: V[D+I] = V[D+I] >< P[I]

Using Array P and Array V and Retaining D = DIM(P).

solution 2: V[I] =V[I] >

Using Array P and Array V without D

B. Please explain the logic behind your SAS Code.

Thank you

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 03:33 PM

This

` V[D+I] >< P[I]`

is using the min operator. It would probably be clearer to use the MIN() function instead.

` min(V[D+I],P[I])`

So the result will be that outliers will be truncated and replaced with 99th percentile value for that variable.

Note that is in only going to truncate the high values. If you also have extremely low values then you will need to also output the 1st percentale and use a similar technique to set extremely low values to the 1st percentile value.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 03:47 PM

Hi Tom,

Thanks for the clarification.

Given that I am using a function with arrays which is the best syntax to

use:

a. MIN( V[D+I] , P[I] )

b. MIN( OF V[D+I] , P[I] )

c. MIN( V[*] , P[*] )

d. MIN( OF V[*] , P[*] )

Please explain your response.

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

Thanks for the clarification.

Given that I am using a function with arrays which is the best syntax to

use:

a. MIN( V[D+I] , P[I] )

b. MIN( OF V[D+I] , P[I] )

c. MIN( V[*] , P[*] )

d. MIN( OF V[*] , P[*] )

Please explain your response.

Cheers

##- Please type your reply above this line. Simple formatting, no

attachments. -##

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 04:06 PM

Depends what you want to do. Let's first check them out.

A is take the minimum of two values. Not sure why the index is different, but the syntax is valid as long as the values of D+I and I are within the range of index values for the respective arrays.

B is valid and is really the same as A. Note that for B SAS will first separate on the commas and then the check the use of OF keyword in the first parameter.

C is invalid . You cannot use * as the index of the array without using the OF keyword to support a list of values.

D is invalid also. The first parameter (of v(*)) is valid, but you cannot use the * index in the second parameter. Either remove the unneeded comma or add another OF keyword.

Note that if you corrected C and D to use valid syntax, say by using min(of v(*) p(*)). then it would work but have a totally different meaning than A and B. A and B is taking the min of two values and the correct C is taking the min over all of the elements of two arrays.

Solution

05-14-2016
06:31 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 06:19 PM - edited 05-14-2016 06:23 PM

Hi @JonDickens1607,

Sorry for the delay. I was away from my workstation. Big thank you to @Tom for stepping in!

**A)** As Tom has already explained, I used the min operator, which I could (or rather should) have written min rather than ><. My first idea was to write in the DO loop (in my first example) as follows:

`if v[d+i]>p[i] then v[d+i]=p[i];`

This would have been clearer. Nevertheless, in this context it is equivalent to the shorter assignment statement

`v[d+i]=v[d+i] min p[i];`

(now using the less cryptic notation for the min operator).

Honestly, in the past 18 years I have virtually always used the MIN *function* and virtually never the MIN *operator*. But these two are __not__ equivalent. The MIN and MAX *operators* differ from the MIN and MAX *functions* in how they handle missing values: The functions compute the minimum (or maximum, respectively) of the *non-missing* arguments, whereas the operators determine the smaller (or greater, resp.) of the two values surrounding them in terms of the usual sort order for missing, special missing and non-missing values:

._ < . < .A < .B < ... < .Z <non-missing numeric values

In your case it would be incorrect to write

`v[d+i]=min(v[d+i], p[i]);`

__ if__ v[d+i] was a missing value for one or more observations and values of i! The MIN

The RETAIN statement I used in the very first version of my post (and which I removed only a few minutes later) was redundant. (It had to do with an earlier draft version of the code.) Sorry, if it confused you. Setting D=DIM(P) was just to avoid repetitive calls of the DIM function in the code (i.e. a kind of abbreviation).

The reason for the somewhat "surprising" D+I index in the first example is: With the first SET statement the 3 (in your case: 60) numeric variables from dataset PERCTLS go into the program data vector (PDV). The second SET statment adds all variables from SASHELP.CLASS to the PDV. Hence, the variable list _numeric_ in the second ARRAY statement includes 2*3 (in your case 2*60) numeric variables: The percentiles (P75_1, P75_2, P75_3 in the example), followed by the original variables (AGE, HEIGHT, WEIGHT in the example). So, AGE, HEIGHT and WEIGHT are v[**4**], v[5] and v[6], respectively, and the corresponding percentiles P75_1, P75_2 and P75_3 are p[**1**], p[2] and p[3], resp. D=DIM(P) is the "offset" that must be used in the indices of array V to match elements of V and elements of P correctly.

This complication was one reason why I added the second example. Here, I don't use the comprehensive variable list _numeric_ for the second array definition, but a variable list comprising variables from SASHELP.HEART only. The same variable list was (of course) used in the VAR statement of the PROC SUMMARY step. Thus, the two arrays match 1:1 and there is no need for an "offset".

**B)** The VARNUM option in the PROC CONTENTS statement of the second example is important to determine the correct specification of the abbreviated variable list, because variable lists of this kind refer to the variable order in the PDV (see the documentation I linked to).

AgeCHDdiag-numeric-Weight

is (in this example) a shorthand notation for

AgeCHDdiag AgeAtStart Height Weight

namely all *numeric* variables in the PDV from AgeCHDdiag to Weight. I added MRW and Cholesterol arbitrarily. Obviously, with 60 variables as in your case, variable lists are particularly convenient.

PROC SUMMARY computes the 99% percentiles and writes them to variables P99_1, P99_2, ..., P99_6 in dataset PERCTLS (which contains only one observation). Default variables _TYPE_ and _FREQ_ of the output dataset are dropped using the DROP= dataset option and the colon abbreviation for "all variables whose names start with an underscore." The list ends with P99_6 (and not P99_888) because it corresponds 1:1 to the variable list in the VAR statement, which consists of 6 variables.

As described earlier, the two SET statements in the data step bring the percentiles (P99_1, ...) and the original variables of SASHELP.HEART (Status, DeathCause, AgeCHDdiag, ...) together side by side in the PDV. The second ARRAY statement *must* use the same variable list as was used in PROC SUMMARY. Otherwise, variable values would be -- inappropriately -- compared to percentiles of different variables!

The assignment statement in the DO loop replaces "extremely large values" ("outliers") by the corresponding 99% percentiles, as desired. Finally, the DROP statement removes the index variable I and the percentile variables (using the colon abbreviation for "all variables whose names start with 'P99_'"), assuming you don't need the percentiles (which are constant across all observations!) in dataset WANT. They are available in dataset PERCTLS without duplicates after all.

The possibility to get rid of the percentile variables using the short notation P99_: was the reason for specifying these names in the OUTPUT statement of the PROC SUMMARY step and *not* using the AUTONAME option instead (which would have created variable names such Height_P99, Weight_P99 etc.). However, thinking again about it, the variable list

AgeCHDdiag_P99--Cholesterol_P99

which could replace p99_: in the DROP statement would not have been much longer. So, it might even be better to use this in conjunction with the AUTONAME option in PROC SUMMARY:

`output out=perctls(drop=_:) p99= /autoname;`

The advantage would be that the name of each percentile variable contains the name of the corresponding original variable, which is probably helpful when dealing with 60 variables (unless these are numbered like VAR1, ..., VAR60 anyway).

(Edit: only removed some vertical white space)

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-14-2016 06:32 PM

Thank you very much for your detailed explanation.

It has been most helpful and I have applied your code to my SAS program and

it has worked perfectly.

After many frustrating hours of trial and error, it feels good to have a

simple solution which I understand and could adapt in future if I need to

do so.

##- Please type your reply above this line. Simple formatting, no

attachments. -##

It has been most helpful and I have applied your code to my SAS program and

it has worked perfectly.

After many frustrating hours of trial and error, it feels good to have a

simple solution which I understand and could adapt in future if I need to

do so.

##- Please type your reply above this line. Simple formatting, no

attachments. -##