BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JonDickens1607
Obsidian | Level 7

I have a data set containing 60 variables with positively skew distributions, bounded below by 0 but containing very large extreme values which are most likely data errors.

 

I have used the array method to replace all the erroneous negative values with 0.

 

I would like to replace all the large values in each variable with the 99th percentile value which is much closer to the rest of the values.

 

I am using SAS University Edition with SAS Studio 3.5

 

I would appreciate it if you could help me with this task.

 

I have tried to do this using arrays in a data step and the program runs without error but when I check the output data with Proc Means, the very large values are still there.

 

I am not sure how to use the Percentile Function: P99 = PCTL(99, OF D[K]) where D is an array.

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @JonDickens1607,

 

Sorry for the delay. I was away from my workstation. Big thank you to @Tom for stepping in!

 

A) As Tom has already explained, I used the min operator, which I could (or rather should) have written min rather than ><. My first idea was to write in the DO loop (in my first example) as follows:

 

if v[d+i]>p[i] then v[d+i]=p[i];

This would have been clearer. Nevertheless, in this context it is equivalent to the shorter assignment statement

v[d+i]=v[d+i] min p[i];

(now using the less cryptic notation for the min operator).

 

Honestly, in the past 18 years I have virtually always used the MIN function and virtually never the MIN operator. But these two are not equivalent. The MIN and MAX operators differ from the MIN and MAX functions in how they handle missing values: The functions compute the minimum (or maximum, respectively) of the non-missing arguments, whereas the operators determine the smaller (or greater, resp.) of the two values surrounding them in terms of the usual sort order for missing, special missing and non-missing values: 

._ < . < .A < .B < ... < .Z < non-missing numeric values

In your case it would be incorrect to write 

v[d+i]=min(v[d+i], p[i]);

if v[d+i] was a missing value for one or more observations and values of i! The MIN function would replace the missing value with the percentile p[i], which is most likely not what you want. In contrast, the assigment statement using the MIN operator leaves the missing value unchanged (and the same holds for the conditional assignment statement if v[d+i]>p[i] then v[d+i]=p[i];).

 

The RETAIN statement I used in the very first version of my post (and which I removed only a few minutes later) was redundant. (It had to do with an earlier draft version of the code.) Sorry, if it confused you. Setting D=DIM(P) was just to avoid repetitive calls of the DIM function in the code (i.e. a kind of abbreviation).

 

The reason for the somewhat "surprising" D+I index in the first example is: With the first SET statement the 3 (in your case: 60) numeric variables from dataset PERCTLS go into the program data vector (PDV). The second SET statment adds all variables from SASHELP.CLASS to the PDV. Hence, the variable list _numeric_ in the second ARRAY statement includes 2*3 (in your case 2*60) numeric variables: The percentiles (P75_1, P75_2, P75_3 in the example), followed by the original variables (AGE, HEIGHT, WEIGHT in the example). So, AGE, HEIGHT and WEIGHT are v[4], v[5] and v[6], respectively, and the corresponding percentiles P75_1, P75_2 and P75_3 are p[1], p[2] and p[3], resp. D=DIM(P) is the "offset" that must be used in the indices of array V to match elements of V and elements of P correctly.

 

This complication was one reason why I added the second example. Here, I don't use the comprehensive variable list _numeric_ for the second array definition, but a variable list comprising variables from SASHELP.HEART only. The same variable list was (of course) used in the VAR statement of the PROC SUMMARY step. Thus, the two arrays match 1:1 and there is no need for an "offset".

 

B) The VARNUM option in the PROC CONTENTS statement of the second example is important to determine the correct specification of the abbreviated variable list, because variable lists of this kind refer to the variable order in the PDV (see the documentation I linked to).

 

AgeCHDdiag-numeric-Weight

is (in this example) a shorthand notation for 

AgeCHDdiag AgeAtStart Height Weight

namely all numeric variables in the PDV from AgeCHDdiag to Weight. I added MRW and Cholesterol arbitrarily. Obviously, with 60 variables as in your case, variable lists are particularly convenient.

 

PROC SUMMARY computes the 99% percentiles and writes them to variables P99_1, P99_2, ..., P99_6 in dataset PERCTLS (which contains only one observation). Default variables _TYPE_ and _FREQ_ of the output dataset are dropped using the DROP= dataset option and the colon abbreviation for "all variables whose names start with an underscore." The list ends with P99_6 (and not P99_888) because it corresponds 1:1 to the variable list in the VAR statement, which consists of 6 variables.

 

As described earlier, the two SET statements in the data step bring the percentiles (P99_1, ...) and the original variables of SASHELP.HEART (Status, DeathCause, AgeCHDdiag, ...) together side by side in the PDV. The second ARRAY statement must use the same variable list as was used in PROC SUMMARY. Otherwise, variable values would be -- inappropriately -- compared to percentiles of different variables!

 

The assignment statement in the DO loop replaces "extremely large values" ("outliers") by the corresponding 99% percentiles, as desired. Finally, the DROP statement removes the index variable I and the percentile variables (using the colon abbreviation for "all variables whose names start with 'P99_'"), assuming you don't need the percentiles (which are constant across all observations!) in dataset WANT. They are available in dataset PERCTLS without duplicates after all.

 

The possibility to get rid of the percentile variables using the short notation P99_: was the reason for specifying these names in the OUTPUT statement of the PROC SUMMARY step and not using the AUTONAME option instead (which would have created variable names such Height_P99, Weight_P99 etc.). However, thinking again about it, the variable list 

AgeCHDdiag_P99--Cholesterol_P99

which could replace p99_: in the DROP statement would not have been much longer. So, it might even be better to use this in conjunction with the AUTONAME option in PROC SUMMARY:

output out=perctls(drop=_:) p99= /autoname;

The advantage would be that the name of each percentile variable contains the name of the corresponding original variable, which is probably helpful when dealing with 60 variables (unless these are numbered like VAR1, ..., VAR60 anyway).

 

(Edit: only removed some vertical white space)

View solution in original post

9 REPLIES 9
FreelanceReinh
Jade | Level 19

Hello @JonDickens1607,

 

It was a good idea to use arrays. But the PCTL function applied to an array operates on the array values within the respective observation, i.e., it computes one percentile per row, whereas you need one percentile per column.

 

Here is an example showing how you could proceed:

 

/* Compute 75% percentiles */

proc summary data=sashelp.class;
var _numeric_;
output out=perctls(drop=_:) p75=p75_1-p75_3;
run;

/* Replace larger values by the 75% percentiles */

data want;
if _n_=1 then set perctls;
array p _all_;
set sashelp.class;
array v _numeric_;
d=dim(p);
do i=1 to d;
  v[d+i]=v[d+i]><p[i];
end;
drop d i p75_:;
run;

proc print data=sashelp.class;
run;

proc print data=want;
run;

I use 75% percentiles for demonstration because sashelp.class contains only 19 observations. Just search and replace 75 by 99 and adapt the variable lists _numeric_ (2 instances) and the "3" in p75_1-p75_3. And please feel free to ask if anything is unclear.

 

Edit: Depending on how you define the variable lists, the indices might need to be adapted, too.

JonDickens1607
Obsidian | Level 7
Thank you for your response to my question.

I will apply your suggestion to my SAS Code and then let you know the
result.

Cheers

##- Please type your reply above this line. Simple formatting, no
attachments. -##
FreelanceReinh
Jade | Level 19

Here is another example, very similar to the previous one, but possibly even closer to your case. Your 60 variables shall be represented by 6 variables, arbitrarily selected from SASHELP.HEART.

 

proc contents data=sashelp.heart varnum; /* VARNUM is important! */
run;

Select variables and form variable list (cf. documentation) based on PROC CONTENTS output:

                    Variables in Creation Order

 #    Variable          Type    Len    Label

 1    Status            Char      5
 2    DeathCause        Char     26    Cause of Death
 3    AgeCHDdiag        Num       8    Age CHD Diagnosed
 4    Sex               Char      6
 5    AgeAtStart        Num       8    Age at Start
 6    Height            Num       8
 7    Weight            Num       8
 8    Diastolic         Num       8
 9    Systolic          Num       8
10    MRW               Num       8    Metropolitan Relative Weight
11    Smoking           Num       8
12    AgeAtDeath        Num       8    Age at Death
13    Cholesterol       Num       8
14    Chol_Status       Char     10    Cholesterol Status
...
/* Compute 99% percentiles of selected variables */

proc summary data=sashelp.heart;
var AgeCHDdiag-numeric-Weight MRW Cholesterol;
output out=perctls(drop=_:) p99=p99_1-p99_888; /* It doesn't hurt that 888 is way too large: */
run;                                           /* Only p99_1-p99_6 are in fact created.      */

/* Replace larger values by the 99% percentiles */ 

data want;
if _n_=1 then set perctls;
array p _all_;
set sashelp.heart;
array v AgeCHDdiag-numeric-Weight MRW Cholesterol;
do i=1 to dim(v);
  v[i]=v[i]><p[i];
end;
drop i p99_:;
run;

Due to the different way of specifying the variable list for array v, both arrays have now the same dimension (6), which simplifies the indices in the DO loop.

 

Checks of dataset WANT could include a calculation of maximum values of the modified variables (these should now match the percentiles in dataset PERCTLS) and a PROC COMPARE with the original dataset (showing the modified values):

proc means data=want max;
var AgeCHDdiag-numeric-Weight MRW Cholesterol;
run;

proc print data=perctls;
run;

proc compare data=sashelp.heart c=want;
run;

 

JonDickens1607
Obsidian | Level 7
Thank you for your kind assistance.

A. Please explain the syntax:

solution 1: V[D+I] = V[D+I] >< P[I]
Using Array P and Array V and Retaining D = DIM(P).

solution 2: V[I] =V[I] >

Using Array P and Array V without D

B. Please explain the logic behind your SAS Code.

Thank you

##- Please type your reply above this line. Simple formatting, no
attachments. -##

Tom
Super User Tom
Super User

This 

 V[D+I] >< P[I]

is using the min operator.  It would probably be clearer to use the MIN() function instead.

 min(V[D+I],P[I])

So the result will be that outliers will be truncated and replaced with 99th percentile value for that variable.

Note that is in only going to truncate the high values.  If you also have extremely low values then you will need to also output the 1st percentale and use a similar technique to set extremely low values to the 1st percentile value.

JonDickens1607
Obsidian | Level 7
Hi Tom,

Thanks for the clarification.

Given that I am using a function with arrays which is the best syntax to
use:

a. MIN( V[D+I] , P[I] )

b. MIN( OF V[D+I] , P[I] )

c. MIN( V[*] , P[*] )

d. MIN( OF V[*] , P[*] )

Please explain your response.

Cheers

##- Please type your reply above this line. Simple formatting, no
attachments. -##
Tom
Super User Tom
Super User

Depends what you want to do. Let's first check them out.

A is take the minimum of two values.  Not sure why the index is different, but the syntax is valid as long as the values of D+I and I are within the range of index values for the respective arrays.

 

B is valid and is really the same as A.  Note that for B SAS will first separate on the commas and then the check the use of OF keyword in the first parameter.

 

C is invalid . You cannot use * as the index of the array without using the OF keyword to support a list of values.

D is invalid also.  The first parameter (of v(*)) is valid, but you cannot use the * index in the second parameter.  Either remove the unneeded comma or add another OF keyword.

 

Note that if you corrected C and D to use valid syntax, say by using min(of v(*) p(*)). then it would work but have a totally different meaning than A and B.  A and B is taking the min of two values and the correct C is taking the min over all of the elements of two arrays.

 

FreelanceReinh
Jade | Level 19

Hi @JonDickens1607,

 

Sorry for the delay. I was away from my workstation. Big thank you to @Tom for stepping in!

 

A) As Tom has already explained, I used the min operator, which I could (or rather should) have written min rather than ><. My first idea was to write in the DO loop (in my first example) as follows:

 

if v[d+i]>p[i] then v[d+i]=p[i];

This would have been clearer. Nevertheless, in this context it is equivalent to the shorter assignment statement

v[d+i]=v[d+i] min p[i];

(now using the less cryptic notation for the min operator).

 

Honestly, in the past 18 years I have virtually always used the MIN function and virtually never the MIN operator. But these two are not equivalent. The MIN and MAX operators differ from the MIN and MAX functions in how they handle missing values: The functions compute the minimum (or maximum, respectively) of the non-missing arguments, whereas the operators determine the smaller (or greater, resp.) of the two values surrounding them in terms of the usual sort order for missing, special missing and non-missing values: 

._ < . < .A < .B < ... < .Z < non-missing numeric values

In your case it would be incorrect to write 

v[d+i]=min(v[d+i], p[i]);

if v[d+i] was a missing value for one or more observations and values of i! The MIN function would replace the missing value with the percentile p[i], which is most likely not what you want. In contrast, the assigment statement using the MIN operator leaves the missing value unchanged (and the same holds for the conditional assignment statement if v[d+i]>p[i] then v[d+i]=p[i];).

 

The RETAIN statement I used in the very first version of my post (and which I removed only a few minutes later) was redundant. (It had to do with an earlier draft version of the code.) Sorry, if it confused you. Setting D=DIM(P) was just to avoid repetitive calls of the DIM function in the code (i.e. a kind of abbreviation).

 

The reason for the somewhat "surprising" D+I index in the first example is: With the first SET statement the 3 (in your case: 60) numeric variables from dataset PERCTLS go into the program data vector (PDV). The second SET statment adds all variables from SASHELP.CLASS to the PDV. Hence, the variable list _numeric_ in the second ARRAY statement includes 2*3 (in your case 2*60) numeric variables: The percentiles (P75_1, P75_2, P75_3 in the example), followed by the original variables (AGE, HEIGHT, WEIGHT in the example). So, AGE, HEIGHT and WEIGHT are v[4], v[5] and v[6], respectively, and the corresponding percentiles P75_1, P75_2 and P75_3 are p[1], p[2] and p[3], resp. D=DIM(P) is the "offset" that must be used in the indices of array V to match elements of V and elements of P correctly.

 

This complication was one reason why I added the second example. Here, I don't use the comprehensive variable list _numeric_ for the second array definition, but a variable list comprising variables from SASHELP.HEART only. The same variable list was (of course) used in the VAR statement of the PROC SUMMARY step. Thus, the two arrays match 1:1 and there is no need for an "offset".

 

B) The VARNUM option in the PROC CONTENTS statement of the second example is important to determine the correct specification of the abbreviated variable list, because variable lists of this kind refer to the variable order in the PDV (see the documentation I linked to).

 

AgeCHDdiag-numeric-Weight

is (in this example) a shorthand notation for 

AgeCHDdiag AgeAtStart Height Weight

namely all numeric variables in the PDV from AgeCHDdiag to Weight. I added MRW and Cholesterol arbitrarily. Obviously, with 60 variables as in your case, variable lists are particularly convenient.

 

PROC SUMMARY computes the 99% percentiles and writes them to variables P99_1, P99_2, ..., P99_6 in dataset PERCTLS (which contains only one observation). Default variables _TYPE_ and _FREQ_ of the output dataset are dropped using the DROP= dataset option and the colon abbreviation for "all variables whose names start with an underscore." The list ends with P99_6 (and not P99_888) because it corresponds 1:1 to the variable list in the VAR statement, which consists of 6 variables.

 

As described earlier, the two SET statements in the data step bring the percentiles (P99_1, ...) and the original variables of SASHELP.HEART (Status, DeathCause, AgeCHDdiag, ...) together side by side in the PDV. The second ARRAY statement must use the same variable list as was used in PROC SUMMARY. Otherwise, variable values would be -- inappropriately -- compared to percentiles of different variables!

 

The assignment statement in the DO loop replaces "extremely large values" ("outliers") by the corresponding 99% percentiles, as desired. Finally, the DROP statement removes the index variable I and the percentile variables (using the colon abbreviation for "all variables whose names start with 'P99_'"), assuming you don't need the percentiles (which are constant across all observations!) in dataset WANT. They are available in dataset PERCTLS without duplicates after all.

 

The possibility to get rid of the percentile variables using the short notation P99_: was the reason for specifying these names in the OUTPUT statement of the PROC SUMMARY step and not using the AUTONAME option instead (which would have created variable names such Height_P99, Weight_P99 etc.). However, thinking again about it, the variable list 

AgeCHDdiag_P99--Cholesterol_P99

which could replace p99_: in the DROP statement would not have been much longer. So, it might even be better to use this in conjunction with the AUTONAME option in PROC SUMMARY:

output out=perctls(drop=_:) p99= /autoname;

The advantage would be that the name of each percentile variable contains the name of the corresponding original variable, which is probably helpful when dealing with 60 variables (unless these are numbered like VAR1, ..., VAR60 anyway).

 

(Edit: only removed some vertical white space)

JonDickens1607
Obsidian | Level 7
Thank you very much for your detailed explanation.

It has been most helpful and I have applied your code to my SAS program and
it has worked perfectly.

After many frustrating hours of trial and error, it feels good to have a
simple solution which I understand and could adapt in future if I need to
do so.

##- Please type your reply above this line. Simple formatting, no
attachments. -##

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2739 views
  • 2 likes
  • 3 in conversation