- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I have been working on some product purchase propensity models and have started using actual Dollar amount variables. It was recommended to me that a cap extreme/ outlining values before inputting them into a model. I literally have hundreds of dollar value variables and in order to do this it will obviously take a tremendous amount of time.
Does anyone have a macro or familiar with a macro which can accomplish this large task? I have been searching around the web and having trouble finding a macro and the ones I have found are difficult to understand (I’m not much of a macro guy).
So,
1.a macro which will determine the appropriate “cap” based on the dollar distributions and then “cap” the variables
and/or
2.a macro which will let me select a percentile value (say 99) and then cap all variables at or hier than that value
Any assistance will be greatly appreciated!! Thanks!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Replace the sections in RED. You need to provide the data set name and the variable list to the proc means. In the data step at the bottom, make sure to also specify the dataset to be capped. The rest of the code is run as is.
*Calculate IQR and first/third quartiles;
proc means data=[YOUR DATA SET NAME] stackods n qrange p1 p99;
var [YOUR VARIABLE LIST GOES HERE];
ods output summary=ranges;
run;
*macro to cap outliers, run the rest as is
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p1;
upper=p99;
string = catt('%cap(dset=[Your Dataset Name], var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm actually probably just looking for a macro which will cap all values above the 99th percentile not a macro which determine the appropriate cap and then cap the vars
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here's one I wrote that does it for upper/lower. Change the PROC MEANS if you want 99% and 1%.
I don't know how efficient it is in terms of code, but in terms of your programming time, it's pretty much exactly what you want. I think Rick Wicklin has a recent post on winsorization. http://blogs.sas.com/content/iml/2017/02/08/winsorization-good-bad-and-ugly.html
As noted, not sure if this handles missing correctly - I suspect it doesn't but it would be an easy fix.
https://gist.github.com/statgeek/31316a678433a1db8136
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p1 p99;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p1;
upper=p99;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Rezza,
Thanks so much.
I'm a little confused how this works
proc means data=sashelp.class stackods n qrange p1 p99;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
I have millions of observations and Hundreds of variables I cant do this check for each observation
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
Where do I input the variable list?
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
Is the syntax I need to add look like this?
%cap(daset=have var=list of vars lower=P1 upper=P99);
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p1;
upper=p99;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@RobertNYC wrote:
Hi Rezza,
Thanks so much.
I'm a little confused how this works
proc means data=sashelp.class stackods n qrange p1 p99;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
I have millions of observations and Hundreds of variables I cant do this check for each observationThis is to create a sample dataset to test. You don't need this step, you use your original data
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
Where do I input the variable list?
Variable list should be provided in the proc means VAR statement in the first step
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;%mend;
Is the syntax I need to add look like this?
%cap(daset=have var=list of vars lower=P1 upper=P99);
No, the macro is automatically executed for every variable in the output from the proc means. Var is designed for a single variable, not multiple variables.
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p1;
upper=p99;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
I'll repost an answer that shows how you would use it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Replace the sections in RED. You need to provide the data set name and the variable list to the proc means. In the data step at the bottom, make sure to also specify the dataset to be capped. The rest of the code is run as is.
*Calculate IQR and first/third quartiles;
proc means data=[YOUR DATA SET NAME] stackods n qrange p1 p99;
var [YOUR VARIABLE LIST GOES HERE];
ods output summary=ranges;
run;
*macro to cap outliers, run the rest as is
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p1;
upper=p99;
string = catt('%cap(dset=[Your Dataset Name], var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;