Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 09-11-2015 10:41 PM
(4209 views)

Hi everyone, I need your help on some numerical integration work using Simpon's rule - with 101 grid. In the following code I simulated the data, estimated the kernel densities for variable x by group c, and visualized. I need to calculate the overlap between the the two krenel densities using numerical integration with Simpson's rule - with 101 grid. The overlap is calculated by integrating from -inf to +inf over the min of two curves MIN [ f(x_c=1), f(x_c=2) ].

```
/* Overlapping Coefficient */
/* Simulate data */
data TwoGroups;
call STREAMINIT(1982);
do i=1 to 100;
c = 1;
x = RAND("NORMAL", 10, 3);
output;
end;
do i=101 to 200;
c = 2;
x = RAND("NORMAL", 15, 3);
output;
end;
run;
/* Estimate Kernel Density */
proc kde data=TwoGroups;
univar x / out=KernelTwoGroups;
by c;
run;
/* Plot Kernel Density */
proc sgplot data=KernelTwoGroups;
series y=density x=value / group=c;
run;
```

Thank you for your help,

Metin

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can't use QUAD because there is not an analytical formula for the KDE curves.

You can look up the Simpson's rule algorithm on Wikipedia or some other source. I doubt it will make much difference. Maybe +/-0.01 from the trapezoidal algorithm. If you are concerned about precision, increase the NGRID= option on the UNIVAR statement of PROC KDE.

You have not said what you are using this for, but I would be very careful in interpreting the results. The results will be highly dependent on the bandwidth chosen for each curve. How will you explain to someone that the overlap is 0.5 if the bandwidth is 0.1, but the overlap is 0.6 is the overlap is 0.01? Reread the third-to-last paragraph in my article about the difference of density estimates.

Perhaps you should explain what statistical question you are trying to solve, and the community can recommend a test to use.

10 REPLIES 10

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you for your reply Rick! References were useful. There are some challenges. There is possibilitiy that KDE curves may intersect again at the tail ends (no?). It might be better to stick with the definition of overlappping coefficient where it is defined as the area under the minimum of the two KDE curves. I have difficulty writing a code that evaluate the minimum of the y1 (KDE for c=1) and y2 (KDE for c=2) at fixed values of x. Another challenge is that there are two vector of x values, x1 (c=1) and x2(c=2). How are these can be combined such that it can be used in the integration function?Then OVL can be calculated as

OVL = TrapIntegral(x, miny1y2);

The following code is useful in the sense that it provides some idea about the overlap, but it needs to be modified for reasons above. Meanwhile, I tried XSEC, EQSEC (user defined function by Wicklin) to find matched observations in the two densities (to find the intersection for more precise integration), but I was not succesful.

```
/* Overlapping Coefficient */
/* Simulate data */
data TwoGroups;
call STREAMINIT(1982);
do i=1 to 100;
c = 1;
x = RAND("NORMAL", 10, 3);
output;
end;
do i=101 to 200;
c = 2;
x = RAND("NORMAL", 15, 3);
output;
end;
run;
* estimate kernel densities;
proc univariate data=TwoGroups;
var x;
histogram / vscale=proportion
kernel(C=SJPI) outkernel=KernelTwoGroups;
by c;
run;
* find the area under the minimum of the two curves;
proc iml;
* read in kernel density estimates;
use KernelTwoGroups;
read all var {_value_} where(c=1) into x1;
read all var {_value_} where(c=2) into x2;
read all var {_density_} where(c=1) into y1;
read all var {_density_} where(c=2) into y2;
close KernelTwoGroups;
* define trapezoid integration function;
start TrapIntegral(x,y);
N = nrow(x);
dx = x[2:N] - x[1:N-1];
meanY = ( y[2:N] + y[1:N-1] )/2;
return( dx` * meanY );
finish;
* integrate the left and right side of the intersect;
idx = loc(x1 > 13);
right = TrapIntegral(x1[idx],y1[idx]);
idx = loc(x2 < 13);
left = TrapIntegral(x1[idx],y1[idx]);
print (left + right);
quit;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Switch from PROC UNIVARIATE to PROC KDE and use the GRIDL= and GRIDU= options on the UNIVAR statement to ensure that both KDEs are evaluated on a common set of grid points, as shown in the article "The difference of density estimates." Then it is easy to find min(y1,y2). In SAS/IML there is even the "min of two vectors" operator (><), so you can simply say

miny1y2 = y1 >< y2;

as explained in the article "Compute maximum and minimum values for rows and columns in SAS."

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks Rick! That was useful. While the code worked, and yielded an overlap of .48 (a different answer from previous integration, the area under left & right of the intersection of density curves, which was .64). I have encountered with the following warning

WARNING: Algorithm for Sheather-Jones Plug In did not converge; using maximum bandwidth value. WARNING: Algorithm for Sheather-Jones Plug In did not converge; using maximum bandwidth value.

I hope this does not affect the answer, if so, I hope there is a solution to it. The next challenge is to use a more precise integration algorithm such as Simpson's rule or QUAD call. The QUAD call requires a function. I am not sure how to integrate using the QUAD call in this case, or simpson's rule if it is easier. Attached is the code

```
/***************************
* Overlapping Coefficient *
***************************/
*simulate data with two groups;
data TwoGroups;
call STREAMINIT(1982);
do i=1 to 100;
c = 1;
x = RAND("NORMAL", 10, 3);
output;
end;
do i=101 to 200;
c = 2;
x = RAND("NORMAL", 15, 3);
output;
end;
run;
*transform TwoGroups data from long to wide format, to be used in PROC KDE,
create macro variables for lower and upper grid values as one std below the min of x,
and one std above the max of x;
proc sql;
create table inKDE as
select a1.x as x1, a2.x as x2
from TwoGroups a1, TwoGroups a2
where a1.c=1 and a2.c=2;
select MIN(MIN(a1.x), MIN(a2.x)) - STD(a1.x), MAX(MAX(a1.x), MAX(a2.x)) + STD(a2.x) into :gridl, :gridu
from TwoGroups a1, TwoGroups a2
where a1.c=1 and a2.c=2;
quit;
* estimate kernel density;
ods graphics on;
proc kde data=inKDE;
univar x1 x2 / plots=DENSITYOVERLAY
GRIDL=&gridl GRIDU=&gridu NGRID=401 METHOD=SJPI BWM=1
/*to justify use of SJPI cite http://www.stat.washington.edu/courses/stat527/s14/readings/Jones_etal_JASA_1996.pdf */
out=outKDE;
run;
ods graphics off;
*transform inKDE from long to wide format;
data inIML;
keep x1 x2 y1 y2;
merge outKDE(where=(var="x1") rename=(density=y1 value=x1))
outKDE(where=(var="x2") rename=(density=y2 value=x2));
run;
proc iml;
*read in kernel density estimates in wide format;
use inIML;
read all var _all_ into temp1[c=varNames];
close inIML;
*define trapezoid integration function;
start TrapIntegral(x,y);
N = nrow(x);
dx = x[2:N] - x[1:N-1];
meanY = ( y[2:N] + y[1:N-1] )/2;
return( dx` * meanY );
finish;
*find the min of the two density estimates, y1 and y2;
dataOVL = J(nrow(temp1), 2);
dataOVL[ ,1] = temp1[ ,1];
dataOVL[ ,2] = temp1[ ,2] >< temp1[ ,4];
* find the area under the MIN(f(x1), f(x2));
OVL = TrapIntegral(dataOVL[ ,1],dataOVL[ ,2]);
print OVL;
create dataOVL from dataOVL[c={"x" "miny1y2"}];
append from dataOVL;
close dataOVL;
quit;
* plot the overlap;
proc sgplot data=dataOVL;
series y=miny1y2 x=x;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can't use QUAD because there is not an analytical formula for the KDE curves.

You can look up the Simpson's rule algorithm on Wikipedia or some other source. I doubt it will make much difference. Maybe +/-0.01 from the trapezoidal algorithm. If you are concerned about precision, increase the NGRID= option on the UNIVAR statement of PROC KDE.

You have not said what you are using this for, but I would be very careful in interpreting the results. The results will be highly dependent on the bandwidth chosen for each curve. How will you explain to someone that the overlap is 0.5 if the bandwidth is 0.1, but the overlap is 0.6 is the overlap is 0.01? Reread the third-to-last paragraph in my article about the difference of density estimates.

Perhaps you should explain what statistical question you are trying to solve, and the community can recommend a test to use.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

*normal reference rule* method for the bandwidth for similiar simulations (in this example 101 grid with the *Sheather-Jones plug in* method for the bandwidth looked impractical). Variation of the overlap on the repetative samples may fluctuate beyond the range of .01 precision, and a more precise method may capture these fluctuations.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Sure, below is the R or S-plus code.

"Calculating the OVL involves estimation of two density functions evaluated at the same x values and then calculation of the overlap. The function ovl needs two input vectors of observations on the covariable for both groups (var1 and var0).We used the S‐Plus build‐in function density using the normal density rule bandwidth.nrd. For calculation of the overlap, we used Simpson’s rule on a grid of 101". Belister et al (2011, p. 1126).

Reference:

Belitser, S. V., Martens, E. P., Pestman, W. R., Groenwold, R. H., Boer, A., & Klungel, O. H. (2011). Measuring balance and model selection in propensity score methods. *Pharmacoepidemiology and drug safety*, *20*(11), 1115-1129.

.

# S-Plus Function to calculate the non-parametric overlapping coefficient with optional figure ovl <- function(group0, group1, plot=F) { wd1 <- bandwidth.nrd(group1) wd0 <- bandwidth.nrd(group0) from <- min(group1,group0) - 0.75 * mean(c(wd1,wd0)) to <- max(group1,group0) + 0.75 * mean(c(wd1,wd0))]]> d1 <- density(group1, n = 101, width=wd1, from=from, to=to) d0 <- density(group0, n = 101, width=wd0, from=from, to=to) dmin <- pmin(d1$y,d0$y) ovl <- ((d1$x[(n<-length(d1$x))]-d1$x[1])/(3*(n-1)))* (4*sum(dmin[seq(2,n,by=2)])+2*sum(dmin[seq(3,n-1,by=2)])+dmin[1]+dmin[n]) if(plot){ maxy <- max(d0$y, d1$y) minx <- min(d0$x) plot(d1, type="l", lty=1, ylim=c(0, maxy), ylab="Density", xlab="") lines(d0, lty=3) lines(d1$x, dmin, type="h") text(minx, maxy, " OVL =") text(minx+0.085*(max(d1$x)-minx), maxy, round(ovl,3)) } round(ovl,3) } # Example treated <- rnorm(100,10,3) untreated <- rnorm(100,15,5) ovl(untreated, treated, plot=T)

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thanks. BTW, here is the IML code for Simpson's Rule, in case you decide that you need it:

```
proc iml;
/* Simpson's Rule assumes an even number of intervals, which means
that nrow(x) should be odd. This function also assumes that the
values of x are evenly spaced, so that h = (b-a)/n */
start SimpsonIntegral(x,y);
N = nrow(x); /* odd number of points ==> even number of intervals */
if mod(N,2)=0 then abort "Not supported for an odd number of intervals.";
if N=3 then
wtsum = y[1] + 4*y[2] + y[N];
else do;
evens = do(2, N, 2);
odds = do(3, N-2, 2);
wtSum = y[1] + 4*sum(y[evens]) + 2*sum(y[odds]) + y[N];
end;
h = (x[N] - x[1]) / (N-1);
return( h/3 * wtSum );
finish;
/* Demonstrate that the method is exact for polynomials up to 3rd order */
x = do(0, 10, 2.5)`;
y = x##3;
A = SimpsonIntegral(x,y);
print A;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Fantastic! Thank you so much, Rick!

**Don't miss out on SAS Innovate - Register now for the FREE Livestream!**

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.