Solved: Overlap of the two kernel densities

MetinBulus · Posted 09-11-2015 10:41 PM

Hi everyone, I need your help on some numerical integration work using Simpon's rule - with 101 grid. In the following code I simulated the data, estimated the kernel densities for variable x by group c, and visualized. I need to calculate the overlap between the the two krenel densities using numerical integration with Simpson's rule - with 101 grid. The overlap is calculated by integrating from -inf to +inf over the min of two curves MIN [ f(x_c=1), f(x_c=2) ].

/* Overlapping Coefficient */
/* Simulate data */
data TwoGroups;
	call STREAMINIT(1982);
	do i=1 to 100;
		c = 1;
		x = RAND("NORMAL", 10, 3);
		output;
	end;
	do i=101 to 200;
		c = 2;
		x = RAND("NORMAL", 15, 3);
		output;
	end;
run;

/* Estimate Kernel Density */
proc kde data=TwoGroups; 
   univar x / out=KernelTwoGroups;
   by c;
run;

/* Plot Kernel Density */
proc sgplot data=KernelTwoGroups;
	series y=density x=value / group=c;
run;

Thank you for your help,

Metin

Rick_SAS · Posted 09-14-2015 06:06 AM

You can't use QUAD because there is not an analytical formula for the KDE curves.

You can look up the Simpson's rule algorithm on Wikipedia or some other source. I doubt it will make much difference. Maybe +/-0.01 from the trapezoidal algorithm. If you are concerned about precision, increase the NGRID= option on the UNIVAR statement of PROC KDE.

You have not said what you are using this for, but I would be very careful in interpreting the results. The results will be highly dependent on the bandwidth chosen for each curve. How will you explain to someone that the overlap is 0.5 if the bandwidth is 0.1, but the overlap is 0.6 is the overlap is 0.01? Reread the third-to-last paragraph in my article about the difference of density estimates.

Perhaps you should explain what statistical question you are trying to solve, and the community can recommend a test to use.

View solution in original post

Rick_SAS · Posted 09-12-2015 07:49 PM

I show how to integrate a KDE in the article "The area under a density estimate curve." (You can skip the third section unless you are still running SAS 9.2.) In this case, just perform two integrations. The first is the area under the c=2 curve from the beginning (about x=7) until the KDE's intersect (near 13). The other is the area under the c=1 curve from that intersection point until the end of the curve.

MetinBulus · Posted 09-13-2015 12:42 AM

Thank you for your reply Rick! References were useful. There are some challenges. There is possibilitiy that KDE curves may intersect again at the tail ends (no?). It might be better to stick with the definition of overlappping coefficient where it is defined as the area under the minimum of the two KDE curves. I have difficulty writing a code that evaluate the minimum of the y1 (KDE for c=1) and y2 (KDE for c=2) at fixed values of x. Another challenge is that there are two vector of x values, x1 (c=1) and x2(c=2). How are these can be combined such that it can be used in the integration function?Then OVL can be calculated as

OVL = TrapIntegral(x, miny1y2);

The following code is useful in the sense that it provides some idea about the overlap, but it needs to be modified for reasons above. Meanwhile, I tried XSEC, EQSEC (user defined function by Wicklin) to find matched observations in the two densities (to find the intersection for more precise integration), but I was not succesful.

/* Overlapping Coefficient */
/* Simulate data */
data TwoGroups;
	call STREAMINIT(1982);
	do i=1 to 100;
		c = 1;
		x = RAND("NORMAL", 10, 3);
		output;
	end;
	do i=101 to 200;
		c = 2;
		x = RAND("NORMAL", 15, 3);
		output;
	end;
run;

* estimate kernel densities;
proc univariate data=TwoGroups;
	var x;
	histogram / vscale=proportion 
    kernel(C=SJPI) outkernel=KernelTwoGroups;
    by c;
run;

* find the area under the minimum of the two curves;
proc iml;
	* read in kernel density estimates;
	use KernelTwoGroups;
	read all var {_value_} where(c=1) into x1;
	read all var {_value_} where(c=2) into x2;
	read all var {_density_} where(c=1) into y1;
	read all var {_density_} where(c=2) into y2;
	close KernelTwoGroups;
	
	* define trapezoid integration function;
	start TrapIntegral(x,y);
	   N = nrow(x);
	   dx    =   x[2:N] - x[1:N-1];
	   meanY = ( y[2:N] + y[1:N-1] )/2;
	   return( dx` * meanY );
	finish;
	
	* integrate the left and right side of the intersect;
	idx = loc(x1 > 13);
	right = TrapIntegral(x1[idx],y1[idx]);
	idx = loc(x2 < 13);
	left = TrapIntegral(x1[idx],y1[idx]);
	print (left + right);
quit;

Rick_SAS · Posted 09-13-2015 07:03 AM

Switch from PROC UNIVARIATE to PROC KDE and use the GRIDL= and GRIDU= options on the UNIVAR statement to ensure that both KDEs are evaluated on a common set of grid points, as shown in the article "The difference of density estimates." Then it is easy to find min(y1,y2). In SAS/IML there is even the "min of two vectors" operator (><), so you can simply say

miny1y2 = y1 >< y2;

as explained in the article "Compute maximum and minimum values for rows and columns in SAS."

MetinBulus · Posted 09-13-2015 11:08 PM

Thanks Rick! That was useful. While the code worked, and yielded an overlap of .48 (a different answer from previous integration, the area under left & right of the intersection of density curves, which was .64). I have encountered with the following warning

 WARNING: Algorithm for Sheather-Jones Plug In did not converge; using maximum bandwidth value.
 WARNING: Algorithm for Sheather-Jones Plug In did not converge; using maximum bandwidth value.

I hope this does not affect the answer, if so, I hope there is a solution to it. The next challenge is to use a more precise integration algorithm such as Simpson's rule or QUAD call. The QUAD call requires a function. I am not sure how to integrate using the QUAD call in this case, or simpson's rule if it is easier. Attached is the code

/***************************
 * Overlapping Coefficient *
 ***************************/
 
*simulate data with two groups;
data TwoGroups;
	call STREAMINIT(1982);
	do i=1 to 100;
		c = 1;
		x = RAND("NORMAL", 10, 3);
		output;
	end;
	do i=101 to 200;
		c = 2;
		x = RAND("NORMAL", 15, 3);
		output;
	end;
run;

*transform TwoGroups data from long to wide format, to be used in PROC KDE,
 create macro variables for lower and upper grid values as one std below the min of x,
 and one std above the max of x;
proc sql;
	create table inKDE as
		select a1.x as x1, a2.x as x2
			from TwoGroups a1, TwoGroups a2
			where a1.c=1 and a2.c=2;
		select MIN(MIN(a1.x), MIN(a2.x)) - STD(a1.x), MAX(MAX(a1.x), MAX(a2.x)) + STD(a2.x) into :gridl, :gridu
			from TwoGroups a1, TwoGroups a2
			where a1.c=1 and a2.c=2;
quit;

* estimate kernel density;
ods graphics on;
proc kde data=inKDE; 
   univar x1 x2 / plots=DENSITYOVERLAY 
   		GRIDL=&gridl  GRIDU=&gridu NGRID=401 METHOD=SJPI BWM=1
   		/*to justify use of SJPI cite http://www.stat.washington.edu/courses/stat527/s14/readings/Jones_etal_JASA_1996.pdf */
   		out=outKDE;
run;
ods graphics off;

*transform inKDE from long to wide format;
data inIML;
 keep x1 x2 y1 y2;
 merge outKDE(where=(var="x1") rename=(density=y1 value=x1))
       outKDE(where=(var="x2") rename=(density=y2 value=x2));
run;

proc iml;
	*read in kernel density estimates in wide format;
	use inIML;
	read all var _all_ into temp1[c=varNames];
	close inIML;
	*define trapezoid integration function;
	start TrapIntegral(x,y);
	   N = nrow(x);
	   dx    =   x[2:N] - x[1:N-1];
	   meanY = ( y[2:N] + y[1:N-1] )/2;
	   return( dx` * meanY );
	finish;
	*find the min of the two density estimates, y1 and y2;
	dataOVL = J(nrow(temp1), 2); 
	dataOVL[ ,1] = temp1[ ,1];
	dataOVL[ ,2] = temp1[ ,2] >< temp1[ ,4];
	* find the area under the MIN(f(x1), f(x2));
	OVL = TrapIntegral(dataOVL[ ,1],dataOVL[ ,2]);
	print OVL;
	create dataOVL from dataOVL[c={"x" "miny1y2"}];
	append from dataOVL;
	close dataOVL;
quit;
* plot the overlap;
proc sgplot data=dataOVL;
	series y=miny1y2 x=x;
run;

Rick_SAS · Posted 09-14-2015 06:06 AM

You can't use QUAD because there is not an analytical formula for the KDE curves.

You can look up the Simpson's rule algorithm on Wikipedia or some other source. I doubt it will make much difference. Maybe +/-0.01 from the trapezoidal algorithm. If you are concerned about precision, increase the NGRID= option on the UNIVAR statement of PROC KDE.

You have not said what you are using this for, but I would be very careful in interpreting the results. The results will be highly dependent on the bandwidth chosen for each curve. How will you explain to someone that the overlap is 0.5 if the bandwidth is 0.1, but the overlap is 0.6 is the overlap is 0.01? Reread the third-to-last paragraph in my article about the difference of density estimates.

Perhaps you should explain what statistical question you are trying to solve, and the community can recommend a test to use.

MetinBulus · Posted 09-14-2015 12:31 PM

Thank you Rick. I am conducting a quasi-experimental simulation where I use overlapping coefficient as a measure of balance on the baseline covariates. Most pople in the field used Simpson's rule using 101 grid, and normal reference rule method for the bandwidth for similiar simulations (in this example 101 grid with the Sheather-Jones plug in method for the bandwidth looked impractical). Variation of the overlap on the repetative samples may fluctuate beyond the range of .01 precision, and a more precise method may capture these fluctuations.

Rick_SAS · Posted 09-14-2015 01:24 PM

Interesting information. Can you supply a reference to a paper in a referreed journal or a textbook that recommends this method?

MetinBulus · Posted 09-14-2015 03:00 PM

Sure, below is the R or S-plus code.

"Calculating the OVL involves estimation of two density functions evaluated at the same x values and then calculation of the overlap. The function ovl needs two input vectors of observations on the covariable for both groups (var1 and var0).We used the S‐Plus build‐in function density using the normal density rule bandwidth.nrd. For calculation of the overlap, we used Simpson’s rule on a grid of 101". Belister et al (2011, p. 1126).

Reference:

Belitser, S. V., Martens, E. P., Pestman, W. R., Groenwold, R. H., Boer, A., & Klungel, O. H. (2011). Measuring balance and model selection in propensity score methods. Pharmacoepidemiology and drug safety, 20(11), 1115-1129.

.

# S-Plus Function to calculate the non-parametric overlapping coefficient with optional
figure
ovl <- function(group0, group1, plot=F) {
wd1 <- bandwidth.nrd(group1)
wd0 <- bandwidth.nrd(group0)
from <- min(group1,group0) - 0.75 * mean(c(wd1,wd0))
to <- max(group1,group0) + 0.75 * mean(c(wd1,wd0))]]>
d1 <- density(group1, n = 101, width=wd1, from=from, to=to)
d0 <- density(group0, n = 101, width=wd0, from=from, to=to)
dmin <- pmin(d1$y,d0$y)
ovl <- ((d1$x[(n<-length(d1$x))]-d1$x[1])/(3*(n-1)))*
(4*sum(dmin[seq(2,n,by=2)])+2*sum(dmin[seq(3,n-1,by=2)])+dmin[1]+dmin[n])
if(plot){
maxy <- max(d0$y, d1$y)
minx <- min(d0$x)
plot(d1, type="l", lty=1, ylim=c(0, maxy), ylab="Density", xlab="")
lines(d0, lty=3)
lines(d1$x, dmin, type="h")
text(minx, maxy, " OVL =")
text(minx+0.085*(max(d1$x)-minx), maxy, round(ovl,3))
}
round(ovl,3)
}
# Example
treated <- rnorm(100,10,3)
untreated <- rnorm(100,15,5)
ovl(untreated, treated, plot=T)

Rick_SAS · Posted 09-14-2015 03:18 PM

Thanks. BTW, here is the IML code for Simpson's Rule, in case you decide that you need it:

proc iml;
/* Simpson's Rule assumes an even number of intervals, which means
   that nrow(x) should be odd. This function also assumes that the 
   values of x are evenly spaced, so that h = (b-a)/n */
start SimpsonIntegral(x,y);
   N = nrow(x);    /* odd number of points ==> even number of intervals */
   if mod(N,2)=0 then abort "Not supported for an odd number of intervals.";
   if N=3 then 
      wtsum = y[1] + 4*y[2] + y[N];
   else do;
      evens = do(2, N, 2);
      odds  = do(3, N-2, 2);
      wtSum = y[1] + 4*sum(y[evens]) + 2*sum(y[odds]) + y[N];
   end;
   h = (x[N] - x[1]) / (N-1);
   return( h/3 * wtSum );
finish;
 
/* Demonstrate that the method is exact for polynomials up to 3rd order */
x = do(0, 10, 2.5)`; 
y = x##3;
A = SimpsonIntegral(x,y);
print A;

MetinBulus · Posted 09-14-2015 03:31 PM

Fantastic! Thank you so much, Rick!

Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Re: Overlap of the two kernel densities

Catch up on SAS Innovate 2026