Calcite | Level 5

## Area under a pdf from a certain cutoff value over multiple simulations and output as data set

Dear SAS/IML support team,

I am using SAS 9.3.

I am conducting simulations to compare methods for cutoff estimation. One is kernel density estimation where I use the intersection of 2 pdfs as a cutoff. For these I want to estimate sensitivity and specificity or area under the curve.

I am unable to make the code run over e.g. a 1000 simulations. Thus, I need to construct some kind of loop or by statement around it, so it gives me the output "by sampledID".

The second issue I am facing is that each sample has a different cutoff, therefore I have a second data set with the cutoffs by sampleID, which at the moment I need to enter manually.

To sum it up, I have 2 data sets, 1 with sampleID 1 to x, the values and densities for each sample from the proc kde output, and a 2. data set with the cutoffs and sampleID.
Is there any possibility to manage that with SAS?
I would like to create an output data set with sensitivity/specificity by sampleID.

I would appreciate any hint you can give me or whether you think it is even possible to conduct.

Kind regards,
Tim

1 ACCEPTED SOLUTION

Accepted Solutions
SAS Super FREQ

## Re: Area under a pdf from a certain cutoff value over multiple simulations and output as data set

To answer your question, yes it is possible. The main idea is to read both data sets into IML at the same time.  You can then loop over each sampleID. For each sampleID you get the cutoff value and the subset of simulated values that correspond to the current sampleiD.  Then compute the TN and TP values for that sample, using that cutoff.

The following program shows you one way structure the program:

``````proc iml;
start TrapIntegral(x,y);
N = nrow(x);
dx    =   x[2:N] - x[1:N-1];
meanY = ( y[2:N] + y[1:N-1] )/2;
return( dx` * meanY );
finish;

/* read in cutoff limits */
use Res_kernel;
close Res_kernel;

/** read in kernel density estimate **/
use kernel_final;
read all var {marker0 density0 marker1 density1};
close kernel_final;

numSamples = nrow(cutoff_kernel);
TN = j(numSamples, 1);              /* allocate array to hold results */
TP = j(numSamples, 1);              /* allocate array to hold results */

do i = 1 to nrow(cutoff_kernel);    /* for each sampleID       */
cutoff = cutoff_kernel[i];       /* cutoff value for this sampleID */
firstObs = (i-1)*1000 + 1;       /* first obs in sample     */
lastObs = i*1000;                /* last obs in sample      */
obs = firstObs:lastObs;          /* subset for sampleID = i */
/* true negatives */
x = marker0[obs];
y = density0[obs];
idx = loc(x < cutoff);
TN[i] = TrapIntegral(x[idx],y[idx]);

/* true positives */
x = marker1[obs];
y = density1[obs];
idx = loc(x >= cutoff);
TP[i] = TrapIntegral(x[idx],y[idx]);
end;

print SampleID TN TP;``````

Incidentally, you should add ODS GRAPHICS OFF to the top of your program. That will prevent unnecessary plots from being created when you run the program interactively. (In SAS 9.4, PROC KDE creates a histogram for each BY group.)

You might want to read the article "The area under a density estimate curve" for a simpler way to compute the area. The simpler way is to estimate the empirical CDF directly from the data, rather than integrate a density estimate.

Also, I notice that you are using the difference between two density estimates to compute the cutoff value. I didn't study your code closely, but I wanted to point out that in general, two KDEs will intersect in multiple places. I'm not sure whether that is relevant to you.  In general, finding the difference between density estimates involves thinking about some subtle statistical issues. You might want to read the article "The difference of density estimates: When does it make sense?"

SAS Super FREQ

## Re: Area under a pdf from a certain cutoff value over multiple simulations and output as data set

To answer your question, yes it is possible. The main idea is to read both data sets into IML at the same time.  You can then loop over each sampleID. For each sampleID you get the cutoff value and the subset of simulated values that correspond to the current sampleiD.  Then compute the TN and TP values for that sample, using that cutoff.

The following program shows you one way structure the program:

``````proc iml;
start TrapIntegral(x,y);
N = nrow(x);
dx    =   x[2:N] - x[1:N-1];
meanY = ( y[2:N] + y[1:N-1] )/2;
return( dx` * meanY );
finish;

/* read in cutoff limits */
use Res_kernel;
close Res_kernel;

/** read in kernel density estimate **/
use kernel_final;
read all var {marker0 density0 marker1 density1};
close kernel_final;

numSamples = nrow(cutoff_kernel);
TN = j(numSamples, 1);              /* allocate array to hold results */
TP = j(numSamples, 1);              /* allocate array to hold results */

do i = 1 to nrow(cutoff_kernel);    /* for each sampleID       */
cutoff = cutoff_kernel[i];       /* cutoff value for this sampleID */
firstObs = (i-1)*1000 + 1;       /* first obs in sample     */
lastObs = i*1000;                /* last obs in sample      */
obs = firstObs:lastObs;          /* subset for sampleID = i */
/* true negatives */
x = marker0[obs];
y = density0[obs];
idx = loc(x < cutoff);
TN[i] = TrapIntegral(x[idx],y[idx]);

/* true positives */
x = marker1[obs];
y = density1[obs];
idx = loc(x >= cutoff);
TP[i] = TrapIntegral(x[idx],y[idx]);
end;

print SampleID TN TP;``````

Incidentally, you should add ODS GRAPHICS OFF to the top of your program. That will prevent unnecessary plots from being created when you run the program interactively. (In SAS 9.4, PROC KDE creates a histogram for each BY group.)

You might want to read the article "The area under a density estimate curve" for a simpler way to compute the area. The simpler way is to estimate the empirical CDF directly from the data, rather than integrate a density estimate.

Also, I notice that you are using the difference between two density estimates to compute the cutoff value. I didn't study your code closely, but I wanted to point out that in general, two KDEs will intersect in multiple places. I'm not sure whether that is relevant to you.  In general, finding the difference between density estimates involves thinking about some subtle statistical issues. You might want to read the article "The difference of density estimates: When does it make sense?"

From The DO Loop