Re: Proc logistic, how to get the observations with ties?

munitech4u · Posted 03-17-2016 03:49 AM

Hi,

I built a logistic model, and about 24% of the pairs are ties. How can I identify the observations involved in ties so that I can analyse them?

Ksharp · Posted 03-17-2016 05:33 AM

How do you define TIES? Do the observations have the same value in all the variables?

munitech4u · Posted 03-17-2016 06:58 AM

No, I am talking about the section of the PROC LOGISTIC output that reports concordant, discordant and tied pairs. These are calculated from pairs of predicted probabilities, where one observation has target 1 and the other has target 0.

Reeza · Posted 03-17-2016 07:32 AM

Get the scored data set with predicted probabilities. Look at the OUTPUT statement.
Run a proc freq to generate Ctable
Extract ties...

munitech4u · Posted 03-17-2016 09:10 AM

PROC FREQ to run CTABLE? Can you please explain that?

FreelanceReinh · Posted 03-17-2016 08:23 AM

Hi @munitech4u,

Here is an example:

Let's take dataset REMISSION from the PROC LOGISTIC documentation as a basis.

/* Add an ID to identify observations */
data Remission;
set Remission;
id=_n_;
run; /* 27 obs. */

/* Run an arbitrary logistic regression, 
   write predicted probabilities to dataset PRED */
proc logistic data=Remission;
   model remiss(event='1')=li;
   output out=pred p=p;
run;

/* Create dataset TIES with "tied" pairs of IDs */
proc sql;
create table ties as
select a.id as id1, b.id as id2
from pred a, pred b
where a.id<b.id & a.remiss ne b.remiss & a.p=b.p;
quit; /* 5 obs. */
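
If the goal is to look at the tied observations themselves rather than at pairs of IDs, the pairs in TIES can be mapped back to PRED. The following is only a sketch building on the step above; the dataset names TIED_IDS and TIED_OBS_SMALL are made up for illustration.

```sas
/* Sketch: list each observation that occurs in at least one tied pair.
   TIED_IDS and TIED_OBS_SMALL are hypothetical names. */
proc sql;
/* UNION removes duplicates, so each ID appears only once */
create table tied_ids as
select id1 as id from ties
union
select id2 as id from ties;

/* Bring back all variables of the tied observations from PRED */
create table tied_obs_small as
select x.*
from pred x inner join tied_ids t
on x.id=t.id
order by x.p, x.remiss;
quit;
```

Sorting by p and remiss puts observations that share a predicted probability next to each other, which should make the tied groups easy to inspect.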

Alternatively, you could create a dataset with all relevant pairs:

/* Create dataset PAIRS with all pairs of IDs considered in output table
   "Association of Predicted Probabilities and Observed Responses" */
proc sql;
create table pairs as
select a.id as id1, b.id as id2, a.p as p1, b.p as p2, a.remiss as r1, b.remiss as r2,
       case when r1=1 & r2=0 & p1>p2 | r1=0 & r2=1 & p1<p2 then 'Concordant'
            when r1=1 & r2=0 & p1<p2 | r1=0 & r2=1 & p1>p2 then 'Discordant'
            else 'Tied' end as assoc
from pred a, pred b
where a.id<b.id & a.remiss ne b.remiss;
quit; /* 162 obs. */

proc freq data=pairs;
tables assoc;
run;

Result:

                                       Cumulative    Cumulative
assoc         Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
Concordant         136       83.95           136        83.95
Discordant          21       12.96           157        96.91
Tied                 5        3.09           162       100.00

This corresponds to table "Association of Predicted Probabilities and Observed Responses" in Output 72.1.2 (see link above).
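
As a cross-check, the summary measures reported in the same table can be derived from the PAIRS dataset, using the usual definitions c = (nc + 0.5*nt)/npairs and Somers' D = (nc - nd)/npairs. This is a sketch only; the column names nc, nd, nt, npairs, c and somers_d are chosen here for illustration.

```sas
/* Sketch: derive c and Somers' D from the pair counts in PAIRS.
   All column names in this query are hypothetical. */
proc sql;
select sum(assoc='Concordant') as nc,
       sum(assoc='Discordant') as nd,
       sum(assoc='Tied')       as nt,
       count(*)                as npairs,
       (calculated nc + 0.5*calculated nt)/calculated npairs as c format=6.3,
       (calculated nc - calculated nd)/calculated npairs     as somers_d format=6.3
from pairs;
quit;
```

With the counts shown above (136, 21 and 5 out of 162 pairs), this works out to c = 138.5/162, about 0.855, and Somers' D = 115/162, about 0.710.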


munitech4u · Posted 03-17-2016 09:09 AM

Thanks, but do you recommend running it on a dataset as large as 4 million?

FreelanceReinh · Posted 03-17-2016 09:52 AM

@munitech4u wrote:
Thanks, but do you recommend running it on a dataset as large as 4 million?

No, given this new information I would choose a different approach:

/* "Blow up" the test dataset and add an ID to identify observations */
data Remission;
set Remission;
do i=1 to 148149;
  id=(_n_-1)*148149+i;
  output;
end;
drop i;
run; /* 4000023 obs. */

/* Run an arbitrary logistic regression, 
   write predicted probabilities to dataset PRED */
proc logistic data=Remission;
   model remiss(event='1')=li;
   output out=pred p=p;
run;

/* Select "tied" observations */
proc sql;
create table tied_obs(drop=_level_) as
select *
from pred
group by p
having count(distinct remiss)>1;
quit; /* 1185192 obs. */

This has the additional advantage that you have the other variables from dataset PRED in dataset TIED_OBS, so you can start your analysis immediately.

Edit: Simplified HAVING condition: count(*)>1 was redundant.
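
To see how many tied pairs each group of equal predicted probabilities contributes without building the pairs at all, the counts can be taken per distinct value of p: a group with n1 events and n0 non-events contributes n1*n0 tied pairs. The following is a sketch under the same assumptions as the code above (binary REMISS, exact equality of p); TIE_COUNTS and its column names are made up for illustration.

```sas
/* Sketch: number of tied pairs per distinct predicted probability.
   TIE_COUNTS, n_event, n_nonevent and tied_pairs are hypothetical names. */
proc sql;
create table tie_counts as
select p,
       sum(remiss=1) as n_event,      /* observations with remiss=1 at this p */
       sum(remiss=0) as n_nonevent,   /* observations with remiss=0 at this p */
       sum(remiss=1)*sum(remiss=0) as tied_pairs
from pred
group by p
having count(distinct remiss)>1
order by tied_pairs desc;
quit;
```

The sum of tied_pairs over all rows, divided by the product of the total numbers of events and non-events, should then give the proportion of tied pairs. Groups at the top of TIE_COUNTS are the ones contributing most to that percentage.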