Proc logistic, how to get the observations with ties?

Regular Contributor
Posts: 187

Hi,

I built a logistic model, and about 24% of the pairs are tied. How can I identify the observations that are involved in ties, so that I can analyse them?

Grand Advisor
Posts: 9,444

Re: Proc logistic, how to get the observations with ties?

How do you define TIES? Do you mean observations that have the same value in all variables?

Regular Contributor
Posts: 187

Re: Proc logistic, how to get the observations with ties?

No, I am talking about the section of the PROC LOGISTIC output that reports concordant, discordant and tied pairs. These are calculated from pairs of predicted probabilities, one for an observation with target 1 and one for an observation with target 0.
Grand Advisor
Posts: 16,838

Re: Proc logistic, how to get the observations with ties?

  1. Get the scored data set with predicted probabilities (see the OUTPUT statement).
  2. Run PROC FREQ to generate a Ctable.
  3. Extract ties...
Regular Contributor
Posts: 187

Re: Proc logistic, how to get the observations with ties?

PROC FREQ to run a Ctable? Can you please explain that?
Trusted Advisor
Posts: 1,114

Re: Proc logistic, how to get the observations with ties?

Hi @munitech4u,

Here is an example. Let's take dataset REMISSION from the PROC LOGISTIC documentation as a basis.

/* Add an ID to identify observations */
data Remission;
set Remission;
id=_n_;
run; /* 27 obs. */

/* Run an arbitrary logistic regression, 
   write predicted probabilities to dataset PRED */
proc logistic data=Remission;
   model remiss(event='1')=li;
   output out=pred p=p;
run;

/* Create dataset TIES with "tied" pairs of IDs */
proc sql;
create table ties as
select a.id as id1, b.id as id2
from pred a, pred b
where a.id<b.id & a.remiss ne b.remiss & a.p=b.p;
quit; /* 5 obs. */
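
Since the question asks for observations rather than pairs, here is a minimal sketch of how the pairs in TIES could be turned into a list of the distinct observations involved. The dataset names TIED_IDS and TIED_FROM_PAIRS are hypothetical, chosen only for illustration; PRED and TIES are the datasets created above.

/* Sketch only: TIED_IDS and TIED_FROM_PAIRS are illustrative names */
proc sql;
/* Collect every ID that occurs in at least one tied pair;
   UNION (without ALL) removes duplicate IDs */
create table tied_ids as
select id1 as id from ties
union
select id2 as id from ties;

/* Pull the full records of these observations from PRED */
create table tied_from_pairs as
select d.*
from pred d, tied_ids t
where d.id=t.id
order by d.p, d.remiss;
quit;

Sorting by P and REMISS puts observations that share a predicted probability next to each other, which may make the tied groups easier to inspect.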

 

Alternatively, you could create a dataset with all relevant pairs: 

/* Create dataset PAIRS with all pairs of IDs considered in output table
   "Association of Predicted Probabilities and Observed Responses" */
proc sql;
create table pairs as
select a.id as id1, b.id as id2, a.p as p1, b.p as p2, a.remiss as r1, b.remiss as r2,
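       /* Refer to the source columns directly: aliases such as r1 or p1 defined in
          this SELECT cannot be reused here without the CALCULATED keyword */
       case when (a.remiss=1 & b.remiss=0 & a.p>b.p) | (a.remiss=0 & b.remiss=1 & a.p<b.p) then 'Concordant'
            when (a.remiss=1 & b.remiss=0 & a.p<b.p) | (a.remiss=0 & b.remiss=1 & a.p>b.p) then 'Discordant'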
            else 'Tied' end as assoc
from pred a, pred b
where a.id<b.id & a.remiss ne b.remiss;
quit; /* 162 obs. */

proc freq data=pairs;
tables assoc;
run;

 

Result: 

                                       Cumulative    Cumulative
assoc         Frequency     Percent     Frequency      Percent
---------------------------------------------------------------
Concordant         136       83.95           136        83.95
Discordant          21       12.96           157        96.91
Tied                 5        3.09           162       100.00

This corresponds to table "Association of Predicted Probabilities and Observed Responses" in Output 72.1.2 (see link above).
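
To cross-check these counts against what PROC LOGISTIC itself reports, the association table could be captured in a dataset. This is only a sketch: it assumes that the ODS table is named Association, and ASSOC_STATS is a hypothetical dataset name.

/* Sketch only: ASSOC_STATS is an illustrative dataset name */
ods output Association=assoc_stats;
proc logistic data=Remission;
   model remiss(event='1')=li;
run;

/* Inspect the captured percentages and number of pairs */
proc print data=assoc_stats;
run;

If the percentages differ on your own data, one possible reason is that PROC LOGISTIC can bin the predicted probabilities when computing these statistics (see the BINWIDTH= option of the MODEL statement), whereas the SQL step above compares P for exact equality.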

 


Regular Contributor
Posts: 187

Re: Proc logistic, how to get the observations with ties?

Thanks, but do you recommend running it on a dataset as large as 4 million?
Trusted Advisor
Posts: 1,114

Re: Proc logistic, how to get the observations with ties?

munitech4u wrote:
Thanks, but do you recommend running it on a dataset as large as 4 million?

No, given this new information I would choose a different approach:

/* "Blow up" the test dataset and add an ID to identify observations */
data Remission;
set Remission;
do i=1 to 148149;
  id=(_n_-1)*148149+i;
  output;
end;
drop i;
run; /* 4000023 obs. */

/* Run an arbitrary logistic regression, 
   write predicted probabilities to dataset PRED */
proc logistic data=Remission;
model remiss(event='1')=li;
output out=pred p=p;
run;

/* Select "tied" observations */
proc sql;
create table tied_obs(drop=_level_) as
select *
from pred
group by p
having count(distinct remiss)>1;
quit; /* 1185192 obs. */

This has the additional advantage that you have the other variables from dataset PRED in dataset TIED_OBS, so you can start your analysis immediately.
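
If you also want to know how many tied pairs each group contributes, without building the pairs, a summary per distinct value of P might be enough. In a group with n1 events and n0 nonevents, every event/nonevent combination is a tied pair, so the group contributes n1*n0 pairs. Here is a minimal sketch, assuming REMISS is coded 0/1 as in the example; TIE_GROUPS and its column names are hypothetical.

/* Sketch only: TIE_GROUPS and its column names are illustrative */
proc sql;
/* One row per predicted probability shared by events and nonevents */
create table tie_groups as
select p,
       sum(remiss=1) as n_event,
       sum(remiss=0) as n_nonevent,
       sum(remiss=1)*sum(remiss=0) as tied_pairs
from pred
group by p
having calculated tied_pairs>0
order by tied_pairs desc;

/* Total number of tied pairs over all groups */
select sum(tied_pairs) as total_tied_pairs format=comma20.
from tie_groups;
quit;

The groups at the top of TIE_GROUPS account for most of the tied pairs, so they are probably the best place to start the analysis.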

 

Edit: Simplified HAVING condition: count(*)>1 was redundant.
