Jerrynetwork
Obsidian | Level 7

I found a paper about logistic regression with small sample sizes. Here is the link to the paper:

Rare Events or Non-Convergence with a Binary Outcome?

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4654-2020.pdf 

Does anyone know where I can find the datasets SPARSE and SPREAD that were used in the examples? Thank you!

 

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Jerrynetwork,

 

As @Ksharp has already suggested, it's easy to reproduce the (formatted) categorical variables of the two datasets by using the PROC FREQ outputs shown in the paper.

 

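proc format;
value yesno
1='Yes'
2='No';
run;

/* Cell counts are read in the order Yes/New, Yes/Old, No/New, No/Old. */
/* _n_ is an automatic variable, so it is not written to the dataset.  */
data sparse;
do complication=1, 2;
  do procedure='New', 'Old';
    input _n_ @@;      /* read the next cell count */
    do _n_=1 to _n_;   /* stop value is evaluated once: one observation per count */
      output;
    end;
  end;
end;
format complication yesno.;
cards;
0 9 30 191
;

/* Same technique; counts in the order event=1/group=1, 1/2, 2/1, 2/2. */
data spread;
do event=1, 2;
  do group=1, 2;
    input _n_ @@;
    do _n_=1 to _n_;
      output;
    end;
  end;
end;
format event yesno.;
cards;
4 15 11 195
;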

With these datasets you can reproduce tables 1 to 7, 10 and 11 of the paper. For table 4, add the option order=formatted to the PROC FREQ statement.
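A minimal sketch of such a PROC FREQ call for SPARSE is shown here. The TABLES request is my assumption of a plain two-way crosstab; the exact statements and options used in the paper may differ.

```sas
/* Sketch only: the TABLES request is an assumption, not copied from the paper. */
proc freq data=sparse order=formatted;
  tables procedure*complication;
run;
```

With order=formatted the levels are sorted by their formatted values, so 'No' is listed before 'Yes'. Without the option they appear in the order of the internal values 1 and 2.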

 

So, only tables 8 and 9 (involving variable AGE) remain. I'm sure other people have worked on this type of problem before -- creating data from given summary statistics -- so there must be more advanced techniques for this than I'm aware of.

 

From table 8 we get N=9, Mean=43.44 and Std=5.81 for AGE in the subgroup with vvalue(complication)='Yes'. Age values in clinical studies are mostly integers. Together with N=9 and the decimals .44 of the mean, this suggests that the sum of the nine age values is 9*43.4444444...=391. The formula Var(X)=E(X²)-E(X)², applied to the discrete uniform distribution on the nine age values x1, ..., x9 (whose variance is (N-1)/N=8/9 times the squared sample standard deviation), yields after multiplying by N²=81:

9*uss(of x1-x9) = sum(of x1-x9)**2 + 8*9*std(of x1-x9)**2

Given the inequality 5.805<=std(of x1-x9)<5.815 from the rounded Std value of 5.81, we conclude that

155308 <= 9*uss(of x1-x9) <= 155315

since 9*uss(of x1-x9) is an integer. But uss(of x1-x9) itself is an integer, too, and only one of the integers 155308, ..., 155315 is divisible by 9, namely 155313, hence:

uss(of x1-x9)=17257  (and std(of x1-x9)=sqrt(304)/3=5.811865...)
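The arithmetic above can be checked with a short DATA step. This is only a sketch: the search window 380 to 400 for the sum and the variable names (total, lo, hi, k, ss, sd) are my own choices for illustration.

```sas
data _null_;
  n = 9;

  /* Integer sums whose mean rounds to the reported 43.44 */
  do total = 380 to 400;                 /* illustrative search window */
    if abs(total/n - 43.44) < 0.005 then put 'Candidate sum: ' total;
  end;

  /* Bounds on n*USS implied by 5.805 <= Std < 5.815, using sum = 391 */
  total = 391;
  lo = total**2 + n*(n-1)*5.805**2;
  hi = total**2 + n*(n-1)*5.815**2;

  do k = ceil(lo) to floor(hi);
    if mod(k, n) = 0 then do;            /* USS itself must be an integer */
      ss = k / n;
      sd = sqrt((ss - total**2/n) / (n-1));
      put k= ss= sd= 8.6;
    end;
  end;
run;
```

The log should show 391 as the only candidate sum and a single line with k=155313, ss=17257 and sd=5.811865, in line with the values derived above.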

Number theorists could certainly tell us more about the ways 17257 can be written as a sum of 9 squares ... and even with the constraint sum(of x1-x9)=391 there will be a number of solutions.

 

By arranging and shifting integers with sum 391, centered around the rounded mean value 43, I found this particular solution for x1, ..., x9 without letting the computer search through large numbers of combinations:

36 38 39 41 43 46 46 47 55
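To confirm that these nine values match the summary statistics, a sketch like the following can be used (the variable names total, ss, avg and sd are arbitrary):

```sas
data _null_;
  array x[9] (36 38 39 41 43 46 46 47 55);  /* creates x1-x9 with these initial values */
  total = sum(of x1-x9);    /* sum of the nine ages           */
  ss    = uss(of x1-x9);    /* uncorrected sum of squares     */
  avg   = mean(of x1-x9);
  sd    = std(of x1-x9);    /* sample standard deviation      */
  put total= ss= avg= 6.2 sd= 6.2;
run;
```

The log should show total=391, ss=17257, avg=43.44 and sd=5.81, i.e. the N=9 row of table 8 as quoted above.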

 

If you (unlike me) have SAS/OR, I think you can find all possible solutions for the above nine age values (assuming a reasonable age range, say, 18 - 90), tackle the second subgroup (N=221) in a similar way and ideally take table 9 of the paper into account in the optimization. Good luck and thanks for asking this inspiring question!


5 REPLIES
Ksharp
Super User

Write an e-mail to the author?

Ksharp
Super User
Or you could reproduce it from the PROC FREQ results in the paper.
FreelanceReinh
Jade | Level 19

@Ksharp wrote:
Or you could reproduce it from the PROC FREQ results in the paper.

Exactly. But the challenging (and interesting!) part is the AGE distribution in dataset SPARSE. 🙂 I'm working on that, and I think there is a good chance of finding solutions for the subgroup with vvalue(Complication)='Yes' (N=9).

Jerrynetwork
Obsidian | Level 7

Thank you!

