I found a paper about logistic regression with small sample sizes. Here are the title and the link:
Rare Events or Non-Convergence with a Binary Outcome?
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/4654-2020.pdf
Does anyone know where I can find the datasets SPARSE and SPREAD that were used in the examples? Thank you!
Hello @Jerrynetwork,
As @Ksharp has already suggested, it's easy to reproduce the (formatted) categorical variables of the two datasets by using the PROC FREQ outputs shown in the paper.
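proc format;
  value yesno
    1='Yes'
    2='No';
run;
/* One observation per patient, expanded from the four cell counts */
data sparse;
  do complication=1, 2;
    do procedure='New', 'Old';
      input _n_ @@;        /* read the next cell count */
      do _n_=1 to _n_;     /* write that many identical observations */
        output;
      end;
    end;
  end;
  format complication yesno.;
  cards;
0 9 30 191
;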
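/* Same technique for the second dataset */
data spread;
  do event=1, 2;
    do group=1, 2;
      input _n_ @@;        /* read the next cell count */
      do _n_=1 to _n_;     /* write that many identical observations */
        output;
      end;
    end;
  end;
  format event yesno.;
  cards;
4 15 11 195
;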
With these datasets you can reproduce tables 1 - 7, 10 and 11 of the paper. For table 4 add the option order=formatted to the PROC FREQ statement.
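As a minimal sketch of such a call (the TABLES request is an assumption; the paper's own statements and options may differ), a crosstabulation of the reconstructed SPARSE data with ORDER=FORMATTED could look like this:

```sas
/* Sketch only: the TABLES request is assumed, not copied from the paper. */
/* ORDER=FORMATTED sorts levels by formatted value, so 'No' precedes 'Yes'. */
proc freq data=sparse order=formatted;
  tables procedure*complication;
run;
```

Dropping ORDER=FORMATTED orders COMPLICATION by its internal values 1 and 2 instead, that is, 'Yes' before 'No'.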
So, only tables 8 and 9 (involving variable AGE) remain. I'm sure other people have worked on this type of problem before -- creating data from given summary statistics -- so there must be more advanced techniques for this than I'm aware of.
From table 8 we get N=9, Mean=43.44 and Std=5.81 for AGE in the subgroup with vvalue(complication)='Yes'. Age values in clinical studies are mostly integers. This, together with the combination of N=9 and the decimals .44 of the mean, suggests that the sum of the nine age values is 9*43.4444444...=391. The formula Var(X)=E(X²)-E(X)² applied to the discrete uniform distribution on the nine age values x1, ..., x9 (whose variance is 8/9 times the sample variance std(of x1-x9)**2) yields, after multiplying by N²=81:
9*uss(of x1-x9) = sum(of x1-x9)**2 + 8*9*std(of x1-x9)**2
Given the inequality 5.805<=std(of x1-x9)<5.815 from the rounded Std value of 5.81, we conclude that
155308 <= 9*uss(of x1-x9) <= 155315
since 9*uss(of x1-x9) is an integer. But uss(of x1-x9) itself is an integer, too, and only one of the integers 155308, ..., 155315 is divisible by 9, namely 155313, hence:
uss(of x1-x9)=17257 (and std(of x1-x9)=sqrt(304)/3=5.811865...)
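The arithmetic above can be checked with a short DATA step. This is only a sketch; the variable names (sum_x, lo, hi, k) are made up for illustration, and the sum 391 is the assumption derived from the rounded mean.

```sas
data _null_;
  sum_x = 391;                        /* assumed sum: 9 * 43.444...      */
  lo = sum_x**2 + 72*5.805**2;        /* lower bound for 9*uss           */
  hi = sum_x**2 + 72*5.815**2;        /* upper bound for 9*uss           */
  do k = ceil(lo) to floor(hi);       /* integer candidates for 9*uss    */
    if mod(k, 9) = 0 then do;         /* uss must be an integer as well  */
      uss = k/9;
      std = sqrt((k - sum_x**2)/72);  /* sample standard deviation       */
      put k= uss= std=;
    end;
  end;
run;
```

The only candidate written to the log should be k=155313 with uss=17257.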
Number theorists could certainly tell us more about the ways 17257 can be written as a sum of 9 squares ... and even with the constraint sum(of x1-x9)=391 there will be a number of solutions.
By arranging and shifting integers with sum 391, centered around the rounded mean value 43, I found this particular solution for x1, ..., x9 even without letting the computer search through large numbers of combinations:
36 38 39 41 43 46 46 47 55
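A quick way to confirm that these nine values match the published statistics is to feed them to the SUM, USS, MEAN and STD functions. The variable names in this sketch (count, total, ss, avg, sd) are illustrative only.

```sas
data _null_;
  array x{9} (36 38 39 41 43 46 46 47 55);  /* creates x1-x9 with these values */
  count = n(of x{*});
  total = sum(of x{*});
  ss    = uss(of x{*});                     /* uncorrected sum of squares */
  avg   = round(mean(of x{*}), 0.01);       /* rounded as in table 8 */
  sd    = round(std(of x{*}), 0.01);
  put count= total= ss= avg= sd=;
run;
```

The log line should read count=9 total=391 ss=17257 avg=43.44 sd=5.81.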
If you (unlike me) have SAS/OR, I think you can find all possible solutions for the above nine age values (assuming a reasonable age range, say, 18 - 90), tackle the second subgroup (N=221) in a similar way and ideally take table 9 of the paper into account in the optimization. Good luck and thanks for asking this inspiring question!
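Without SAS/OR, a plain DATA step can also enumerate the solutions for the N=9 subgroup by brute force. The following is a hypothetical sketch, not something tested in this thread. It assumes integer ages, sum 391 and uss 17257 as derived above. Since the corrected sum of squares is 17257-391**2/9 (about 270.2), no value can lie more than about 16.4 from the mean of 43.44, so a search range of 27 to 59 is wide enough. The dataset name age_solutions is made up.

```sas
/* Hypothetical exhaustive search over non-decreasing x1 <= ... <= x9. */
data age_solutions;
 do x1=27 to 59;
  do x2=x1 to 59;
   do x3=x2 to 59;
    do x4=x3 to 59;
     do x5=x4 to 59;
      do x6=x5 to 59;
       do x7=x6 to 59;
        do x8=x7 to 59;
         x9=391-sum(of x1-x8);   /* the last value is fixed by the sum */
         if x9<x8 then leave;    /* x9 only shrinks as x8 grows */
         if x9<=59 and uss(of x1-x9)=17257 then output;
        end;
       end;
      end;
     end;
    end;
   end;
  end;
 end;
run;
```

Each observation in age_solutions is one sorted set of nine ages with the required sum and sum of squares. The loops run through tens of millions of combinations, so the step may take a while; the second subgroup (N=221) is far too large for this approach.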
Have you tried writing an e-mail to the author?
@Ksharp wrote:
Or you could reproduce it from the PROC FREQ results in the paper.
Exactly. But the challenging (and interesting!) part is the AGE distribution in dataset SPARSE. 🙂 I'm working on that and I see a good chance of finding solutions for the subgroup with vvalue(Complication)='Yes' (N=9).
Thank you!