Background:
I am using GENMOD to run a GEE on survey-weighted data. My code below accounts for the correlated aspect of my data, which is hospital ID. I am using a national dataset; the total is ~54 million observations and ~50 variables (I already cut out many, but I realize I should have kept only the ones absolutely needed).
ods graphics off;
ods exclude all;
ods results off;
sasfile work.finalscabies open;
proc genmod data=finalscabies;
   class hosp_nis nis_stratum race homeless female agecat pay1 zipinc_qrtl / descending;
   model scabies(event='1') = race female agecat pay1 homeless zipinc_qrtl / dist=bin link=logit maxiter=10;
   weight discwt;
   /* hospital clusters nested within sampling strata */
   repeated subject=hosp_nis(nis_stratum) / type=exch;
   estimate 'Black'    race 0 0 0 0 1 -1 / exp;
   estimate 'Hispanic' race 0 0 0 1 0 -1 / exp;
   estimate 'API'      race 0 0 1 0 0 -1 / exp;
   estimate 'NA'       race 0 1 0 0 0 -1 / exp;
   lsmeans homeless / or cl;
   lsmeans female / or cl;
   lsmeans agecat / or cl;
   lsmeans pay1 / or diff=all cl;
   lsmeans zipinc_qrtl / or diff=all cl;
   ods output Estimates=estimateESTscabies GEEEmpPEst=GEEest
              GEEFitCriteria=GEEFit LSMeans=OR1 Diffs=ORDiffs;
run;
ods exclude none;
sasfile work.finalscabies close;
Dilemma:
I ran the GEE without any of the ESTIMATE statements and without the SASFILE line to load the data into memory. This took 1 hour to run and produce output.
I added the 4 ESTIMATE statements and ran the code; it took 33 hours. Unfortunately, I still had tweaks to make. I added the LSMEANS lines, changed the ODS settings to hopefully improve performance, and ran the code again. This run took 48 hours, at which point my computer did an automatic update without me realizing it and all was lost.
Finally, I added the SASFILE line to load this massive file (24 GB) into memory; I have 40 GB of RAM.
I ran the code for the third time, and it is currently at 36 hours elapsed. Here's where the real question comes in. The log only shows:
NOTE: Writing HTML5(EGHTML) Body file: EGHTML
27
28 ods graphics off;
29 ods exclude all;
30 ods results off;
In prior runs of the code, the log would show:
"NOTE: Algorithm Converged" after approximately 15-20 minutes.
What is my log reflecting? Surely it must still be running GENMOD after 36 hours?
Additional Information:
Looking at my RAM usage, I saw an initial increase to 28 GB used early in the run. It's now down to 18 GB in use.
Thank you to anyone who can provide advice, suggestions, or an answer!
(Yes, I will remove the extra variables if for some reason I have to run this again.)
GENMOD cannot provide a valid analysis of survey data. For survey data, only the SURVEY procedures (SURVEYFREQ, SURVEYLOGISTIC, etc.) can provide a proper analysis of survey sample data. A WEIGHT statement in other procedures may produce correct parameter estimates, but the variances of those estimates will not be correct. Special variance estimators are needed in the analysis of survey data, and only the SURVEY procedures have these estimators.
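For reference, here is a minimal sketch of a design-based version of your model in SURVEYLOGISTIC, assuming nis_stratum, hosp_nis, and discwt are the stratum, cluster (PSU), and weight variables as in your GENMOD code; it uses Taylor series linearization for variance estimation by default:

proc surveylogistic data=finalscabies;
   strata nis_stratum;       /* sampling strata */
   cluster hosp_nis;         /* hospitals as clusters (PSUs) */
   weight discwt;            /* discharge-level survey weight */
   class race homeless female agecat pay1 zipinc_qrtl / param=ref;   /* reference-cell coding; pick to match your contrasts */
   model scabies(event='1') = race female agecat pay1 homeless zipinc_qrtl;
run;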
Thank you for this prompt reply.
I am basing my analysis on this paper, which describes the use of GENMOD for survey-weighted data:
https://support.sas.com/resources/papers/proceedings13/272-2013.pdf
The intent of using GENMOD and GEE is also to explore how the results compare between an exchangeable working correlation structure and an independent (identity) structure. Should the results be similar, I will likely perform the final analysis in SURVEYLOGISTIC.
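For that comparison, only the working correlation type on the REPEATED statement needs to change; a sketch of the independence run (TYPE=IND is my reading of the "identity" structure):

proc genmod data=finalscabies;
   class hosp_nis nis_stratum race homeless female agecat pay1 zipinc_qrtl / descending;
   model scabies(event='1') = race female agecat pay1 homeless zipinc_qrtl / dist=bin link=logit;
   weight discwt;
   repeated subject=hosp_nis(nis_stratum) / type=ind;   /* independent working correlation */
run;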
I appreciate any further thoughts or suggestions.
@culliso3 wrote:
Thank you for this prompt reply.
I am basing my analysis on this paper, which describes the use of GENMOD for survey-weighted data:
https://support.sas.com/resources/papers/proceedings13/272-2013.pdf
The intent of using GENMOD and GEE is also to explore how the results compare between an exchangeable working correlation structure and an independent (identity) structure. Should the results be similar, I will likely perform the final analysis in SURVEYLOGISTIC.
I appreciate any further thoughts or suggestions.
What does the documentation of your data source say about the sample design? Was it stratified? Clustered? Something other than a simple random sample, such as PPS or systematic sampling?
If you look a bit closer at the paper, you will see that the SURVEYFREQ example did not use any of the survey design elements such as STRATA or CLUSTER, which can make a noticeable difference in results. Not to mention the NOMCAR option, which is used to treat missing values as not missing completely at random. Many surveys have skip patterns where some questions are asked only based on responses to other questions, so some of the data is missing systematically; without the proper adjustments, results may be off.
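For illustration, a minimal sketch of a SURVEYFREQ call that does use those design elements, with the variable names borrowed from the code above:

proc surveyfreq data=finalscabies nomcar;   /* NOMCAR: treat missing values as not missing completely at random */
   strata nis_stratum;
   cluster hosp_nis;
   weight discwt;
   tables scabies*race / row cl;
run;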
Thank you for your reply, ballardw.
I agree it is curious that they didn't include the strata or cluster elements, although the paragraph below Output 2 does acknowledge their importance for accurate calculations.
The documentation of the data set describes the sampling as follows:
The universe is separated into strata, determined by 5 variables. Multiple individual clusters belong within each stratum, and each cluster represents a hospital. Sampling of data from each cluster is based on "a probability sample of all clusters within a frame, with sampling probabilities proportional to the number of clusters in each stratum".
My REPEATED statement is structured such that each cluster [Hosp_nis] is nested within a stratum [Nis_stratum], which, although I may certainly be incorrect, I believe is structured suitably to account for this sampling design.
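As a sanity check on that nesting (my own addition, reusing the variable names from my code above), counting distinct hospitals per stratum, and flagging any hospital that appears in more than one stratum, confirms whether the nested SUBJECT= specification matches the data:

proc sql;
   /* number of hospital clusters in each sampling stratum */
   select nis_stratum, count(distinct hosp_nis) as n_hospitals
      from finalscabies
      group by nis_stratum;
   /* any hospital assigned to more than one stratum breaks the nesting */
   select hosp_nis
      from finalscabies
      group by hosp_nis
      having count(distinct nis_stratum) > 1;
quit;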