Hi all,

Currently at the company I work for, we use benchmarks of 'subjects' (I call them 'subjects' for anonymity reasons) to discover subjects with 'unusual behaviour'. Basically, we have a set of subjects, each with a 'score' on a 'risk', and we want to discover the 'highest scorers' (hence the benchmark) in order to further analyse them.

Until now, we have worked with percentiles to define the 'subjects to further analyse'. Say we have 150 subjects with a score between 0 and 1 (a percentage), ranked from high to low. What we did until now was: further analyse all the subjects above the 90th percentile, i.e. the 10% highest scorers. As you all know, this method doesn't account for the actual scores, their mean, or their spread. If I have 150 subjects that score about the same on a risk, except for the top 3, which score very high, I don't want to investigate the top 10% (15 subjects), but only the top 3, right? Therefore we are now looking for a method better than percentiles to determine the 'highest scorers'. FYI, we don't bother with the lowest scorers (yet).

One method we are considering is, of course, the mean + X*standard deviations. Subjects who fall above our predefined upper bound, with X = 1.28 (upper 10%), 1.64 (upper 5%), or 1.96 (upper 2.5%) standard deviations above the mean, would be flagged as 'to be further analysed'. So far, so good: this looks to me like a solid method to determine the highest scorers while taking into account the mean and the spread in the data.

As the title suggests, though, most of the 'risks' we analyse contain data that is not normally distributed. That still doesn't have to be a problem, since data can be transformed to become (more) normally distributed. I know of different kinds of transformation from Tukey's ladder of transformations. After transforming, I can calculate the mean and the upper bound, and then back-transform to the original scale to determine my subjects of interest. This sounds like a solid method to me, right? Transformation --> calculation --> back-transformation is, as far as I know, standard practice in science.

Now we come to the main questions of this discussion: how do I determine whether my data is normally distributed after transformation? And what if my data is still not (perfectly) normally distributed even after transformation? Can I still use the mean + X*standard deviation method to determine which subjects to further analyse?

I have attached 3 files to show how I determine whether my data is (somewhat) normally distributed. The data used to generate the 3 files is already ln-transformed (log(x) in SAS, not log10(x)). The first picture shows the output of the UNIVARIATE procedure, in which I look at several things:

- Skewness and kurtosis: if skewness is between -1 and +1, it suggests to me a normal distribution; if kurtosis is < 1, it suggests to me a normal distribution.
- Mean and median: if the mean is approximately the same as the median, it suggests to me a normal distribution.
- Tests for normality: if the tests are NOT significant, it suggests to me a normal distribution.

Then I look at the histogram: if it 'looks like' a normal distribution, it suggests to me a normal distribution. Finally I look at the QQ plot: if it is almost a straight line, it suggests to me a normal distribution.
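For reference, here is a minimal SAS sketch of these checks, assuming a dataset WORK.SCORES with one row per subject and a raw score variable SCORE between 0 and 1 (the dataset and variable names are placeholders, not our real data):

    /* ln-transform the raw scores; log() in SAS is the
       natural logarithm, not log10() */
    data scores_ln;
       set scores;
       score_ln = log(score);
    run;

    /* The NORMAL option prints skewness, kurtosis, mean vs. median,
       and the formal tests of normality (Shapiro-Wilk etc.);
       HISTOGRAM and QQPLOT overlay a fitted normal for the
       visual checks. */
    proc univariate data=scores_ln normal;
       var score_ln;
       histogram score_ln / normal;
       qqplot score_ln / normal(mu=est sigma=est);
    run;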
Based on the files I attached, I would decide that my ln-transformed data is distributed normally enough to do my mean + X*sd calculations (a sketch of that calculation is at the end of this post). I realize that most of the judgements I make are 'arbitrary', except for the tests of normality. The only non-arbitrary measures (the tests for normality) reject the hypothesis of normally distributed data, and still I would conclude that my data is normally distributed enough, based on what I 'see'. Hence I am here asking for help on the matter.

1) Is my method of determining normality of the transformed data appropriate? If not, how can I best judge whether my data is normally distributed?
2) What other transformations or tricks are there to get normally distributed data, if the 'regular' methods don't work?
3) If I still do mean + X*sd calculations on data that is not perfectly normal, what are the consequences? And specifically, what are the consequences for my initial goal, i.e. determining the high scorers / subjects of interest?

Finally, do I REALLY require (perfectly) normally distributed data in order to select my 'subjects of interest' with the mean + X*sd method?
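For concreteness, here is a minimal SAS sketch of the mean + X*sd selection I have in mind, continuing from the SCORES_LN dataset above. X = 1.645 (upper 5%) is just an example value, and again all names are placeholders:

    /* compute mean and standard deviation on the ln scale */
    proc means data=scores_ln noprint;
       var score_ln;
       output out=stats(drop=_type_ _freq_) mean=mu std=sd;
    run;

    /* flag subjects above mean + X*sd; the cutoff is also
       back-transformed to the original 0-1 scale */
    data flagged;
       if _n_ = 1 then set stats;          /* make mu and sd available on every row */
       set scores_ln;
       cutoff_ln = mu + 1.645*sd;          /* upper bound on the ln scale */
       cutoff    = exp(cutoff_ln);         /* back-transformed cutoff, original scale */
       flag      = (score_ln > cutoff_ln); /* 1 = subject of interest */
    run;

    /* list the flagged subjects of interest */
    proc print data=flagged;
       where flag = 1;
       var score score_ln cutoff;
    run;

Since exp() is monotone, flagging on the ln scale against CUTOFF_LN is equivalent to comparing the raw score with the back-transformed CUTOFF.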