06-01-2018
zemastear
Calcite | Level 5
Member since
12-07-2012
- 14 Posts
- 0 Likes Given
- 0 Solutions
- 0 Likes Received
Activity Feed for zemastear
- Posted Re: Calculating mean + sd on non-normal data on SAS Procedures. 10-28-2014 03:52 AM
- Posted Re: Calculating mean + sd on non-normal data on SAS Procedures. 10-23-2014 02:50 AM
- Posted Re: Calculating mean + sd on non-normal data on SAS Procedures. 10-23-2014 02:47 AM
- Posted Re: Calculating mean + sd on non-normal data on SAS Procedures. 10-22-2014 09:48 AM
- Posted Calculating mean + sd on non-normal data on SAS Procedures. 10-22-2014 08:32 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-13-2012 08:27 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-13-2012 08:11 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-12-2012 10:17 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-11-2012 08:30 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-11-2012 08:06 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-10-2012 09:21 AM
- Posted Re: Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-10-2012 08:50 AM
- Posted Re: Proc genmod: testing the significance of the difference between groups on SAS Procedures. 12-08-2012 07:39 PM
- Posted Modeling smoking in PROC GLIMMIX on Statistical Procedures. 12-07-2012 10:39 AM
10-28-2014
03:52 AM
Thank you for all your answers so far, Steve. You have given me enough information to work with for now.
10-23-2014
02:50 AM
I don't quite understand what you mean, as I am not familiar at all with clustering procedures. Can you roughly explain what they are and what they do? In the meantime, I will look around for myself too.
10-23-2014
02:47 AM
Regarding the extreme value distributions: I read about the Weibull distribution when I was looking around for information on this matter, but I have no clue what it does or how to use it. Thanks for referring me to Wikipedia. Regarding the Shapiro-Wilk test: I know that's the best one to use, and setting the alpha very low is a good idea. The 'problem' is that we have some risks with more than 2000 observations, for which the Shapiro-Wilk test doesn't work. Are there alternatives?
10-22-2014
09:48 AM
Thanks a lot for the swift reply, Steve.

Regarding your 1st point: I forgot to mention that we are looking for a uniform method for the whole company to select 'black swans'. That means we have different 'fields' in which we do our analyses (about 10). Every field, depending on its size, has a different number of risks for which we make benchmarks (ranging from 2 risks to something like 20). Every risk is different: risk A in field B contains, for example, scores on a 0-to-1 scale ('proportions' like 0.1, 0.15, 0.16, 0.32, 0.86, etc.), while risk C in field D contains continuous 'normal numbers' (1000, 1250, 3256, 7500, etc.) with no predefined boundaries. For every risk in every field we would like to use, as far as possible, the same method for selecting the 'black swans'. Obviously it will not be possible to find ONE method that works for all. Every risk/benchmark will have its own skewness (one could be skewed to the left, another to the right), so eventually we have to look at the correct transformation for every single risk, but it would be good if we could develop some general guidelines on how to select the black swans. Something like: for risks on a percentage/proportion scale we use a logistic transformation, for continuous data we use the xxxx-transformation. Hence we don't want to look in depth at the gaps (your 1st point) or at the percentiles (your 3rd point) for every single risk in order to select the black swans. We want it to be more of an automated process (as far as possible), without too much attention to or in-depth knowledge of the specific risk itself: (random) risk/benchmark X appears --> transform the data appropriately (if percentage, then logistic transformation; if continuous, then xxx-transformation) --> compute mean + sd --> compute the upper range --> transform it back --> select subjects with a score above the back-transformed upper range --> result: X subjects for further analysis.

Regarding your 2nd and 4th points: let's see what comes out of your or others' replies to what I posted here.

Regarding your 5th point: what are those "distributions associated with extreme values"?

A question to clarify something: is my method of judging whether a variable is normally distributed somewhat correct if I focus on the QQ-plot? Or do I need to focus more on the tests of normality?
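To make it concrete, something like the following sketch is what I have in mind in SAS terms (dataset and variable names are made up for illustration, and the log transform is just one example of a per-risk choice):

/* Sketch: flag 'black swans' via transform --> mean + k*sd --> back-transform */
%let k = 1.645;                        /* upper 5% under normality */

data work.risk_log;
   set work.risk_scores;               /* hypothetical input dataset */
   log_score = log(score);             /* transformation chosen per risk type */
run;

proc means data=work.risk_log noprint;
   var log_score;
   output out=work.stats mean=m std=s;
run;

data work.flagged;
   if _n_ = 1 then set work.stats;     /* bring in mean and sd */
   set work.risk_log;
   upper = exp(m + &k*s);              /* back-transformed upper range */
   flag  = (score > upper);            /* 1 = candidate 'black swan' */
run;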
10-22-2014
08:32 AM
Hi all,

Currently at the company I work for, we use benchmarks of 'subjects' (I call them 'subjects' for anonymity reasons) in order to discover subjects with 'unusual behaviour'. Basically what we do is: we have a set of subjects with a 'score' on a 'risk'. We want to discover the 'highest scorers' (hence the benchmark) in order to further analyze them.

Until now, we have worked with percentiles to define the 'subjects to further analyse'. Let's say we have 150 subjects with a score between 0 and 1 (percentage), ranked from high to low. What we did until now is say: let's further analyse all subjects above the 90th percentile, i.e. the 10% highest scorers. As you all know, this method doesn't account for the actual scores, the mean of the scores or the spread within the scores. If I have 150 subjects which score relatively the same on a risk, except for the top 3, which score very high, I don't want to further investigate the top 10% (15 subjects), but only the top 3, right? Therefore we are now looking for a method better than percentiles to determine the 'highest scorers'. FYI, we don't bother with the lowest scorers (yet).

One method we are thinking of is of course the use of the mean + X*standard deviations. Doing so, subjects who fall outside our predefined upper range, defined as 1.28 (= upper 10%), 1.64 (= upper 5%) or 1.96 (= upper 2.5%) standard deviations above the mean, will be flagged as 'to be further analyzed'. So far, so good. This looks to me like a solid method to determine the highest scorers while taking into account the mean and the spread in the data.

As the title suggests, though, most of the 'risks' we analyse contain data which is not normally distributed. That still doesn't have to be a problem, since data can be transformed in order to become normally distributed. I know of different kinds of transformations, following Tukey's ladder of transformations. After transformation I can calculate the mean and upper range and then transform the data back to the original values in order to determine who my subjects of interest are. This sounds like a solid method to me, right? Transformation --> calculation --> back-transformation is what is always used in science, as far as I know.

Now we come to the main questions of this discussion: How do I determine if my data is normally distributed after transformation? What if my data is still not (perfectly) normally distributed even after transformation? Can I still use the method of mean + X*standard deviation to determine which subjects to further analyse?

I have attached 3 files to show how I determine whether my data is (somewhat) normally distributed. The data used to generate the 3 files is already ln-transformed (log(x) in SAS, not log10(x)). The first picture shows the result of the UNIVARIATE procedure. In this output I look at different things:
- Skewness and kurtosis: if skewness is between -1 and +1, it suggests to me a normal distribution; if kurtosis is < 1, it suggests to me a normal distribution.
- Mean and median: if the mean is approximately the same as the median, it suggests to me a normal distribution.
- Tests for normality: if the tests are NOT significant, it suggests to me a normal distribution.
Then I look at the histogram: if it 'looks like' a normal distribution, it suggests to me a normal distribution. Then I look at the QQ-plot: if it is almost a straight line, it suggests to me a normal distribution.

Based on the files I attached, I would decide that my ln-transformed data is distributed normally enough to do my mean + X*sd calculations. I realize that most of the judgments I make are 'arbitrary', except for the tests of normality. The only non-arbitrary measures of normality (the tests for normality) reject the hypothesis of normally distributed data, and still I would conclude that my data is distributed normally enough, based on what I 'see'. Hence I am here asking for help on the matter.

1) Is my method of determining the normality of the transformed data appropriate? If not, how can I best judge whether my data is normally distributed?
2) What other transformations or tricks are there to get normally distributed data, if the 'regular' methods don't work?
3) If I still do mean + X*sd calculations on not perfectly normal data, what are the consequences? And specifically, what are the consequences with regard to my initial question, i.e. determining the high scorers / subjects of interest?

Finally, do I REALLY require (perfectly) normally distributed data in order to select my 'subjects of interest' with the method of mean + X*sd?
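For reference, this is roughly how I produce the checks described above (illustrative dataset and variable names, not my real ones):

/* Sketch: normality checks on the ln-transformed score */
data work.scores_ln;
   set work.scores;              /* hypothetical dataset */
   ln_score = log(score);        /* natural log, i.e. log(x) in SAS */
run;

proc univariate data=work.scores_ln normal;      /* NORMAL gives the tests for normality */
   var ln_score;
   histogram ln_score / normal;                  /* histogram with fitted normal curve */
   qqplot ln_score / normal(mu=est sigma=est);   /* QQ-plot against the estimated normal */
run;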
12-13-2012
08:27 AM
I do not expect something to be wrong with the data. It has been used for many other projects and it has been around for years. It is also constantly managed by someone who does all the data operations on it. I am only making use of the data as it is and did some minor editing, which I checked and I am sure are all correct. If we are not able to find the cause of this, then too bad I'd say. Thanks a lot anyway.
12-13-2012
08:11 AM
I took a look, but age is not missing. It's actually quite complete.
12-12-2012
10:17 AM
Here I am again. The results of my analyses were so poor that I decided to try this.

> After seeing the estimates of the random effect, I don't think that will put things in the right order of magnitude. The logit of the response is log(822/3473) - log(2651/3473) = -1.171. When I plug in values, I get values that are off by the magnitude you are reporting.

I don't know exactly what you are calculating there or whether it is important, but when I calculate this, I get -0.50853791?

> What happens when you use the LSMEANS statement? Try
> lsmeans gendum60/at means ilink e;
> lsmeans gendum60/at age=70 ilink e;

lsmeans gendum60/at means ilink e;
ERROR: Only class variables allowed in this effect.
Smoking Full
The GLIMMIX Procedure
sex=1

Model Information
Data Set: WORK.dataname
Response Variable: smoking
Response Distribution: Binary
Link Function: Logit
Variance Function: Default
Variance Matrix Blocked By: id
Estimation Technique: Maximum Likelihood
Likelihood Approximation: Gauss-Hermite Quadrature
Degrees of Freedom Method: Containment

Class Level Information
ageclass: 4 levels (1 2 3 4)

Number of Observations Read: 11964
Number of Observations Used: 8182

Response Profile (Ordered Value / smoking / Total Frequency)
1  0  5605
2  1  2577

Dimensions
G-side Cov. Parameters: 2
Columns in X: 9
Columns in Z per Subject: 2
Subjects (Blocks in V): 2991
Max Obs per Subject: 3

Optimization Information
Optimization Technique: Dual Quasi-Newton
Parameters in Optimization: 10
Lower Boundaries: 2
Upper Boundaries: 0
Fixed Effects: Not Profiled
Starting From: GLM estimates
Quadrature Points: 7

Iteration History (Iteration / Restarts / Evaluations / Objective Function / Change / Max Gradient)
0 0 4 7924.0808061 . 296043.2
1 0 14 7464.3488984 459.73190764 11431.85
2 0 3 7235.5407215 228.80817687 1259.722
3 0 4 7222.9621138 12.57860771 1491.717
4 0 4 7217.0681408 5.89397306 1635.135
5 0 4 7216.2814485 0.78669222 1605.412
6 0 4 7206.613118 9.66833050 3964.965
7 0 4 7142.9150265 63.69809154 10381.62
8 0 3 7138.4727633 4.44226324 3083.605
9 0 2 7131.8228528 6.64991052 4798.037
10 0 4 7111.9430664 19.87978634 2394.794
11 0 2 7101.0562478 10.88681862 4136.373
12 0 3 7096.4972303 4.55901754 2491.948
13 0 3 7094.4543614 2.04286881 918.8764
14 0 4 7086.9710653 7.48329611 4893.221
15 0 2 7082.7288234 4.24224195 1764.441
16 0 3 7080.0529968 2.67582655 442.8867
17 0 3 7079.6473463 0.40565057 864.7367
18 0 2 7079.4226727 0.22467353 608.6162
19 0 2 7079.0440924 0.37858034 45.46735
20 0 4 7078.2098257 0.83426673 1435.201
21 0 4 7076.2935373 1.91628834 893.0885
22 0 3 7075.9743026 0.31923472 297.7529
23 0 3 7075.8962293 0.07807334 104.861
24 0 4 7075.6627013 0.23352799 1142.108
25 0 4 7073.7881887 1.87451260 1057.395
26 0 3 7073.0449224 0.74326630 340.2352
27 0 3 7072.9644057 0.08051665 157.9887
28 0 3 7072.957057 0.00734868 67.15604
29 0 3 7072.9554633 0.00159374 35.61595
30 0 4 7072.9476656 0.00779774 52.25779
31 0 3 7072.9459177 0.00174785 6.514212
32 0 3 7072.9457961 0.00012161 0.446825
33 0 3 7072.945793 0.00000312 0.040506

Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics
-2 Log Likelihood: 7072.95
AIC (smaller is better): 7092.95
AICC (smaller is better): 7092.97
BIC (smaller is better): 7152.98
CAIC (smaller is better): 7162.98
HQIC (smaller is better): 7114.54

Fit Statistics for Conditional Distribution
-2 log L(smoking | r. effects): 2009.74
Pearson Chi-Square: 1444.48
Pearson Chi-Square / DF: 0.18

Covariance Parameter Estimates (Cov Parm / Subject / Estimate / Standard Error)
Intercept  id  23.5344   2.4922
age        id  0.004184  0.000835

Solutions for Fixed Effects (Effect / Estimate / Standard Error / DF / t Value / Pr > |t| / Alpha / Lower / Upper)
ageclass 1      0.8444   0.7731   2209   1.09  0.2749  0.05  -0.6717   2.3605
ageclass 2      1.7424   0.7304   2209   2.39  0.0171  0.05   0.3101   3.1746
ageclass 3      1.0363   0.9537   2209   1.09  0.2774  0.05  -0.8340   2.9066
ageclass 4      9.0983   1.6102   2209   5.65  <.0001  0.05   5.9406  12.2560
age            -0.2268   0.02837  2978  -8.00  <.0001  0.05  -0.2825  -0.1712
age*ageclass 1  0.1352   0.03717  2209   3.64  0.0003  0.05   0.06233  0.2081
age*ageclass 2  0.1237   0.03288  2209   3.76  0.0002  0.05   0.05918  0.1881
age*ageclass 3  0.1466   0.03351  2209   4.37  <.0001  0.05   0.08086  0.2123
age*ageclass 4  0        .        .      .     .       .      .        .

Type III Tests of Fixed Effects (Effect / Num DF / Den DF / F Value / Pr > F)
ageclass      4  2209    9.67  <.0001
age           1  2978  107.62  <.0001
age*ageclass  3  2209    7.02  0.0001

Covariance matrix for fixed effects (row label and row number, then columns 1-8)
ageclass 1 (1):      0.5977    0.002968   0.004430   0.01476   -0.00031  -0.01738   0.000216   0.000188
ageclass 2 (2):      0.002968  0.5334     0.009498   0.03279   -0.00068   0.000594  -0.01201   0.000424
ageclass 3 (3):      0.004430  0.009498   0.9096     0.04726   -0.00098   0.000860   0.000693  -0.01710
ageclass 4 (4):      0.01476   0.03279    0.04726    2.5928    -0.04467   0.04389    0.04336    0.04314
age (5):            -0.00031  -0.00068   -0.00098   -0.04467    0.000805 -0.00079   -0.00078   -0.00077
age*ageclass 1 (6): -0.01738   0.000594   0.000860   0.04389   -0.00079   0.001382   0.000763   0.000759
age*ageclass 2 (7):  0.000216 -0.01201    0.000693   0.04336   -0.00078   0.000763   0.001081   0.000753
age*ageclass 3 (8):  0.000188  0.000424  -0.01710    0.04314   -0.00077   0.000759   0.000753   0.001123
age*ageclass 4 (9):

Coefficients for ageclass least squares means at age = 45.45: ageclass i has coefficient 1 in column i; age and age*ageclass i have coefficient 45.45.

ageclass Least Squares Means at age = 45.45 (ageclass / age / ? / Estimate / Standard Error / DF / t Value / Pr > |t| / Mean / Standard Error Mean)
1  45.45  514520  -3.3197  0.5074  2209   -6.54  <.0001  0.03490  0.01709
2  45.45  514520  -2.9478  0.2642  2209  -11.16  <.0001  0.04984  0.01251
3  45.45  514520  -2.6121  0.2471  2209  -10.57  <.0001  0.06837  0.01574
4  45.45  514520  -1.2118  0.4404  2209   -2.75  0.0060  0.2294   0.07785

Coefficients for ageclass least squares means at age = 40: ageclass i has coefficient 1 in column i; age and age*ageclass i have coefficient 40.

ageclass Least Squares Means at age = 40 (same columns as above)
1  40.00  514520  -2.8204  0.4056  2209   -6.95  <.0001  0.05623  0.02153
2  40.00  514520  -2.3854  0.2292  2209  -10.41  <.0001  0.08429  0.01769
3  40.00  514520  -2.1746  0.2810  2209   -7.74  <.0001  0.1021   0.02575
4  40.00  514520   0.02444 0.5534  2209    0.04  0.9648  0.5061   0.1383

Coefficients for ageclass least squares means at leeftijd = 50: ageclass i has coefficient 1 in column i; leeftijd and age*ageclass i have coefficient 50.

ageclass Least Squares Means at age = 50 (same columns as above)
1  50.00  514520  -3.7366  0.6025  2209   -6.20  <.0001  0.02328  0.01370
2  50.00  514520  -3.4173  0.3153  2209  -10.84  <.0001  0.03176  0.009695
3  50.00  514520  -2.9773  0.2523  2209  -11.80  <.0001  0.04846  0.01163
4  50.00  514520  -2.2440  0.3700  2209   -6.07  <.0001  0.09587  0.03207
Basically I just ran the model you suggested and presented the output. The ref=first option somehow didn't work, so I left it out. The model is now using category 4 as the reference, right? I don't think it should matter much when it is the estimates/lsmeans that we are interested in, though it would be nice to still have the youngest ageclass as the reference. I tried order=data, but that didn't change things.

Note: I used other age classes now (20-29, 30-39, 40-49, 50-59), since I am using another dataset (I have two datasets for this project). But it shouldn't matter, since I had the problem of not being able to estimate the correct prevalences in both datasets. I think the method is incorrect and therefore don't expect the datasets to influence the outcomes.

So, what do we get here? We have three LSMEANS statements (see the sketch below): one that estimates, for every ageclass, the prevalence at its mean age, and two others that estimate the prevalence at age = 40 and age = 50? Looking at the tables, it looks like it still gives low prevalences? In these age classes, based on the raw data, I expect prevalences of around 30% (ranging from 20% to 40%) depending on the ageclass of interest, but definitely not the 2-10% we are seeing now, am I right?
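To be explicit, this is how I read the three statements, with ageclass as the class effect (my paraphrase, not necessarily the exact code that was run):

lsmeans ageclass / at means  ilink e;   /* prevalence per ageclass at its mean age */
lsmeans ageclass / at age=40 ilink e;   /* prevalence per ageclass at age 40 */
lsmeans ageclass / at age=50 ilink e;   /* prevalence per ageclass at age 50 */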
12-11-2012
08:30 AM
We will just present the difference in prevalences based on the raw data. In order to say whether those differences are significant, we will look at the model. So, if in the raw data we see a difference of, let's say, 8% at age 70 between generation 55-59 (reference) and generation 60-69, we will present that number. Then we will look at this estimate:

estimate 'gen 55-59 vs. gen 60-69 at age = 70' dum60 -1 age*dum60 -70 / cl;

If this is significant ("Pr > |t|" < 0.05), then we will say: the 8% difference is significant. I don't like this method so much...
12-11-2012
08:06 AM
Hey Steve, thanks again. I am far from an expert in SAS, but I had looked around a bit myself the other day and already thought it should somehow be possible to model what I want to model, and lsmeans/lsmestimate looked like a good solution at first sight. I had no clue how to do it, though, since I only know the statistical procedures I have used in the past; everything else is new to me, as is this. I talked to my statistician yesterday and we decided to go another way (a way I don't really like so much, especially if your suggestion turns out to work), but for now I will go on with what I already have, since time is running out for my project. I expect to have more time near the end of this month / start of next month (as I will be reviewing my article with the co-authors, I will have plenty of time in between, I think), and that is when I want to try your method. So I will leave this here for now, but I will get back to it sometime. I just think it's impossible that there is no way in SAS to do what I want to do. Therefore I am interested in experimenting with your suggestion somewhere in the near future.
12-10-2012
09:21 AM
Adding the random effect of age in the estimate statement (I did it for the second and third estimates, not the first) did cause the prevalence to rise, but not nearly as high as it 'should' be / as I want it to be / as the raw data suggests.
12-10-2012
08:50 AM
Smoking full model
The GLIMMIX Procedure
sex respondent=1

Model Information
Data Set: **
Response Variable: smoker
Response Distribution: Binary
Link Function: Logit
Variance Function: Default
Variance Matrix Blocked By: id
Estimation Technique: Maximum Likelihood
Likelihood Approximation: Gauss-Hermite Quadrature
Degrees of Freedom Method: Containment

Number of Observations Read: 5790
Number of Observations Used: 3473

Response Profile (Ordered Value / smoker / Total Frequency)
1  0  2651
2  1  822
The GLIMMIX procedure is modeling the probability that roker='1'.

Dimensions
G-side Cov. Parameters: 2
Columns in X: 8
Columns in Z per Subject: 2
Subjects (Blocks in V): 931
Max Obs per Subject: 6

Optimization Information
Optimization Technique: Dual Quasi-Newton
Parameters in Optimization: 10
Lower Boundaries: 2
Upper Boundaries: 0
Fixed Effects: Not Profiled
Starting From: GLM estimates
Quadrature Points: 7

Iteration History (Iteration / Restarts / Evaluations / Objective Function / Change / Max Gradient)
0 0 4 2664.2076557 . 247518.4
1 0 15 2531.8794796 132.32817606 10382.23
2 0 3 2413.6629589 118.21652075 12986.84
3 0 4 2411.987591 1.67536786 12781.79
4 0 4 2411.7530459 0.23454511 12714.24
5 0 3 2410.6125111 1.14053478 12417.07
6 0 4 2389.3924796 21.22003154 2125.985
7 0 2 2370.5918749 18.80060463 8539.3
8 0 2 2340.5986904 29.99318457 10467.24
9 0 3 2333.1846772 7.41401318 20957.11
10 0 4 2309.6450045 23.53967266 3675.545
11 0 3 2305.5393632 4.10564133 3792.51
12 0 3 2303.766906 1.77245715 1238.087
13 0 3 2303.3039486 0.46295749 929.9581
14 0 3 2303.0939216 0.21002696 224.6986
15 0 4 2299.8431727 3.25074887 2301.9
16 0 2 2297.2327543 2.61041844 1237.504
17 0 2 2293.4149064 3.81784791 535.4416
18 0 2 2288.5668127 4.84809373 2638.173
19 0 4 2287.8921513 0.67466138 1248.713
20 0 3 2287.5832383 0.30891300 125.6845
21 0 3 2287.5658631 0.01737515 127.9899
22 0 2 2287.5444346 0.02142856 101.2652
23 0 6 2287.0016694 0.54276516 394.4775
24 0 3 2286.6363686 0.36530084 150.4604
25 0 3 2286.4653755 0.17099304 193.7834
26 0 3 2286.3540602 0.11131532 36.6376
27 0 3 2286.3430099 0.01105026 35.43862
28 0 3 2286.3427712 0.00023876 9.727946
29 0 3 2286.3427548 0.00001641 3.908154

Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics
-2 Log Likelihood: 2286.34
AIC (smaller is better): 2306.34
AICC (smaller is better): 2306.41
BIC (smaller is better): 2354.71
CAIC (smaller is better): 2364.71
HQIC (smaller is better): 2324.79

Fit Statistics for Conditional Distribution
-2 log L(smoker | r. effects): 690.83
Pearson Chi-Square: 690.37
Pearson Chi-Square / DF: 0.20

Covariance Parameter Estimates (Cov Parm / Subject / Estimate / Standard Error)
Intercept  respnr  24.9367   4.9515
age        respnr  0.002104  .

Solutions for Fixed Effects (Effect / Estimate / Standard Error / DF / t Value / Pr > |t| / Alpha / Lower / Upper)
Intercept   10.7707   2.1945    927   4.91  <.0001  0.05    6.4641   15.0774
dum60        1.6981   3.0139   1610   0.56  0.5732  0.05   -4.2135    7.6096
dum70        7.3854   4.1763   1610   1.77  0.0772  0.05   -0.8062   15.5770
dum80        3.7303   9.8024   1610   0.38  0.7036  0.05  -15.4965   22.9571
age         -0.2081   0.03392   928  -6.13  <.0001  0.05   -0.2747   -0.1415
dum60*age   -0.03188  0.04490  1610  -0.71  0.4778  0.05   -0.1200    0.05619
dum70*age   -0.07526  0.05662  1610  -1.33  0.1840  0.05   -0.1863    0.03580
dum80*age   -0.02561  0.1173   1610  -0.22  0.8272  0.05   -0.2557    0.2045

Type III Tests of Fixed Effects (Effect / Num DF / Den DF / F Value / Pr > F)
dum60      1  1610   0.32  0.5732
dum70      1  1610   3.13  0.0772
dum80      1  1610   0.14  0.7036
age        1   928  37.62  <.0001
dum60*age  1  1610   0.50  0.4778
dum70*age  1  1610   1.77  0.1840
dum80*age  1  1610   0.05  0.8272

Covariance matrix for fixed effects (rows/columns: Intercept, dum60, dum70, dum80, age, dum60*age, dum70*age, dum80*age)
Intercept (1):   4.8156   -4.6886   -4.6077   -4.6707   -0.07191   0.06941   0.06868   0.06894
dum60 (2):      -4.6886    9.0835    4.8457    4.8053    0.06926  -0.1317   -0.07167  -0.07161
dum70 (3):      -4.6077    4.8457   17.4418    4.8857    0.06754  -0.07216  -0.2319   -0.07319
dum80 (4):      -4.6707    4.8053    4.8857   96.0872    0.06902  -0.07145  -0.07225  -1.1404
age (5):        -0.07191   0.06926   0.06754   0.06902   0.001151 -0.00110  -0.00108  -0.00109
dum60*age (6):   0.06941  -0.1317   -0.07216  -0.07145  -0.00110   0.002016  0.001139  0.001138
dum70*age (7):   0.06868  -0.07167  -0.2319   -0.07225  -0.00108   0.001139  0.003206  0.001154
dum80*age (8):   0.06894  -0.07161  -0.07319  -1.1404   -0.00109   0.001138  0.001154  0.01377

Estimates used:
estimate 'gen 55-59 vs. gen 60-69 at age = 80' dum60 -1 age*dum60 -70 / cl;
estimate 'gen 55-59 at age = 70' intercept 1 age 70 / ilink;
estimate 'gen 60-69 at age = 70' intercept 1 dum60 1 age 70 age*dum60 70 / ilink;

Estimates (Label / Estimate / Standard Error / DF / t Value / Pr > |t| / Alpha / Lower / Upper / Mean / Standard Error Mean / Lower Mean / Upper Mean)
gen 60-69 vs. gen 70-79 at age = 80   -2.2174  0.7119  1610   -3.11  0.0019  0.05  -3.6138  -0.8210  Non-est   .         .         .
gen 60-69 at age = 80                 -6.7284  0.5968  1610  -11.27  <.0001  0.05  -7.8991  -5.5577  0.001195  0.000712  0.000371  0.003843
gen 70-79 at age = 80                 -4.5110  0.5080  1610   -8.88  <.0001  0.05  -5.5074  -3.5147  0.01087   0.005461  0.004040  0.02890
This is the output of my analysis (as asked for in the other topic, but let's continue here). I will try adding the random effect in the estimate statement. Do I need to add the random effect in all my estimates? So, also in the first estimate, for the difference between two generations (gen 60-69 vs. gen 70-79 at age = 80)? If so, do I put it in the same way as for the other statements (| leeftijd 70)? Thanks a lot Steve, I will let you know what happens.
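Just to check that I understand the suggestion: is it something like this, with coefficients for the random intercept and random age effect supplied after the '|'? This is my own guess at the syntax, based on my existing estimate statements; I have not verified it.

estimate 'gen 55-59 at age = 70' intercept 1 age 70 | intercept 1 age 70 / ilink;
estimate 'gen 60-69 at age = 70' intercept 1 dum60 1 age 70 age*dum60 70 | intercept 1 age 70 / ilink;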
12-08-2012
07:39 PM
After quickly reading this topic, I think my problem/question is very similar.

I have data from a cohort study with 4 rounds. In these 4 rounds, we asked respondents about smoking. We plotted the prevalence of smoking in different age groups, defined by their age at baseline. The time between rounds is 5 years. So, if you have a group aged 40-49 (mean 45) at baseline (round 1), in round 3 (10 years later) this group will be aged 50-59 (mean 55). The prevalence of smoking in this group can be compared to the prevalence of the group aged 50-59 (mean 55) at baseline. Basically what I then want to say is: a younger generation (40-49) smokes less at mean age 55 than an older generation (50-59) at mean age 55. The link below shows an example (it's a link to a figure). In this example the 40-49 generation smokes 9% less at age 55. http://img35.imageshack.us/img35/2451/exampleej.jpg

In order to model these lines, a statistician advised us to use proc glimmix. The model I used is:

proc glimmix data=dataname initglm method=quad;
   model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80
         / dist=binary link=logit cl covb s;
   random intercept age / subject=id;
   by sex;
run;
quit;

dum60 = generation aged 60-69 at baseline
dum70 = generation aged 70-79 at baseline
dum80 = generation aged 80-89 at baseline
dum50 is the reference

Now I want to use estimate statements to test what I described above (in the figure): what is the difference between two generations at a predefined age, and is this difference significant? I have tried to estimate the difference between the generation 60-69 at age 70 and the generation 70-79 at age 70 with these statements (based on the advice of the statistician):

estimate dum60 1 dum70 -1 age*dum60 70 age*dum70 -70 / cl; (difference between the two lines)
estimate dum60 1 intercept 1 age 70 age*dum60 70 / ilink cl; (prevalence of the 60-69 group at age 70)
estimate dum70 1 intercept 1 age 70 age*dum70 70 / ilink cl; (prevalence of the 70-79 group at age 70)

The first statement should tell me whether the difference is significant. The second and third statements should tell me the prevalences of the two groups at age 70. We used ilink to show us these prevalences (we want to present the differences + significance in a bar chart).

The weird thing is, the prevalences shown by ilink do not correspond to the prevalences we find in the raw data. They do not look similar at all! ilink gives me prevalences of 1E-7, etc., while the 'real' prevalences based on the raw data are around 25% and 20% for those generations.

Is this the right method to test what I want to test, i.e. the magnitude of the difference between two generations and whether this difference is significant? If not, how else can it be done within the glimmix procedure? Here is a link to my original topic, with a few other problems too:

Hope you guys can help me. Thanks in advance.
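For reference, the 'raw' prevalences I mention come from simple frequency tables, roughly like this (illustrative code; 'round' is a stand-in for my actual survey-round variable):

proc freq data=dataname;
   where dum60 = 1;                          /* generation aged 60-69 at baseline */
   tables round*smoking / nocol nopercent;   /* row percentages give the prevalence per round */
run;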
12-07-2012
10:39 AM
Hello there,

I am trying to fit a multilevel random effects model to my data. The model looks like this:

proc glimmix data=dataname initglm /*abspconv=1E-4*/ method=quad;
   model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80
         / dist=binary link=logit cl covb s;
   random intercept age / subject=id;
   by sex;
run;
quit;

Smoking is a dichotomous outcome variable (yes = 1, no = 0), age as a continuous variable is one of the predictors, and there is an interaction of age with generation. I made dummies (dum50 (reference category), dum60, dum70, dum80) for each generation. Generations are defined by their age at baseline, so we have a generation which is aged 60-69 at baseline (dum60), etc. I am using a long dataset and we have 4 observations per subject (cohort study). Hence we are using proc glimmix, to correct for the repeated measurements.

I have a few problems with this model.

- Note the statement "by sex". I want to analyse men and women separately. When I run the model, everything works perfectly for men. For women, the model "doesn't work". I get an error: ERROR: Infeasible parameter values for evaluation of objective function with 1 quadrature point. I tried to google this, but I couldn't find it anywhere. I talked to a statistician already, but he couldn't really help me, except that he told me to change the convergence criteria. I tried a few things: 1) nloptions gconv=1E-3 fconv=1E-3; 2) abspconv=1E-4; 3) changing from method=quad to method=laplace. None of them worked (see the sketch below for where I put these options).

- Another, totally different, problem is that when the model works with age and intercept as random effects, the prevalence estimates do not correspond to the prevalences I observe when making frequency tables. For example, I ran some estimates together with the model:

estimate 'gen 60-69 at age = 70' intercept 1 gendum60 1 leeftijd 70 leeftijd*gendum60 70 / ilink;

It gave me a prevalence at age 70 of 1% or something, using the ilink feature, which converts the outcome back to prevalences (right?). When I run frequency tables for this generation, I see that at age 70 they have a prevalence of ~20%. And it's not only for this specific generation at this age; it's for every estimate I do. The model doesn't represent my data well at all. Sometimes it even gives prevalences of 1E-7, which is of course very weird. Again, I talked to the statistician and we tried to run the model without the random intercept and age. The prevalences estimated by the model were very accurate compared to the real data! But my question is: what happens when you remove the random intercept and age? I understand what you correct for when using them. But when removing them, am I still correcting for the repeated measurements for every subject? One thing I noted is that without random effects, 'Subjects (Blocks in V)' is 1, instead of the ~1000 I usually have for men.

Lots of stuff, I hope you guys can help me.
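This is roughly where I put those options (shown together here for brevity; I actually tried them one at a time):

/* Sketch of the attempts: relaxed convergence criteria and Laplace approximation */
proc glimmix data=dataname initglm abspconv=1E-4 method=laplace;   /* method=quad in the original run */
   nloptions gconv=1E-3 fconv=1E-3;   /* relaxed convergence criteria */
   model smoking(event="1") = dum60 dum70 dum80 age age*dum60 age*dum70 age*dum80
         / dist=binary link=logit cl covb s;
   random intercept age / subject=id;
   by sex;
run;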