# data analysis with categorical response

06-19-2017 08:30 PM

Hi All,

I just got some data from my experiment. The dependent variable was recorded categorically as 0, 2, 4, 6, 8, or 10. Other than the categorical response, the data set is quite simple, with only one treatment and 10 replicates. I'm wondering whether it's necessary to transform the categorical response. And which procedure would be a good choice for the analysis: PROC MIXED or PROC GLIMMIX?

Hope you could give me some advice on analyzing data with categorical response. I'll really appreciate it.

Thanks!

Accepted Solutions

Solution

06-20-2017 12:56 PM

Posted in reply to jjin0322

06-20-2017 11:32 AM - edited 06-20-2017 11:43 AM

disease severity was assigned according to when the plants showed disease symptoms: 1-6 days=10, 7-10 days=8, 11-16 days=6, 17-22 days=4, 23-28 days=2, and no symptoms at day 28=0

The onset of symptoms is on a continuous scale, although the recording of the levels breaks the values into discrete numbers. Whether you analyze this as continuous or discrete, both are approximations to the actual number of days to onset of symptoms. Which is the better approximation? There is no way of knowing, but I lean towards continuous. In fact, you could use the midpoint of each range as your continuous level, which would be an even better approximation: 3.5 = 1-6 days, 8.5 = 7-10 days, 13.5 = 11-16 days, *etc*. (And if you're going to do a similar study in the future, don't group the results into discrete categories; record the actual number of days!)
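A minimal sketch of the midpoint recoding, assuming the `severity` data set and the `disease` variable posted in this thread. The names `severity_mid` and `days_mid` are hypothetical, and the midpoints 19.5 and 25.5 are simply computed from the 17-22 and 23-28 day ranges. How to treat a score of 0 (no symptoms by day 28) is not covered by the midpoint idea, so this sketch sets it to missing, which is only an assumption and drops those plants from the analysis.

```
/* Sketch: recode each severity score to the midpoint of its day range. */
/* severity_mid and days_mid are hypothetical names.                     */
data severity_mid;
  set severity;
  select (disease);
    when (10) days_mid = 3.5;   /* 1-6 days   */
    when (8)  days_mid = 8.5;   /* 7-10 days  */
    when (6)  days_mid = 13.5;  /* 11-16 days */
    when (4)  days_mid = 19.5;  /* 17-22 days */
    when (2)  days_mid = 25.5;  /* 23-28 days */
    otherwise days_mid = .;     /* 0 = no symptoms by day 28: no range to take a midpoint of */
  end;
run;

/* Analyze the midpoint as a continuous response */
proc glimmix data=severity_mid;
  class isolate;
  model days_mid = isolate;
  lsmeans isolate / lines;
run;
```

Note that a larger `days_mid` means later onset, so the direction is reversed relative to the 0-10 severity score.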

proc glimmix data=severity;

class isolate rep;

model disease=isolate;

random rep;

lsmeans isolate/ lines;

run;

I don't see a need for rep in the model. If you leave it out, the replicates are lumped into the random error, which is where they should be. If you are going to include rep explicitly, it must be nested within isolate (replicate 1 of one isolate is a different plant from replicate 1 of another), as in

model disease = isolate rep(isolate); random rep(isolate);

otherwise you will not get the right result.
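As a sketch of the simpler option described above, this is the posted code with `rep` left out, so the replicates fall into the residual error. It assumes the `severity` data set from the question and changes nothing else.

```
/* Sketch: one-way comparison of isolates; replicates stay in the residual error */
proc glimmix data=severity;
  class isolate;              /* rep is no longer a CLASS variable */
  model disease = isolate;    /* no RANDOM statement               */
  lsmeans isolate / lines;    /* LINES groups isolates whose means do not differ significantly */
run;
```

With 21 isolates there are many pairwise comparisons, so adding a multiplicity adjustment such as `adjust=tukey` to the LSMEANS options may be worth considering.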

--

Paige Miller

All Replies

Posted in reply to jjin0322

06-20-2017 09:32 AM

If all of your X and Y variables are categorical, try PROC CATMOD.

Posted in reply to Ksharp

06-20-2017 10:18 AM

Thanks for your suggestion!

Yes, in my case, Y, which takes the values 10, 8, 6, 4, 2, 0, represents the disease severity caused by a pathogen. X is the pathogenic isolate, which is also categorical. What I am trying to do is compare the disease severity caused by different pathogenic isolates on the same host.

I'll take a look at Proc catmod and try it out, thanks again!

Posted in reply to jjin0322

06-20-2017 09:37 AM

No need for transformation.

Can these categorical data levels be assumed to actually be on a continuous scale or ordinal scale?

There are many choices for analysis:

- PROC GLIMMIX
- PROC LOGISTIC
- PROC CATMOD
- PROC GENMOD

With the limited information you have provided, I don't think we can advise further.

--

Paige Miller

Posted in reply to PaigeMiller

06-20-2017 10:07 AM

Hi PaigeMiller,

Thanks for your kind reply!

My experiment compared the disease severity caused by 21 pathogenic isolates on the same host plant. Ten host plants were used as replicates for each isolate inoculation, so 21*10=210 plants were used in total. Disease severity was assigned according to when the plants showed disease symptoms: 1-6 days=10, 7-10 days=8, 11-16 days=6, 17-22 days=4, 23-28 days=2, and no symptoms at day 28=0. I've attached the data set here, if you would like to take a look at it. I'd like to use PROC GLIMMIX, but I am not sure how to specify in the SAS code that the response is categorical. I hope you can give me more suggestions. Thanks in advance!

Posted in reply to jjin0322

06-20-2017 10:26 AM

People here don't usually open Excel files for fear of viruses or other executables included.

I'm not sure why you need to consider these as categories; treating the results as numeric ought to work better than treating them as categories.

In that case, the code should be relatively simple:

UNTESTED CODE

proc glimmix;

class pathogen; model y = pathogen; run;
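For comparison only, here is a sketch of how the response could be declared as ordered categories in PROC GLIMMIX, which is what the question asked about. It is not the approach recommended in this reply. It assumes the `severity` data set with the variables `disease` and `isolate` posted elsewhere in this thread, and it fits a cumulative logit (proportional odds) model, which is one reasonable choice for an ordinal response rather than the only one.

```
/* Sketch: treat the six severity levels as ordered categories */
proc glimmix data=severity;
  class isolate;
  /* DIST=MULTINOMIAL with LINK=CUMLOGIT requests a cumulative logit model */
  model disease = isolate / dist=multinomial link=cumlogit solution;
run;
```

The LSMEANS statement is left out because GLIMMIX does not support it for multinomial responses, so isolate comparisons would have to come from the parameter estimates instead. With isolates whose plants nearly all share one score, such a model may also be hard to fit.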

--

Paige Miller

Posted in reply to PaigeMiller

06-20-2017 11:10 AM

Oh.. sorry about the Excel files.

Since the dependent variable only takes the values 10, 8, 6, 4, 2, 0, I considered it categorical.

I guess this is a really naive question, but can I treat the dependent variable as numeric even though its values are not assigned on a continuous scale?

Below are the data set and the SAS code I tried; I just added rep as a random effect.

I really appreciate your help!

```
data severity;
input isolate $ rep disease;
datalines;
R0-G5-6 1 8
R0-G5-6 2 10
R0-G5-6 3 6
R0-G5-6 4 8
R0-G5-6 5 8
R0-G5-6 6 6
R0-G5-6 7 6
R0-G5-6 8 6
R0-G5-6 9 6
R0-G5-6 10 8
R0-G5-6A 1 0
R0-G5-6A 2 0
R0-G5-6A 3 0
R0-G5-6A 4 0
R0-G5-6A 5 0
R0-G5-6A 6 10
R0-G5-6A 7 0
R0-G5-6A 8 0
R0-G5-6A 9 0
R0-G5-6A 10 0
R0-G5-6B 1 8
R0-G5-6B 2 8
R0-G5-6B 3 8
R0-G5-6B 4 8
R0-G5-6B 5 8
R0-G5-6B 6 10
R0-G5-6B 7 6
R0-G5-6B 8 8
R0-G5-6B 9 8
R0-G5-6B 10 8
R0-G5-6C 1 10
R0-G5-6C 2 10
R0-G5-6C 3 8
R0-G5-6C 4 8
R0-G5-6C 5 8
R0-G5-6C 6 10
R0-G5-6C 7 6
R0-G5-6C 8 8
R0-G5-6C 9 8
R0-G5-6C 10 6
R0-G5-6E 1 6
R0-G5-6E 2 6
R0-G5-6E 3 4
R0-G5-6E 4 6
R0-G5-6E 5 6
R0-G5-6E 6 8
R0-G5-6E 7 6
R0-G5-6E 8 6
R0-G5-6E 9 8
R0-G5-6E 10 8
R0-G5-6F 1 4
R0-G5-6F 2 6
R0-G5-6F 3 6
R0-G5-6F 4 6
R0-G5-6F 5 8
R0-G5-6F 6 8
R0-G5-6F 7 8
R0-G5-6F 8 6
R0-G5-6F 9 8
R0-G5-6F 10 6
R0-G5-6G 1 6
R0-G5-6G 2 4
R0-G5-6G 3 6
R0-G5-6G 4 8
R0-G5-6G 5 6
R0-G5-6G 6 8
R0-G5-6G 7 8
R0-G5-6G 8 10
R0-G5-6G 9 2
R0-G5-6G 10 0
R0-G5-6H 1 10
R0-G5-6H 2 10
R0-G5-6H 3 4
R0-G5-6H 4 10
R0-G5-6H 5 6
R0-G5-6H 6 6
R0-G5-6H 7 8
R0-G5-6H 8 6
R0-G5-6H 9 10
R0-G5-6H 10 8
R0-G5-6I 1 6
R0-G5-6I 2 6
R0-G5-6I 3 8
R0-G5-6I 4 6
R0-G5-6I 5 6
R0-G5-6I 6 8
R0-G5-6I 7 8
R0-G5-6I 8 6
R0-G5-6I 9 8
R0-G5-6I 10 6
R0-G5-6J 1 8
R0-G5-6J 2 8
R0-G5-6J 3 8
R0-G5-6J 4 8
R0-G5-6J 5 8
R0-G5-6J 6 6
R0-G5-6J 7 6
R0-G5-6J 8 6
R0-G5-6J 9 8
R0-G5-6J 10 6
R0-G2-6 1 8
R0-G2-6 2 8
R0-G2-6 3 6
R0-G2-6 4 8
R0-G2-6 5 6
R0-G2-6 6 4
R0-G2-6 7 8
R0-G2-6 8 8
R0-G2-6 9 2
R0-G2-6 10 6
R0-G2-6A 1 6
R0-G2-6A 2 8
R0-G2-6A 3 8
R0-G2-6A 4 6
R0-G2-6A 5 8
R0-G2-6A 6 0
R0-G2-6A 7 8
R0-G2-6A 8 0
R0-G2-6A 9 0
R0-G2-6A 10 0
R0-G2-6B 1 6
R0-G2-6B 2 0
R0-G2-6B 3 0
R0-G2-6B 4 8
R0-G2-6B 5 6
R0-G2-6B 6 4
R0-G2-6B 7 4
R0-G2-6B 8 2
R0-G2-6B 9 6
R0-G2-6B 10 8
R0-G2-6C 1 8
R0-G2-6C 2 8
R0-G2-6C 3 8
R0-G2-6C 4 0
R0-G2-6C 5 0
R0-G2-6C 6 8
R0-G2-6C 7 0
R0-G2-6C 8 6
R0-G2-6C 9 6
R0-G2-6C 10 4
R0-G2-6D 1 2
R0-G2-6D 2 6
R0-G2-6D 3 4
R0-G2-6D 4 0
R0-G2-6D 5 2
R0-G2-6D 6 8
R0-G2-6D 7 0
R0-G2-6D 8 6
R0-G2-6D 9 6
R0-G2-6D 10 0
R0-G2-6E 1 0
R0-G2-6E 2 0
R0-G2-6E 3 8
R0-G2-6E 4 6
R0-G2-6E 5 6
R0-G2-6E 6 2
R0-G2-6E 7 8
R0-G2-6E 8 8
R0-G2-6E 9 4
R0-G2-6E 10 6
R0-G2-6F 1 2
R0-G2-6F 2 0
R0-G2-6F 3 8
R0-G2-6F 4 6
R0-G2-6F 5 6
R0-G2-6F 6 8
R0-G2-6F 7 6
R0-G2-6F 8 6
R0-G2-6F 9 8
R0-G2-6F 10 8
R0-G2-6G 1 0
R0-G2-6G 2 6
R0-G2-6G 3 0
R0-G2-6G 4 6
R0-G2-6G 5 2
R0-G2-6G 6 8
R0-G2-6G 7 6
R0-G2-6G 8 6
R0-G2-6G 9 6
R0-G2-6G 10 8
R0-G2-6H 1 6
R0-G2-6H 2 4
R0-G2-6H 3 8
R0-G2-6H 4 6
R0-G2-6H 5 2
R0-G2-6H 6 0
R0-G2-6H 7 8
R0-G2-6H 8 8
R0-G2-6H 9 8
R0-G2-6H 10 6
R0-G2-6I 1 4
R0-G2-6I 2 0
R0-G2-6I 3 0
R0-G2-6I 4 6
R0-G2-6I 5 6
R0-G2-6I 6 2
R0-G2-6I 7 6
R0-G2-6I 8 4
R0-G2-6I 9 8
R0-G2-6I 10 0
R0-G2-6J 1 6
R0-G2-6J 2 4
R0-G2-6J 3 6
R0-G2-6J 4 6
R0-G2-6J 5 6
R0-G2-6J 6 2
R0-G2-6J 7 6
R0-G2-6J 8 0
R0-G2-6J 9 2
R0-G2-6J 10 8
;
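/* Check that the data were read correctly */
proc print data=severity;
run;

/* Compare isolates, with rep entered as a random effect */
proc glimmix data=severity;
  class isolate rep;
  model disease = isolate;
  random rep;
  lsmeans isolate / lines;
run;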
```

Posted in reply to PaigeMiller

06-20-2017 12:56 PM

Thank you so much for the suggestions and the correction of the SAS code!

I do have the actual number of days recorded, so I can definitely try that for the analysis.

Thanks again! I really appreciate your great help!

Posted in reply to PaigeMiller

06-21-2017 10:30 AM

Hi PaigeMiller,

Sorry to keep bugging you. I came across this question when I was trying to do the analysis using the actual number of days instead of the assigned disease severity values.

The experiment ran for 28 days, and some of the plants did not show any symptoms by day 28. Theoretically, the number of days for the plants that did not show any symptoms (healthy plants) would be infinite. How should I deal with this kind of situation? I hope you can give me some tips.

Thank you!

Posted in reply to jjin0322

06-21-2017 01:02 PM

This is called right-censored data: the measurement stops at some time, but the event you'd like to observe hasn't occurred yet. Here is an example where right-censored data is analyzed
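A minimal sketch of what a right-censored analysis could look like in SAS, using the standard survival procedures PROC LIFETEST and PROC LIFEREG. The data set `onset` and the variables `days` and `censored` are hypothetical, since the actual day counts were not posted in this thread, and the Weibull distribution is just an illustrative choice.

```
/* Hypothetical data set ONSET, one row per plant:                     */
/*   isolate  = pathogenic isolate                                     */
/*   days     = day symptoms first appeared, or 28 if none appeared    */
/*   censored = 1 if no symptoms by day 28 (right-censored), else 0    */

/* Nonparametric comparison of time-to-onset curves across isolates */
proc lifetest data=onset;
  time days*censored(1);    /* the value 1 marks censored observations */
  strata isolate;           /* log-rank and Wilcoxon tests across isolates */
run;

/* Parametric alternative: regression for censored onset times */
proc lifereg data=onset;
  class isolate;
  model days*censored(1) = isolate / dist=weibull;
run;
```

In both procedures the healthy plants stay in the analysis with `days = 28` and are flagged as censored, rather than being dropped or given an arbitrary large value.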

--

Paige Miller

Posted in reply to PaigeMiller

06-21-2017 01:37 PM

Thank you so much!