03-09-2016 01:23 PM - edited 03-09-2016 01:25 PM
I am currently working with a data set which has many 'true' zero values and cannot be normalized. I have opted to use Spearman's rho correlation coefficient to determine the relationship between my variables. This has worked very well. However, I am now looking for a non-parametric coefficient of determination (r-square) so that I can discuss the amount of shared variation. I have come across some conflicting information online (go figure) and am new to SAS.
I am using SAS v9.4.
My data are proportional 16S sequence values. I have 156 libraries, each containing ~70k sequences.
Thank you for your help.
03-09-2016 02:53 PM
You will have to provide some more details of what you are looking for. Such as why is the number of libraries mentioned? Are your "sequences" actually data sets?
And what are you comparing to get the r-square?
03-09-2016 05:32 PM
RE: You will have to provide some more details of what you are looking for. Such as why is the number of libraries mentioned? Are your "sequences" actually data sets?
And what are you comparing to get the r-square?
I apologize for the lack of information; I was unsure how much detail to give.
My data consist of two parts. The first is bacterial in nature (i.e. the sequence information) and the second is physiological (e.g. mortality and developmental status).
I am looking to draw correlations between the presence/absence of specific bacteria and the rate of mortality and development. With honey bees there are only a handful (five) of core bacteria, of which several have been shown to form epithelial biofilms and others have been shown to cause a melanization of said cells.
Although the data cannot be normalized for Pearson's correlations, I would like to see how the abundance of a given bacterium correlates with the physiological parameters I have data for. Thus I turned to Spearman's rho. This statistic, however, does not allow me to gauge how much of the variation in Y (the physiological parameter) is explained by X (the bacterial parameter).
I would like to be able to talk about the extent (r^2 or an alternative) to which the bacterial abundance explains the physiological parameter it is being compared to.
I must say here that most of the Spearman correlations are monotonic in nature; however, the relationships of interest to me are more linear in nature (goodness of fit tests before and after transformations < 0.03). That is, the data cannot be transformed to meet the assumption of normality, but graphically speaking they have a linear relationship. It would be of benefit to me to be able to place a value on the amount of variation explained.
For example, I have data about a bacterium, found in my study in low and high abundance between different treatments, which has been shown to form a close relationship with the host epithelial cells (biofilm), and I would like to show its correlation to mortality and development at the different abundance levels.
I hope this helps a bit. If not please let me know.
Again, I am looking for a single value such as the r^2 to show how much of the variation in Y can be explained by X in a nearly linear relationship which is not able to be normalized.
From what I have read (limited in most respects), if the relationship is nearly linear one could use Spearman's rho for the p-value and Pearson's r^2 for the explanation of shared variance.
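To make the two quantities concrete, here is a minimal sketch in Python (not SAS, and not a recommendation of method). It relies on the fact that Spearman's rho is Pearson's r computed on average ranks, so rho^2 is the share of variance explained among the ranks, not among the raw values. The function names and the sample numbers are hypothetical and made up for illustration; they are not data from my study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only (note the tied 'true' zeros).
abundance = [0.0, 0.0, 0.02, 0.10, 0.15, 0.40]
mortality = [2, 3, 3, 5, 8, 12]

rho = spearman_rho(abundance, mortality)
r = pearson_r(abundance, mortality)
print("Spearman rho:", rho, "rho^2 (variance shared by ranks):", rho ** 2)
print("Pearson r:", r, "r^2 (variance shared by raw values):", r ** 2)
```

The two squared values answer different questions, so whichever one is reported should be labelled as rank-based or raw-value-based.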
Again, any help would be greatly appreciated.
03-09-2016 06:00 PM
Where does the multiplicity of libraries come in?
Generally, any of the procedures that generate statistics require a single input data set, which is why I am confused about the mention of all those libraries.
03-09-2016 06:10 PM
RE: "Where does the multiplicity of libraries come in?
Generally, any of the procedures that generate statistics require a single input data set, which is why I am confused about the mention of all those libraries."
The experiment is as follows:
Four treatment groups: Fresh Pollen, Fresh Supplement, Old Pollen, Old Supplement
Each treatment group had 11 cages of bees (50 newly emerged bees per cage)
Within each treatment group we dissected four tissues (hypo glands, mouth parts, ileum and rectum) from 3 individuals per cage and pooled them for sequencing.
For each of the cages we monitored mortality, thorax development and hypo gland development.
What I am interested in is the tissue as a whole.
So how do the bacterial reads (abundance), regardless of treatment, correlate to the mortality, thorax development and hypo gland development, and once a significant correlation is found, how do I determine shared variance?