Hi,
I am looking for a way to test whether there is a significant association (a p-value) between some variables in two of my datasets. I have attached the Excel file with the data.
So far I have tried doing this:
data twotrees (label="My 1st Dataset");
infile 'C:\Users\MajaThuren\OneDrive\Dokumenter\Skovbrugsvidenskab\Thesis\Data for r_datasset2.prn' FIRSTOBS=2;
input site $ subplot $ oneandtwo $ species $;
run;
proc print data=twotrees;
run;
proc contents data=twotrees;
run;
data thesis (label="My 2nd Dataset");
infile 'C:\Users\MajaThuren\OneDrive\Dokumenter\Skovbrugsvidenskab\Thesis\Data for r.prn' FIRSTOBS=2;
input site $ subplot $ number $ species $;
run;
proc print data=thesis;
run;
proc contents data=thesis;
run;
It works fine, but it isn't quite what I'm looking for.
I have laid out subplots in two sites. Each subplot contains an arbitrary number of trees (number), and this number varies between subplots. I have noted the species of each of these trees in each subplot.
In my other dataset I have noted which species the nearest two large trees are.
I suspect there is a connection between the species of the nearest large trees and the abundance of species that occur within the subplot. I would like to test this hypothesis.
I don't know how to mesh these two datasets, though. The main dataset has a widely varying number of trees in each subplot (there are 29 subplots in the non-intervention site and 15 in the planted site), along with each tree's species. The secondary dataset has the same number of subplots, but two entries for each subplot, each with its species.
I am looking for a p-value to determine whether there is any correspondence between the species of the nearest large trees and the abundance of that species in the associated subplot.
Any suggestions?
Best regards
Clarification requested: Correlation is a measure; a p-value is the result of a statistical test. If you want a p-value, what exact test do you want to run?
Many users here don't want to download Excel files because of the virus potential, and others have such files blocked by security software. Also, if you give us Excel we have to create a SAS data set ourselves, and because Excel places no constraints on what a cell may contain, the data set we end up with may not have the same variable types (numeric or character), or even the same values, as yours.
A basic start to compare things from two data sets is to combine the data and add a variable indicating which data set each record comes from. Then use that variable as an indicator in analysis.
An example:
data combined;
   set twotrees thesis indsname=indata;
   source = indata;
run;
The above uses the INDSNAME= option of the Set statement to create a temporary variable (not saved) named indata that holds the name of the data set each record comes from. The Source=indata; statement adds a variable named Source with the data set name, and that variable is kept.
I'm not looking at your Excel data for the reasons mentioned. However, from the code you show, you have no variables valid for calculating a correlation, as that requires numeric variables.
You might be able to do some categorical distribution analysis with a chi-square test using proc freq.
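As a rough sketch of that idea, assuming the combined data set and its Source variable from the example above, a cross-tabulation of Source by species with the CHISQ option asks whether the species distribution differs between the two data sets. This is only an illustration; with sparse cells the chi-square approximation may be unreliable, and it does not use the subplot pairing at all.

```sas
/* Sketch only: compare the species distribution between the two
   source data sets. Assumes COMBINED was built as shown above,
   with SOURCE holding the originating data set name. */
proc freq data=combined;
   tables source*species / chisq;
run;
```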
Example data is best provided as data step code posted in a text box on the forum.
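For illustration, a data step with inline data might look like the sketch below. The variable names follow the input statement in the question, and the four rows are copied from the excerpt of the secondary dataset posted in this thread; a real post would include enough rows to reproduce the problem.

```sas
/* Sketch only: example data supplied as data step code.
   All four variables are read as character, as in the
   original INPUT statement. */
data twotrees;
   input site $ subplot $ oneandtwo $ species $;
   datalines;
NI 1 1 PS
NI 1 2 PS
NI 2 1 PS
NI 2 2 PS
;
run;
```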
Hi @ballardw
It is a good question which test I want to run. So far I have been using PROC GLM to obtain p-values. I am happy to keep using it if it can be applied to the combined datasets, but really I will use whatever procedure works better.
About the Excel file, I understand. Here is an excerpt from my main dataset:
site | subplot | number | species |
NI | 1 | 1 | PA |
NI | 1 | 2 | PA |
NI | 1 | 3 | PA |
NI | 1 | 4 | PA |
NI | 1 | 5 | PS |
NI | 1 | 6 | L |
NI | 1 | 7 | PA |
NI | 1 | 8 | PA |
NI | 1 | 9 | PA |
NI | 1 | 10 | PA |
NI | 1 | 11 | PA |
NI | 1 | 12 | PA |
NI | 1 | 13 | AA |
NI | 1 | 14 | PA |
NI | 1 | 15 | PA |
NI | 1 | 16 | PA |
NI | 1 | 17 | PA |
NI | 1 | 18 | PA |
NI | 1 | 19 | PA |
NI | 2 | 1 | PA |
NI | 2 | 2 | PA |
NI | 2 | 3 | PA |
NI | 2 | 4 | PA |
NI | 2 | 5 | PA |
NI | 2 | 6 | PA |
NI | 2 | 7 | PA |
This dataset contains 1186 observations (29 subplots in my first site and 15 in the second site).
For each of these subplots I have noted in a separate dataset which species the two nearest large trees are.
This dataset looks like this:
site | subplot | oneandtwo | species |
NI | 1 | 1 | PS |
NI | 1 | 2 | PS |
NI | 2 | 1 | PS |
NI | 2 | 2 | PS |
NI | 3 | 1 | PS |
NI | 3 | 2 | PS |
NI | 4 | 1 | PS |
NI | 4 | 2 | PS |
NI | 5 | 1 | PS |
NI | 5 | 2 | PS |
NI | 6 | 1 | PS3 |
NI | 6 | 2 | PS |
NI | 7 | 1 | PS |
NI | 7 | 2 | PS |
NI | 8 | 1 | PA |
NI | 8 | 2 | PA |
NI | 9 | 1 | PS |
NI | 9 | 2 | PS |
NI | 10 | 1 | PS |
NI | 10 | 2 | PS |
NI | 11 | 1 | PS |
NI | 11 | 2 | PS |
NI | 12 | 1 | PS |
NI | 12 | 2 | PS |
NI | 13 | 1 | PS3 |
NI | 13 | 2 | L |
NI | 14 | 1 | PS |
NI | 14 | 2 | PS |
NI | 15 | 1 | PS |
NI | 15 | 2 | PS |
NI | 16 | 1 | PS |
NI | 16 | 2 | PS |
NI | 17 | 1 | PS |
NI | 17 | 2 | PS |
NI | 18 | 1 | PS |
NI | 18 | 2 | PS |
NI | 19 | 1 | L |
NI | 19 | 2 | PS |
NI | 20 | 1 | PS |
NI | 20 | 2 | PS |
NI | 21 | 1 | PS |
NI | 21 | 2 | PS |
NI | 22 | 1 | PS3 |
NI | 22 | 2 | PA |
NI | 23 | 1 | PS |
NI | 23 | 2 | PS |
NI | 24 | 1 | PS |
NI | 24 | 2 | PS |
NI | 25 | 1 | PS |
NI | 25 | 2 | PS3 |
NI | 26 | 1 | PS |
NI | 26 | 2 | PS |
NI | 27 | 1 | PS |
NI | 27 | 2 | PS |
NI | 28 | 1 | PS |
NI | 28 | 2 | PS |
NI | 29 | 1 | PS3 |
NI | 29 | 2 | PS |
P | 1 | 1 | PS |
P | 1 | 2 | PS |
P | 2 | 1 | L |
P | 2 | 2 | L |
P | 3 | 1 | PS |
P | 3 | 2 | PA |
P | 4 | 1 | PS |
P | 4 | 2 | PA |
P | 5 | 1 | PS |
P | 5 | 2 | PS |
P | 6 | 1 | L |
P | 6 | 2 | L |
P | 7 | 1 | L |
P | 7 | 2 | PS |
P | 10 | 1 | L |
P | 10 | 2 | L |
P | 12 | 1 | L |
P | 12 | 2 | PS |
P | 13 | 1 | PA |
P | 13 | 2 | PS |
P | 14 | 1 | L |
P | 14 | 2 | L |
P | 15 | 1 | L |
P | 15 | 2 | L |
P | 16 | 1 | L |
P | 16 | 2 | L |
P | 17 | 1 | PA |
P | 17 | 2 | PS |
P | 18 | 1 | L |
P | 18 | 2 | L |
So in subplot 1 most of the 19 trees are species PA, but species PS also occurs there (main dataset). The two nearest large trees to subplot 1 are both species PS. I want to prove (or disprove) that there is a connection between the dispersion of species within a subplot (the regeneration) and the species of the (seed-dispersing) large nearby trees.
I don't know how to mesh these two datasets in SAS, though. The main dataset has between 11 and 100 observations per subplot. The secondary dataset always has two observations for each subplot (the two trees). So that is my question. I hope I have made myself clear.
Best regards
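One possible way to mesh the two datasets, sketched here under several assumptions, is to reduce both to one row per subplot and join them on site and subplot. The sketch uses the data set names thesis (main) and twotrees (secondary) from the first post, picks PS as the focal species purely for illustration, and treats the code PS3 as a different species from PS. The table names regen, large and meshed, and the variables n_total, n_ps and large_ps, are hypothetical. The PROC LOGISTIC step is just one candidate test, not something proposed in the thread.

```sas
proc sql;
   /* Main dataset: trees per subplot and how many are PS.
      In SAS SQL, (species = 'PS') evaluates to 1 or 0. */
   create table regen as
   select site, subplot,
          count(*)            as n_total,
          sum(species = 'PS') as n_ps
   from thesis
   group by site, subplot;

   /* Secondary dataset: how many of the two nearest
      large trees are PS (0, 1 or 2). */
   create table large as
   select site, subplot,
          sum(species = 'PS') as large_ps
   from twotrees
   group by site, subplot;

   /* One row per subplot present in both tables. */
   create table meshed as
   select r.site, r.subplot, r.n_total, r.n_ps, l.large_ps
   from regen as r
        inner join large as l
        on r.site = l.site and r.subplot = l.subplot;
quit;

/* Illustrative test: is the proportion of PS regeneration
   related to the number of nearby large PS trees? */
proc logistic data=meshed;
   class site;
   model n_ps / n_total = large_ps site;
run;
```

The same pattern could be repeated for the other species codes. Whether a binomial model of this kind suits the data (for example, if trees within a subplot are not independent) is a separate question from the merge itself.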