Lysegroentblad
Obsidian | Level 7

Hi,

I am looking for a way to test whether there is a significant correlation (a p-value) between some variables in two of my datasets. I have attached the Excel file with the data.

So far I have tried doing this:

data twotrees (label="My 1st Dataset");
    infile 'C:\Users\MajaThuren\OneDrive\Dokumenter\Skovbrugsvidenskab\Thesis\Data for r_datasset2.prn' FIRSTOBS=2;
    input site $ subplot $ oneandtwo $ species $;
run;
 
proc print data=twotrees;
run;
 
proc contents data=twotrees;
run;
 
data thesis (label="My 2nd Dataset");
    infile 'C:\Users\MajaThuren\OneDrive\Dokumenter\Skovbrugsvidenskab\Thesis\Data for r.prn' FIRSTOBS=2;
    input site $ subplot $ number $ species $;
run;
 
proc print data=thesis;
run;
 
proc contents data=thesis;
run;

It works fine, but it isn't quite what I'm looking for. 

I have laid out subplots at two sites. Each subplot contains an arbitrary number of trees (number), and this number varies between subplots. I have recorded the species of every tree in each subplot.

In my other dataset I have recorded the species of the two nearest large trees for each subplot.
I suspect there is a connection between the species of the nearest large trees and the abundance of the species that occur within the subplot. I would like to test this hypothesis.

I don't know how to mesh these two datasets, though. The main dataset has a widely varying number of trees in each subplot (there are 29 subplots in the non-intervention site and 15 in the planted site), along with each tree's species. The secondary dataset has the same number of subplots, but exactly two entries per subplot, each with its species.

I am looking for a p-value to determine whether there is any correspondence between the species of the nearest large trees and the abundance of that species in the associated subplot.

 

Any suggestions?

 

Best regards

2 REPLIES
ballardw
Super User

Clarification requested: correlation is a measure, while a p-value is the result of a statistical test. If you want a p-value, exactly which test do you want to run?

 

Many users here don't want to download Excel files because of the virus risk, and others have such downloads blocked by security software. Also, if you give us Excel, we have to create a SAS data set ourselves, and because Excel places no constraints on what a cell may hold, the result may not have the same variable types (numeric or character), or even the same values, as yours.

 

A basic start to compare things from two data sets is to combine the data and add a variable indicating which data set each record comes from. Then use that variable as an indicator in analysis.

An example:

data combined;
   set twotrees
       thesis
       indsname=indata
   ;
   source=indata;
run;

The above uses the INDSNAME= option of the SET statement to create a temporary variable (not saved) named indata that holds the name of the data set each record comes from, in the form WORK.TWOTREES. The statement source=indata; adds a variable named Source that holds the data set name and is kept.

 

I'm not looking at your Excel data, for the reasons mentioned. However, the code you show has no variables that are valid for calculating a correlation, since that requires numeric variables.

You might be able to do some categorical distribution analysis with a chi-square test using proc freq.
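
A minimal sketch of that idea, assuming the two data sets are read as in the question (twotrees with site, subplot, oneandtwo, species; thesis with site, subplot, number, species) and that site and subplot are stored the same way in both. The names nearest, trees, large1 and large2 are made up for illustration. Trees within one subplot are not independent observations and some cells will be sparse, so treat the resulting p-value as exploratory.

```sas
/* Put the two nearest large trees of each subplot on one row. */
proc sort data=twotrees;
   by site subplot oneandtwo;
run;

proc transpose data=twotrees out=nearest(drop=_name_) prefix=large;
   by site subplot;
   id oneandtwo;      /* values 1 and 2 become variables large1 and large2 */
   var species;
run;

/* Attach the nearest-large-tree species to every tree in the subplot. */
proc sort data=thesis;
   by site subplot;
run;

data trees;
   merge thesis(in=a) nearest(in=b);
   by site subplot;
   if a and b;        /* keep only subplots present in both data sets */
run;

/* Cross-tabulate tree species against the first nearest large tree. */
proc freq data=trees;
   tables large1*species / chisq;
run;
```

The same TABLES statement can be repeated with large2, or with a BY site; statement (after sorting by site) to look at each site separately.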

 

Example data is best provided as data step code posted in a text box on the forum.
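
As a sketch of what that looks like, with a handful of illustrative rows (not the real data) and variable names taken from the INPUT statement in the question:

```sas
/* Illustrative rows only; subplot and number are read as numeric here. */
data thesis;
   input site $ subplot number species $;
   datalines;
NI 1 1 PA
NI 1 2 PA
NI 1 3 PS
NI 2 1 PA
;
run;
```

Anyone can paste and run a step like this, and the variable types are then exactly the ones you intended.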

 

 

 

Lysegroentblad
Obsidian | Level 7

Hi @ballardw 

 

What test I want to run is a good question. So far I have been using PROC GLM to obtain p-values. I am happy to keep using it if it can be applied to the combined datasets, but really I will use whichever procedure works best.

 

About the Excel file, I understand. Here is an excerpt from my main dataset:

site  subplot  number  species
NI    1        1       PA
NI    1        2       PA
NI    1        3       PA
NI    1        4       PA
NI    1        5       PS
NI    1        6       L
NI    1        7       PA
NI    1        8       PA
NI    1        9       PA
NI    1        10      PA
NI    1        11      PA
NI    1        12      PA
NI    1        13      AA
NI    1        14      PA
NI    1        15      PA
NI    1        16      PA
NI    1        17      PA
NI    1        18      PA
NI    1        19      PA
NI    2        1       PA
NI    2        2       PA
NI    2        3       PA
NI    2        4       PA
NI    2        5       PA
NI    2        6       PA
NI    2        7       PA

This dataset contains 1186 observations (29 subplots in my first site and 15 in the second site).

For each of these subplots I have noted in a separate dataset which species the two nearest large trees are.

This dataset looks like this:

site  subplot  oneandtwo  species
NI    1        1          PS
NI    1        2          PS
NI    2        1          PS
NI    2        2          PS
NI    3        1          PS
NI    3        2          PS
NI    4        1          PS
NI    4        2          PS
NI    5        1          PS
NI    5        2          PS
NI    6        1          PS3
NI    6        2          PS
NI    7        1          PS
NI    7        2          PS
NI    8        1          PA
NI    8        2          PA
NI    9        1          PS
NI    9        2          PS
NI    10       1          PS
NI    10       2          PS
NI    11       1          PS
NI    11       2          PS
NI    12       1          PS
NI    12       2          PS
NI    13       1          PS3
NI    13       2          L
NI    14       1          PS
NI    14       2          PS
NI    15       1          PS
NI    15       2          PS
NI    16       1          PS
NI    16       2          PS
NI    17       1          PS
NI    17       2          PS
NI    18       1          PS
NI    18       2          PS
NI    19       1          L
NI    19       2          PS
NI    20       1          PS
NI    20       2          PS
NI    21       1          PS
NI    21       2          PS
NI    22       1          PS3
NI    22       2          PA
NI    23       1          PS
NI    23       2          PS
NI    24       1          PS
NI    24       2          PS
NI    25       1          PS
NI    25       2          PS3
NI    26       1          PS
NI    26       2          PS
NI    27       1          PS
NI    27       2          PS
NI    28       1          PS
NI    28       2          PS
NI    29       1          PS3
NI    29       2          PS
P     1        1          PS
P     1        2          PS
P     2        1          L
P     2        2          L
P     3        1          PS
P     3        2          PA
P     4        1          PS
P     4        2          PA
P     5        1          PS
P     5        2          PS
P     6        1          L
P     6        2          L
P     7        1          L
P     7        2          PS
P     10       1          L
P     10       2          L
P     12       1          L
P     12       2          PS
P     13       1          PA
P     13       2          PS
P     14       1          L
P     14       2          L
P     15       1          L
P     15       2          L
P     16       1          L
P     16       2          L
P     17       1          PA
P     17       2          PS
P     18       1          L
P     18       2          L

So in subplot 1, most of the 19 trees (main dataset) are species PA, but species PS also occurs. The two nearest large trees to subplot 1 are both species PS. I want to prove (or disprove) that there is a connection between the dispersion of species within a subplot (the regeneration) and the species of the (seed-dispersing) large nearby trees.

I don't know how to mesh these two datasets in SAS, though. The main dataset has between 11 and 100 observations per subplot. The secondary dataset always has two observations per subplot (the two trees). So that is my question. I hope I have made myself clear.

 

Best regards

 
