Hey Guys,
I have a question about comparing datasets and discerning whether or not they are "different".
I have two different metrics by which I categorize firms into buckets. I have results over time for both metrics. How can I compare them to each other and see how related they are?
I am looking for better metrics to compare them other than their average performance, correlation and volatility.
What procedures could I use to obtain this information? I have some experience with proc reg, but is that the right one here?
Thanks,
John
I am digging way back into my statistics classes, but isn't there a way to figure out whether two datasets come from the same larger dataset? Can't I use that idea here, where I can compare the two results and if they are sufficiently similar then the test would return that they are from the same dataset?
I think that was called a t-stat and a p-value?
Have you looked at PROC COMPARE? It lets you compare every row (matched by row number or by key variables) and every value to see whether they are identical.
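As a rough sketch of what that could look like: the dataset names (work.metric_a, work.metric_b), the variable names, and the values below are all made up for illustration. It assumes both datasets have one row per date and are sorted by date.

```sas
/* Hypothetical data: one row per date for each metric (values are made up) */
data work.metric_a;
    input date :yymmdd10. value;
    format date yymmdd10.;
    datalines;
2020-01-31 1.20
2020-02-29 1.50
2020-03-31 1.10
;
run;

data work.metric_b;
    input date :yymmdd10. value;
    format date yymmdd10.;
    datalines;
2020-01-31 1.20
2020-02-29 1.40
2020-03-31 1.10
;
run;

/* Match rows on the key variable (date) and report any values that differ */
proc compare base=work.metric_a compare=work.metric_b;
    id date;
run;
```

Keep in mind that this checks for equality of values, not for statistical similarity.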
If it is more a case of comparing the distributions of numeric variables, then I suggest you look at PROC UNIVARIATE.
Yes, it is more about comparing distributions or, more specifically, comparing values over time. Do the values rise and fall together at the same times? Do the values follow the same pattern?
I hope that helps!
John
It depends on what the values are; more info would help.
The first thing to do would be to plot them and compare them visually: a scatterplot of one metric against the other and, since time is a factor as you've indicated in your most recent post, a plot of both metrics over time.
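Something along these lines might be a starting point for the plots. The dataset work.metrics, the variables date, metric_a, and metric_b, and the values are hypothetical placeholders; swap in your own.

```sas
/* Hypothetical data: both metrics side by side, one row per date (values are made up) */
data work.metrics;
    input date :yymmdd10. metric_a metric_b;
    format date yymmdd10.;
    datalines;
2020-01-31 1.20 1.25
2020-02-29 1.50 1.40
2020-03-31 1.10 1.15
2020-04-30 1.80 1.70
2020-05-31 1.60 1.65
;
run;

/* Both metrics over time on the same axes */
proc sgplot data=work.metrics;
    series x=date y=metric_a;
    series x=date y=metric_b;
run;

/* One metric against the other */
proc sgplot data=work.metrics;
    scatter x=metric_a y=metric_b;
run;
```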
You can do a t-test if you can assume a normal distribution or if your sample is large enough. There are also non-parametric tests you can use, via proc npar1way.
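A minimal sketch of both options, assuming a hypothetical dataset work.metrics with made-up values. I've used the PAIRED statement for the t-test because the two metrics are observed on the same dates; that is my assumption about your data, not something you have confirmed. Note that proc npar1way needs the data stacked with a class variable, and it treats the two groups as independent samples.

```sas
/* Hypothetical data: both metrics side by side, one row per date (values are made up) */
data work.metrics;
    input date :yymmdd10. metric_a metric_b;
    format date yymmdd10.;
    datalines;
2020-01-31 1.20 1.25
2020-02-29 1.50 1.40
2020-03-31 1.10 1.15
2020-04-30 1.80 1.70
2020-05-31 1.60 1.65
;
run;

/* Paired t-test: is the mean of (metric_a - metric_b) different from 0? */
proc ttest data=work.metrics;
    paired metric_a*metric_b;
run;

/* Stack the two metrics into one column with a grouping variable */
data work.metrics_long;
    set work.metrics;
    length metric $1;
    metric = 'A'; value = metric_a; output;
    metric = 'B'; value = metric_b; output;
    keep date metric value;
run;

/* Wilcoxon rank-sum test comparing the two groups */
proc npar1way data=work.metrics_long wilcoxon;
    class metric;
    var value;
run;
```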
Since you don't have any other factors, a regression doesn't seem useful.
I think I can assume that I have a normal distribution, and I have a lot of data points in each dataset. I just did some googling and it looks like there is a proc ttest. What sort of things can it tell me?
I have attached a .csv with a sample of the datasets that I am working with. When you plot them over time, it is clear that they are very highly correlated with each other. I am just looking for more concrete ways to quantify that other than a correlation value and some averages.
Thanks for your help!
John
I'd difference the two series, check whether the mean difference is 0, and establish a confidence interval around that difference.
That being said, I'm going to assume that the two indicators are not calculated independently of one another, so the independence assumption behind most statistical tests is not met.
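Roughly, the differencing could be sketched like this. The dataset work.metrics, its variables, and the values are hypothetical stand-ins for your data.

```sas
/* Hypothetical data: both metrics side by side, one row per date (values are made up) */
data work.metrics;
    input date :yymmdd10. metric_a metric_b;
    format date yymmdd10.;
    datalines;
2020-01-31 1.20 1.25
2020-02-29 1.50 1.40
2020-03-31 1.10 1.15
2020-04-30 1.80 1.70
2020-05-31 1.60 1.65
;
run;

/* Difference the two series row by row */
data work.diff;
    set work.metrics;
    diff = metric_a - metric_b;
run;

/* Mean difference, 95% confidence limits, and a t-test of mean = 0 */
proc means data=work.diff n mean stderr clm t probt alpha=0.05;
    var diff;
run;
```

If the confidence interval for the mean of diff covers 0, the data are consistent with no average difference between the two metrics.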
Try using a DATA step with an UPDATE statement.