mahler_ji
Obsidian | Level 7

Hey Guys,

I have a question about comparing datasets and discerning whether or not they are "different".

I have two different metrics by which I categorize firms into buckets.  I have results over time for both metrics.  How can I compare them to each other and see how related they are?

I am looking for better ways to compare them than their average performance, correlation and volatility.

What procedures could I use to obtain this information?  I have some experience with proc reg, but is that the right one here?

Thanks,

John

7 REPLIES
mahler_ji
Obsidian | Level 7

I am digging way back into my statistics classes, but isn't there a way to figure out whether two datasets come from the same larger dataset?  Can't I use that idea here, where I can compare the two results and if they are sufficiently similar then the test would return that they are from the same dataset?

I think that was called a t-stat and a p-value?

SASKiwi
PROC Star

Have you looked at PROC COMPARE? It lets you compare every row (matched by row number or by key variable(s)) and every value to see if they are identical.

If it is more the case of comparing distributions of numeric variables then I suggest you look at PROC UNIVARIATE.
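To make the two suggestions concrete, here is a minimal sketch. The data set names (work.metric_a, work.metric_b) and the variables DATE and VALUE are assumptions for illustration only; they are not taken from the original post.

```sas
/* Hypothetical inputs: work.metric_a and work.metric_b, each with a  */
/* key variable DATE and a numeric variable VALUE (assumed names).    */
proc sort data=work.metric_a; by date; run;
proc sort data=work.metric_b; by date; run;

/* Value-by-value comparison, matching rows on DATE */
proc compare base=work.metric_a compare=work.metric_b;
   id date;
   var value;
run;

/* Distribution summary for one metric: moments, quantiles, and a    */
/* histogram with a fitted normal curve. Repeat for work.metric_b.   */
proc univariate data=work.metric_a;
   var value;
   histogram value / normal;
run;
```

PROC COMPARE answers "are the values identical?", while PROC UNIVARIATE describes the shape of each distribution, so which one helps depends on which question you are asking.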

mahler_ji
Obsidian | Level 7

Yes, it is more comparing distributions or, more specifically, comparing values over time.  Do the values rise and fall together at the same times?  Do the values follow the same pattern?

I hope that helps!

John

Reeza
Super User

It depends on what the values are; more info would help.

The first thing to do would be to plot them and compare visually: a scatterplot of one metric against the other and, since time is a factor as you've indicated in your most recent post, both series plotted against time.

You can do a t-test if you can assume a normal distribution or if your sample is large enough. There are also non-parametric tests you can use, via proc npar1way.

Since you don't have any other factors, a regression doesn't seem useful.
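A rough sketch of those steps, assuming a hypothetical wide data set work.metrics with variables DATE, METRIC_A and METRIC_B (none of these names come from the thread). The two-sample tests below treat the two metrics as independent groups, which may not hold for this data.

```sas
/* Overlay both series against time (assumed variable names) */
proc sgplot data=work.metrics;
   series x=date y=metric_a;
   series x=date y=metric_b;
run;

/* Scatterplot of one metric against the other */
proc sgplot data=work.metrics;
   scatter x=metric_a y=metric_b;
run;

/* Stack to long form so the metric can be used as a CLASS variable */
data work.metrics_long;
   set work.metrics;
   length metric $8;
   metric = 'A'; value = metric_a; output;
   metric = 'B'; value = metric_b; output;
   keep date metric value;
run;

/* Two-sample t-test of the means */
proc ttest data=work.metrics_long;
   class metric;
   var value;
run;

/* Non-parametric alternative: Wilcoxon rank-sum test */
proc npar1way data=work.metrics_long wilcoxon;
   class metric;
   var value;
run;
```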

mahler_ji
Obsidian | Level 7

Hey Reeza,

I think that I can assume that I have a normal distribution, and I have a lot of data points in each dataset.  I just did some googling and it looks like there is a proc ttest.  What sort of things can this tell me?

I have attached a .csv with a sample of the datasets that I am working with.  When you plot them over time, it is clear that they are very highly correlated with each other.  I am just looking for more concrete ways to quantify that other than a correlation value and some averages.

Thanks for your help!

John

Reeza
Super User

I'd difference the two series and look to see if the difference was 0 and establish a confidence interval around the difference.

That being said, I'm going to assume that the ways these two indicators are calculated are not independent of one another, so the independence assumption behind most statistical tests is not met.
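One way the differencing idea could look in code, again assuming a hypothetical data set work.metrics with METRIC_A and METRIC_B observed on the same dates. Working with the paired differences sidesteps the dependence between the two metrics, although values observed over time may still be autocorrelated, which this sketch does not address.

```sas
/* Difference the two series row by row (assumed variable names) */
data work.diff;
   set work.metrics;
   diff = metric_a - metric_b;
run;

/* One-sample t-test of H0: mean difference = 0; the output includes */
/* a 95% confidence interval for the mean difference                 */
proc ttest data=work.diff h0=0 alpha=0.05;
   var diff;
run;

/* Equivalent paired form, without building DIFF by hand */
proc ttest data=work.metrics;
   paired metric_a*metric_b;
run;
```

If the confidence interval for the mean difference comfortably contains 0, the two metrics are not distinguishable on average at that level; it says nothing about whether they move together, which the plots and the correlation already cover.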
