Comparing Datasets (Performance)

mahler_ji · Posted 07-15-2014 02:34 PM

Hey Guys,

I have a question about comparing datasets and discerning whether or not they are "different".

I have two different metrics by which I categorize firms into buckets. I have results over time for both metrics. How can I compare them to each other and see how related they are?

I am looking for better metrics to compare them other than their average performance, correlation and volatility.

What procedures could I use to obtain this information? I have some experience with proc reg, but is that the right one here?

Thanks,

John

mahler_ji · Posted 07-15-2014 05:34 PM

I am digging way back into my statistics classes, but isn't there a way to figure out whether two datasets come from the same larger dataset? Can't I use that idea here, where I can compare the two results and if they are sufficiently similar then the test would return that they are from the same dataset?

I think that was called a t-stat and a p-value?

SASKiwi · Posted 07-15-2014 05:37 PM

Have you looked at PROC COMPARE? This enables you to compare every row (by row number or key variable(s)) and value to see if they are identical.

If it is more the case of comparing distributions of numeric variables then I suggest you look at PROC UNIVARIATE.
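
A minimal sketch of both suggestions, in case it helps. The dataset names (work.metric_a, work.metric_b) and the variable names (date, value) are made up for illustration, since the thread does not show the actual layout:

```sas
/* Hypothetical inputs: work.metric_a and work.metric_b, each with a
   date key and one numeric variable named value. */
proc sort data=work.metric_a; by date; run;
proc sort data=work.metric_b; by date; run;

/* Value-by-value comparison, matching observations on date
   instead of on row number */
proc compare base=work.metric_a compare=work.metric_b;
    id date;
    var value;
run;

/* Distribution summary for each dataset; the histogram overlays
   a fitted normal curve for a quick visual check */
proc univariate data=work.metric_a;
    var value;
    histogram value / normal;
run;

proc univariate data=work.metric_b;
    var value;
    histogram value / normal;
run;
```

PROC COMPARE answers "are these values identical?", while PROC UNIVARIATE describes the shape of each distribution separately, so the two steps address different questions.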

mahler_ji · Posted 07-15-2014 05:41 PM

Yes, it is more comparing distributions or, more specifically, comparing values over time. Do the values rise together and fall together at the same times? Do the values follow the same pattern?

I hope that helps!

John

Reeza · Posted 07-15-2014 05:51 PM

It depends on what the values are; more info would help.

The first thing to do would be to plot them and compare visually: a scatterplot of one against the other and, since time is a factor as you've indicated in your most recent post, a plot of both over time.
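
Something along these lines would produce both plots. This is only a sketch: it assumes a single hypothetical dataset, work.metrics, with one row per date and the two metrics in numeric variables metric_a and metric_b.

```sas
/* Hypothetical input: work.metrics with variables date, metric_a,
   and metric_b (names assumed for illustration). */

/* Scatterplot of one metric against the other */
proc sgplot data=work.metrics;
    scatter x=metric_a y=metric_b;
run;

/* Both metrics over time on the same axes */
proc sgplot data=work.metrics;
    series x=date y=metric_a;
    series x=date y=metric_b;
    xaxis label="Date";
    yaxis label="Value";
run;
```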

You can do a t-test if you can assume a normal distribution or if your sample is large enough. There are also non-parametric tests you can use, via proc npar1way.
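
As a rough illustration of the two options, assuming the same hypothetical work.metrics layout (date, metric_a, metric_b). The paired form of the t-test is an assumption here, on the basis that each date carries one value of each metric:

```sas
/* Paired t-test: tests whether the mean of metric_a - metric_b is 0 */
proc ttest data=work.metrics;
    paired metric_a*metric_b;
run;

/* proc npar1way needs one value column plus a class variable,
   so stack the two metrics into long form first */
data work.metrics_long;
    set work.metrics;
    length metric $8;
    metric = "A"; value = metric_a; output;
    metric = "B"; value = metric_b; output;
    keep date metric value;
run;

/* Wilcoxon rank-sum test comparing the two groups */
proc npar1way data=work.metrics_long wilcoxon;
    class metric;
    var value;
run;
```

Note that the Wilcoxon rank-sum test treats the two groups as independent samples, so it ignores the pairing by date.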

Since you don't have any other factors, a regression doesn't seem useful.

mahler_ji · Posted 07-15-2014 06:02 PM

Hey,

I think that I can assume that I have a normal distribution, and I have a lot of data points in each dataset. I just did some googling and it looks like there is a proc ttest. What sort of things can this tell me?

I have attached a .csv with a sample of the datasets that I am working with. When you plot them over time, it is clear that they are very highly correlated with each other. I am just looking for more concrete ways to quantify that other than a correlation value and some averages.

Thanks for your help!

John

Reeza · Posted 07-15-2014 06:32 PM

I'd difference the two series, check whether the mean difference is 0, and establish a confidence interval around it.

That being said, I'm going to assume that these two indicators are not calculated independently of one another, so the independence assumption behind most statistical tests isn't met.
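
The differencing idea could be sketched like this, again assuming a hypothetical work.metrics dataset with date, metric_a, and metric_b:

```sas
/* Build the difference series (variable names are assumed) */
data work.diff;
    set work.metrics;
    diff = metric_a - metric_b;
run;

/* One-sample t-test of H0: mean(diff) = 0; the output includes
   a 95% confidence interval for the mean difference */
proc ttest data=work.diff h0=0 alpha=0.05;
    var diff;
run;

/* The same confidence limits, without the test, from proc means */
proc means data=work.diff n mean std clm alpha=0.05;
    var diff;
run;
```

The interval here assumes the differences are independent of each other. With values recorded over time that may not hold, so the interval is better treated as a rough guide.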