Help using Base SAS procedures

Comparing Datasets (Performance)

Frequent Contributor
Posts: 101

Hey Guys,

I have a question about comparing datasets and discerning whether or not they are "different".

I have two different metrics by which I categorize firms into buckets.  I have results over time for both metrics.  How can I compare them to each other and see how related they are?

I am looking for better metrics to compare them other than their average performance, correlation and volatility.

What procedures could I use to obtain this information?  I have some experience with proc reg, but is that the right one here?

Thanks,

John

Frequent Contributor
Posts: 101

Re: Comparing Datasets (Performance)

I am digging way back into my statistics classes, but isn't there a way to figure out whether two datasets come from the same larger dataset?  Can't I use that idea here, where I can compare the two results and if they are sufficiently similar then the test would return that they are from the same dataset?

I think that was called a t-stat and a p-value?

Super User
Posts: 3,101

Re: Comparing Datasets (Performance)

Have you looked at PROC COMPARE? It lets you compare every row (matched by row number or by key variable(s)) and every value to see if they are identical.

If it is more the case of comparing distributions of numeric variables then I suggest you look at PROC UNIVARIATE.
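A minimal sketch of both suggestions, in case it helps. The dataset names (work.metric_a, work.metric_b) and the variables date and value are hypothetical, since the thread does not show the actual data layout; adjust them to your own tables.

```sas
/* Assumption: work.metric_a and work.metric_b each hold one row per date
   with a numeric variable named value. These names are placeholders. */
proc sort data=work.metric_a; by date; run;
proc sort data=work.metric_b; by date; run;

/* Row-by-row comparison, matching observations on the key variable date.
   METHOD=ABSOLUTE with CRITERION= treats values within 0.00001 as equal. */
proc compare base=work.metric_a compare=work.metric_b
             method=absolute criterion=0.00001 listall;
  id date;
  var value;
run;

/* Stack the two datasets so their distributions can be summarized by source. */
data work.both;
  length source $1;
  set work.metric_a(in=from_a) work.metric_b;
  if from_a then source = 'A';
  else source = 'B';
run;

/* Descriptive statistics and a histogram with a fitted normal curve per source. */
proc univariate data=work.both;
  class source;
  var value;
  histogram value / normal;
run;
```

PROC COMPARE answers "are the values identical (within a tolerance)?", while PROC UNIVARIATE describes the shape of each distribution, so which one is useful depends on which question you are asking.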

Frequent Contributor
Posts: 101

Re: Comparing Datasets (Performance)

Yes, it is more comparing distributions or, more specifically, comparing values over time.  Do the values rise together and fall together at the same times?  Do the values follow the same pattern?

I hope that helps!

John

Super User
Posts: 17,775

Re: Comparing Datasets (Performance)

It depends on what the values are; more info would help.

The first thing to do would be to plot them and compare visually: a scatterplot of one metric against the other and, since time is a factor as you've indicated in your most recent post, a plot of both metrics by time.

You can do a t-test if you can assume a normal distribution or if your sample is large enough. There are also non-parametric tests you can use, via proc npar1way.

Since you don't have any other factors, a regression doesn't seem useful.
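As a rough sketch of the plotting and testing steps above: the dataset work.metrics and its variables date, metric_a and metric_b are hypothetical names, assuming one row per date with both metrics side by side.

```sas
/* Assumption: work.metrics has one row per date with numeric variables
   metric_a and metric_b. All names here are placeholders. */

/* Both metrics over time on the same axes. */
proc sgplot data=work.metrics;
  series x=date y=metric_a;
  series x=date y=metric_b;
run;

/* One metric against the other. */
proc sgplot data=work.metrics;
  scatter x=metric_a y=metric_b;
run;

/* Reshape to one row per date and metric so a CLASS variable identifies the group. */
data work.long;
  set work.metrics;
  length source $8;
  source = 'metric_a'; value = metric_a; output;
  source = 'metric_b'; value = metric_b; output;
  keep date source value;
run;

/* Two-sample t-test on the group means. */
proc ttest data=work.long;
  class source;
  var value;
run;

/* Non-parametric alternative: Wilcoxon rank-sum test. */
proc npar1way data=work.long wilcoxon;
  class source;
  var value;
run;
```

Both tests here treat the two groups as independent samples, which may not hold if the metrics are measured on the same firms and dates.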

Frequent Contributor
Posts: 101

Re: Comparing Datasets (Performance)

Hey,

I think that I can assume that I have a normal distribution, and I have a lot of data points in each dataset.  I just did some googling and it looks like there is a proc ttest.  What sort of things can this tell me?

I have attached a .csv with a sample of the datasets that I am working with.  When you plot them over time, it is clear that they are very highly correlated with each other.  I am just looking for more concrete ways to quantify that other than a correlation value and some averages.

Thanks for your help!

John

Super User
Posts: 17,775

Re: Comparing Datasets (Performance)

I'd difference the two series, establish a confidence interval around the mean difference, and check whether that interval contains 0.

That being said, I'm going to assume that the ways these two indicators are calculated are not independent of one another, so the independence assumption behind most statistical tests isn't met.
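One way the differencing idea might look in code, as a sketch only. It assumes a hypothetical dataset work.metrics with one row per date and numeric variables metric_a and metric_b; none of these names come from the attached file.

```sas
/* Assumption: work.metrics holds both metrics on the same row for each date.
   Dataset and variable names are placeholders. */
data work.diff;
  set work.metrics;
  diff = metric_a - metric_b;   /* per-date difference between the two series */
run;

/* Mean difference with a 95% confidence interval (CLM). */
proc means data=work.diff n mean std stderr clm alpha=0.05;
  var diff;
run;

/* Equivalent paired t-test of H0: mean difference = 0. */
proc ttest data=work.metrics h0=0 alpha=0.05;
  paired metric_a*metric_b;
run;
```

The PAIRED statement works on the within-date differences, so it accounts for the two metrics being measured on the same dates. It still assumes the differences themselves are independent across dates, which may not hold for a time series.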

Occasional Contributor
Posts: 12

Re: Comparing Datasets (Performance)

Try using a DATA step with the UPDATE statement.

Step-by-Step Programming with Base SAS(R) Software
