Solved: How do I statistically compare variables with different size datasets?

Corinthian94 · Posted 01-27-2022 04:40 PM

Hi all, just wondering how I would statistically compare variables between two datasets that have different numbers of observations. Essentially, I'm trying to find out whether there are any significant differences in variables for a population between baseline and follow-up, where the follow-up includes only about half of the baseline participants. I'm looking for a way to do this statistically (i.e., to get comparison p-values). Anybody have some insight? Thanks!

Reeza · Posted 01-27-2022 05:29 PM

You cannot do a paired t-test on the full data in this case, but you can do an ordinary (independent two-sample) t-test.

If you match the records and include only those participants who appear at both baseline and follow-up, you can do a paired t-test.

I would probably do both and then see what I got.
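As a rough sketch of the matched approach, the step below keeps only the ids present at both time points and runs a paired test on them. The data set names (baseline, followup), the variable names (base_val, follow_val), and the simulated values are assumptions for illustration, not the poster's actual data.

```sas
/* Hypothetical data: 100 baseline subjects, roughly half of whom return */
data baseline;
  call streaminit(2022);
  do id = 1 to 100;
     somevar = rand('normal');
     output;
  end;
run;

data followup;
  call streaminit(2023);
  do id = 1 to 100;
     /* keep roughly half of the baseline ids */
     if rand('uniform') < 0.5 then do;
        somevar = rand('normal');
        output;
     end;
  end;
run;

/* MERGE with BY requires both sets to be sorted by id */
proc sort data=baseline; by id; run;
proc sort data=followup; by id; run;

/* One row per subject, keeping only subjects seen at both time points */
data paired;
  merge baseline (in=in_base   rename=(somevar=base_val))
        followup (in=in_follow rename=(somevar=follow_val));
  by id;
  if in_base and in_follow;
run;

proc ttest data=paired;
  paired base_val*follow_val;
run;
```

The paired test uses fewer observations than the unpaired one, but it works on within-subject differences, so the two results need not agree.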

I would also look at the statistics of who was included in the first and second samples to check that they are similar on demographic data, so that you don't, for example, have all older individuals in the second sample rather than younger, or all one gender. If they have the same distributions, you can somewhat assume they're representative and compare. But if they don't, you may need to adjust for that.
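One way to run that check is to flag which baseline participants returned and then compare demographics across the flag. This is only a sketch: the variables age and sex, the flag returned, and the simulated data are assumed names and values, not anything from the original data.

```sas
/* Hypothetical baseline with two demographic variables */
data baseline;
  call streaminit(2022);
  length sex $1;
  do id = 1 to 100;
     age = round(rand('normal', 50, 10));
     if rand('uniform') < 0.5 then sex = 'F';
     else sex = 'M';
     output;
  end;
run;

/* Hypothetical follow-up: ids of roughly half of the baseline subjects */
data followup;
  call streaminit(2023);
  do id = 1 to 100;
     if rand('uniform') < 0.5 then output;
  end;
run;

/* Flag each baseline subject: 1 = has follow-up, 0 = baseline only.
   Both sets are already in id order here; sort first with real data. */
data flagged;
  merge baseline (in=in_base)
        followup (in=in_follow keep=id);
  by id;
  if in_base;
  returned = in_follow;
run;

/* Continuous demographic: compare mean age by return status */
proc ttest data=flagged;
  class returned;
  var age;
run;

/* Categorical demographic: chi-square test of sex by return status */
proc freq data=flagged;
  tables sex*returned / chisq;
run;
```

A clear difference in either test would suggest that the follow-up group is not a representative subset of the baseline group.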

ballardw · Posted 01-27-2022 06:16 PM

Maybe this can give you a start. It creates two data sets with common variables, including a randomly generated "measurement" variable, combines the sets, adds a variable identifying the source, and then uses that source variable as the Class variable to identify the data group for a Ttest of the measurement variable.

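/* "Baseline" set: 100 observations drawn from a standard normal */
data setone;
  do id=1 to 100;
     somevar = rand('normal');
     output;
  end;
run;

/* "Follow-up" set: 50 observations from the same distribution */
data settwo;
  do id=1001 to 1050;
     somevar = rand('normal');
     output;
  end;
run;

/* Stack the two sets and record which one each row came from */
data totest;
  set setone (in=in1)
      settwo
  ;
  if in1 then Set='Baseline';
  else Set='Follow';
run;

/* Two-sample t-test of somevar between the two groups */
proc ttest data=totest;
  var somevar;
  class set;
run;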

A similar approach would work for other tests, for example ones that use the Class variable as an independent variable in a regression model.
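As a minimal sketch of that idea, the step below fits the same comparison as a linear model, reusing the totest data set and the set variable from the code above. PROC GLM is my choice for the illustration; the post does not name a procedure.

```sas
/* Group membership as a classification effect in a linear model */
proc glm data=totest;
  class set;
  model somevar = set / solution;
run;
quit;
```

With a single two-level class effect, the F test for set is equivalent to the pooled (equal-variance) t-test, and further covariates could be added to the MODEL statement if the groups need adjusting.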

Because both sets of random values are drawn from the same distribution, and given the sizes chosen, I would be surprised to see a difference in the means. Change the rand call in the second data step to Rand('normal', 0.3, 1.6) or similar and you are pretty likely to get a significant difference.
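Written out, that change might look like the sketch below, which regenerates settwo and reruns the test. In rand('normal', 0.3, 1.6) the second argument is the mean and the third is the standard deviation. The step assumes setone from the earlier code already exists.

```sas
/* Follow-up set drawn from a shifted, wider normal distribution */
data settwo;
  do id=1001 to 1050;
     somevar = rand('normal', 0.3, 1.6);
     output;
  end;
run;

/* Rebuild the combined set with the new follow-up values */
data totest;
  set setone (in=in1)
      settwo
  ;
  if in1 then Set='Baseline';
  else Set='Follow';
run;

proc ttest data=totest;
  var somevar;
  class set;
run;
```

PROC TTEST prints both the pooled and the Satterthwaite results along with an equality-of-variances test. Since the two standard deviations now differ, the Satterthwaite row is the one to read.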
