Corinthian94
Obsidian | Level 7

Hi all, just wondering how I would statistically compare variables between two datasets that have different numbers of observations. Essentially I'm trying to find out whether there are any significant differences in variables for a population between baseline and follow-up, where the follow-up has only about half of the baseline participants. I'm looking for a way to test this formally (i.e. to get comparison p-values). Anybody have some insight? Thanks!

1 ACCEPTED SOLUTION

Reeza
Super User

You cannot do a paired t-test on the full datasets in this case, but you can do a regular two-sample (unpaired) t-test.

If you match participants by ID and keep only those who appear in both baseline and follow-up, you can do a paired t-test.

I would probably do both and then compare the results.

I would also look at the characteristics of who was included in the first and second datasets to make sure they are similar across demographic data, so that you don't, for example, have mostly older individuals in the second sample, or nearly all one gender. If the distributions are similar, you can somewhat assume the follow-up group is representative of the baseline group and compare them. If they are not, you may need to adjust for the differences.
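As a rough sketch of the matching and the demographic check described above: the dataset names (baseline, followup) and the variables age and gender are hypothetical, and the toy data is simulated only so the steps run on their own. It assumes both datasets share a participant id.

```sas
/* Toy data, simulated for illustration only (hypothetical names) */
data baseline;
  call streaminit(2024);
  do id = 1 to 100;
    age = 40 + round(10*rand('normal'));
    if rand('uniform') < 0.5 then gender = 'F';
    else gender = 'M';
    somevar = rand('normal');
    output;
  end;
run;

/* Assume about half of the participants return for follow-up */
data followup;
  set baseline (keep=id somevar);
  if mod(id, 2) = 0;
  somevar = somevar + 0.2*rand('normal');
run;

proc sort data=baseline; by id; run;
proc sort data=followup; by id; run;

/* Keep only participants present in both datasets */
data paired;
  merge baseline (in=inb keep=id somevar rename=(somevar=somevar_base))
        followup (in=inf keep=id somevar rename=(somevar=somevar_follow));
  by id;
  if inb and inf;
run;

/* Paired t-test on the matched participants */
proc ttest data=paired;
  paired somevar_base*somevar_follow;
run;

/* Flag each baseline participant by whether they returned */
data attrition;
  merge baseline (in=inb)
        followup (in=inf keep=id);
  by id;
  if inb;
  returned = inf;   /* 1 = has follow-up data, 0 = baseline only */
run;

/* Compare baseline demographics of returners vs. non-returners */
proc ttest data=attrition;
  class returned;
  var age;
run;

proc freq data=attrition;
  tables returned*gender / chisq;
run;
```

If the returners and non-returners look different on age or gender, that is the situation where some adjustment would be worth considering before comparing the groups.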

ballardw
Super User

Maybe this can give you a start. It creates two data sets with common variables, including a randomly generated "measurement" variable, combines the sets while adding a source-identification variable, and then uses that source variable as the Class variable to identify the data group in a Ttest of the measurement variable.

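/* "Baseline" set: 100 records with a random standard normal measurement */
data setone;
  do id=1 to 100;
     somevar = rand('normal');
     output;
  end;
run;

/* "Follow-up" set: 50 records drawn from the same distribution */
data settwo;
  do id=1001 to 1050;
     somevar = rand('normal');
     output;
  end;
run;

/* Stack the two sets and record which one each row came from */
data totest;
  set setone (in=in1)
      settwo
  ;
  if in1 then Set='Baseline';
  else Set='Follow';
run;

/* Unpaired two-sample t-test of somevar between the two groups */
proc ttest data=totest;
  var somevar;
  class set;
run;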

A similar approach would work for other procedures, where the Class variable could be used, for example, as an independent variable in a regression model.
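For instance, a minimal sketch of the same comparison written as a model, assuming the totest data set built above (the choice of Proc GLM here is just one option, not the only way to do it):

```sas
/* Same group comparison expressed as a linear model */
proc glm data=totest;
  class set;
  model somevar = set;
run;
quit;
```

With only two groups, the F test for set in this model is equivalent to the pooled (equal variance) t-test, and other covariates could be added to the model statement if the groups need adjusting.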

 

Because the random values in both sets are drawn from the same distribution, I would be surprised to see a difference in the means with these sample sizes. Change the second data step to use Rand('normal', 0.3, 1.6) or similar and you are pretty likely to get a significant difference.
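That change to the second data step would look something like this (mean 0.3 and standard deviation 1.6, as suggested above); the totest and Ttest steps are then rerun unchanged:

```sas
/* Follow-up set drawn from a shifted, wider normal distribution */
data settwo;
  do id=1001 to 1050;
     somevar = rand('normal', 0.3, 1.6);
     output;
  end;
run;
```

Since the two groups now have different variances, the Satterthwaite row of the Ttest output is probably the more appropriate one to read.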
