Hi all, just wondering how I would statistically compare variables if I had two datasets with different numbers of observations. Essentially I'm trying to find out whether there are any significant differences in variables for a population between baseline and follow-up, where the follow-up has only about half of the baseline participants. I'm looking for a way to do this statistically (i.e. to get comparison p-values). Anybody have some insight? Thanks!
You cannot do a paired t-test in this case, but you can do a regular (two-sample) t-test.
If you match participants and include only those who appear in both datasets, you can do a paired t-test.
I would probably do both and then see what I got.
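For the matched version, here is a rough sketch of how it might look. The dataset names (baseline, followup), the id variable, and the simulated values are all made up for illustration, and it assumes each participant has the same id in both datasets.

```sas
/* Hypothetical data: 100 baseline participants, half of whom return */
data baseline;
  call streaminit(2024);
  do id = 1 to 100;
    somevar = rand('normal');
    output;
  end;
run;

data followup;
  call streaminit(2025);
  do id = 2 to 100 by 2;
    somevar = rand('normal');
    output;
  end;
run;

/* Keep only participants present in both datasets.
   Both datasets are already in id order; otherwise sort by id first. */
data matched;
  merge baseline (in=in_base   rename=(somevar=somevar_base))
        followup (in=in_follow rename=(somevar=somevar_follow));
  by id;
  if in_base and in_follow;
run;

/* Paired t-test on the within-person difference */
proc ttest data=matched;
  paired somevar_base*somevar_follow;
run;
```

The paired test only uses people who came back, so it answers a slightly different question than the two-sample test on everyone.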
I would also look at the characteristics of who was included in the first and second samples to make sure they're similar across demographic data, so that you don't, for example, have all older individuals in the second sample rather than younger, or all one gender. If the distributions are the same, you can somewhat assume the follow-up group is representative and compare. If they're not, you may need to adjust for that.
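One possible way to run that check is to flag each baseline participant as returned or dropped and compare the demographics between the two groups. This is only a sketch: the datasets (base_demo, follow_ids) and the variables age and sex are hypothetical stand-ins for whatever demographics you actually have.

```sas
/* Hypothetical baseline data with two demographic variables */
data base_demo;
  call streaminit(2024);
  length sex $1;
  do id = 1 to 100;
    age = round(rand('normal', 50, 10));
    if rand('uniform') < 0.5 then sex = 'F';
    else sex = 'M';
    output;
  end;
run;

/* Hypothetical list of ids that came back for follow-up */
data follow_ids;
  do id = 2 to 100 by 2;
    output;
  end;
run;

/* Flag each baseline participant as returned or dropped */
data attrition;
  merge base_demo (in=in_base) follow_ids (in=in_follow);
  by id;
  if in_base;
  length status $8;
  if in_follow then status = 'Returned';
  else status = 'Dropped';
run;

/* Continuous demographic: compare mean age between the two groups */
proc ttest data=attrition;
  class status;
  var age;
run;

/* Categorical demographic: chi-square test of sex by status */
proc freq data=attrition;
  tables status*sex / chisq;
run;
```

If these tests (or just the descriptive tables) show clear differences, that is a hint the follow-up group is not representative of the baseline group.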
Maybe this can give you a start. It creates two sets with common variables, including a randomly generated "measurement" variable, combines the sets while adding a variable that identifies the source, and then uses that source variable as the Class variable to identify the data groups for a Ttest of the measurement variable.
data setone;
   do id=1 to 100;
      somevar = rand('normal');
      output;
   end;
run;

data settwo;
   do id=1001 to 1050;
      somevar = rand('normal');
      output;
   end;
run;

data totest;
   set setone (in=in1) settwo;
   if in1 then Set='Baseline';
   else Set='Follow';
run;

proc ttest data=totest;
   var somevar;
   class set;
run;
A similar approach would work for other tests that might use the Class variable as an independent variable, such as a regression model.
With the random values here drawn from the same distribution for both sets, and with these sample sizes, I would be surprised to see a difference in the means. Change the second data step to use Rand('normal', 0.3, 1.6) or similar and you are pretty likely to get a significant difference.
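As a sketch of that change, the version below shifts the second set and then runs both the Ttest and a model with the source as a Class variable. The seeds, the variable name source (used here instead of Set), and the choice of Proc GLM are assumptions for illustration, not the only way to do it.

```sas
data setone;
  call streaminit(1);
  do id = 1 to 100;
    somevar = rand('normal');            /* mean 0, std dev 1 */
    output;
  end;
run;

data settwo;
  call streaminit(2);
  do id = 1001 to 1050;
    somevar = rand('normal', 0.3, 1.6);  /* mean 0.3, std dev 1.6 */
    output;
  end;
run;

/* Stack the two sets and label where each row came from */
data totest;
  length source $8;
  set setone (in=in1) settwo;
  if in1 then source = 'Baseline';
  else source = 'Follow';
run;

/* Two-sample t-test of the measurement by source */
proc ttest data=totest;
  class source;
  var somevar;
run;

/* Same comparison as a linear model with source as the predictor */
proc glm data=totest;
  class source;
  model somevar = source / solution;
run;
quit;
```

Because the two sets now have different standard deviations, the Satterthwaite row of the Ttest output is the one to read. The GLM version assumes equal variances, so it lines up with the pooled row instead.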