Hi,
I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.
I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.
How should I proceed?
Thank you in advance,
Natasha Figueiredo
PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types.
It's usually easier if you include them in the same data set.
I would stack the data sets as follows:
data combined;
set included (in=A)
excluded (in=B);
length status $12.;
status='Included';
if B then status='Excluded';
run;
*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;
Hope that helps get you started.
@natasha_nf wrote:
Hi,
I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.
I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.
How should I proceed?
Thank you in advance,
Natasha Figueiredo
@natasha_nf see if this information assist you in how to prepare data for usage with proc compare.
https://www.sascrunch.com/proc-compare.html
PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types.
It's usually easier if you include them in the same data set.
I would stack the data sets as follows:
data combined;
set included (in=A)
excluded (in=B);
length status $12.;
status='Included';
if B then status='Excluded';
run;
*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;
Hope that helps get you started.
@natasha_nf wrote:
Hi,
I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.
I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.
How should I proceed?
Thank you in advance,
Natasha Figueiredo
Thank you Reeza!
Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.
data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;
Then I performed all the t and x2 tests I needed.
@natasha_nf wrote:
Thank you Reeza!
Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.
data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;
Then I performed all the t and x2 tests I needed.
How could that MERGE work? You did not include a BY variable. So if ORIGINAL has 100 observations and EXCLUDED as 50 observations then the first 50 observations will be FROM both input datasets (A and B) and the last 50 will only be from the ORIGINAL dataset. Is your ORIGINAL dataset really sorted so that all of the EXCLUDED observations are first? Are they in the same order as the observations in EXCLUDED? If not the the values of the common variables will be overwritten with the values from EXCLUDED dataset.
Hi Tom,
I must admit I don't really see what you mean, I'm sorry.
Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.
Thank you
@natasha_nf wrote:
Hi Tom,
I must admit I don't really see what you mean, I'm sorry.
Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.
Thank you
To see the potential havoc of merging without a BY statement run this example and look at the the differences in the TEST1 and TEST2 datasets.
data original excluded ;
set sashelp.class ;
output original;
if sex='M' then output excluded;
run;
data test1;
merge original(in=in1) excluded(in=in2);
excluded=in2;
run;
data test2;
merge original(in=in1) excluded(in=in2);
by name;
excluded=in2;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.