BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
natasha_nf
Fluorite | Level 6

Hi,

 

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

 

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

 

How should I proceed?

 

Thank you in advance,

 

Natasha Figueiredo

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types. 

It's usually easier if you include them in the same data set. 

 

I would stack the data sets as follows:

 

data combined;

set included (in=A)
      excluded (in=B);
length status $12.;

status='Included';

if B then status='Excluded';

run;


*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;

 

Hope that helps get you started.

 


@natasha_nf wrote:

Hi,

 

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

 

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

 

How should I proceed?

 

Thank you in advance,

 

Natasha Figueiredo

 

 


 

View solution in original post

7 REPLIES 7
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

@natasha_nf see if this information assist you in how to prepare data for usage with proc compare.

https://www.sascrunch.com/proc-compare.html

 

Reeza
Super User

PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types. 

It's usually easier if you include them in the same data set. 

 

I would stack the data sets as follows:

 

data combined;

set included (in=A)
      excluded (in=B);
length status $12.;

status='Included';

if B then status='Excluded';

run;


*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;

 

Hope that helps get you started.

 


@natasha_nf wrote:

Hi,

 

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

 

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

 

How should I proceed?

 

Thank you in advance,

 

Natasha Figueiredo

 

 


 

natasha_nf
Fluorite | Level 6

Thank you Reeza!

 

Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.


data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;

 

Then I performed all the t and x2 tests I needed.

Reeza
Super User
That's the exact same thing, you just used a different word that included/excluded.

I preferred the words since that goes as labels for output 😉
Tom
Super User Tom
Super User

@natasha_nf wrote:

Thank you Reeza!

 

Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.


data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;

 

Then I performed all the t and x2 tests I needed.


How could that MERGE work?  You did not include a BY variable.  So if ORIGINAL has 100 observations and EXCLUDED as 50 observations then the first 50 observations will be FROM both input datasets (A and B) and the last 50 will only be from the ORIGINAL dataset.  Is your ORIGINAL dataset really sorted so that all of the EXCLUDED observations are first?  Are they in the same order as the observations in EXCLUDED?  If not the the values of the common variables will be overwritten with the values from EXCLUDED dataset.

natasha_nf
Fluorite | Level 6

Hi Tom,

 

I must admit I don't really see what you mean, I'm sorry.

Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.

 

Thank you

Tom
Super User Tom
Super User

@natasha_nf wrote:

Hi Tom,

 

I must admit I don't really see what you mean, I'm sorry.

Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.

 

Thank you


To see the potential havoc of merging without a BY statement run this example and look at the the differences in the TEST1 and TEST2 datasets.

data original excluded ;
 set sashelp.class ;
 output original;
 if sex='M' then output excluded;
run;

data test1;
  merge original(in=in1) excluded(in=in2);
  excluded=in2;
run;

data test2;
  merge original(in=in1) excluded(in=in2);
  by name;
  excluded=in2;
run;

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2185 views
  • 3 likes
  • 4 in conversation