Solved: Re: How can I compare variables from different tables?

natasha_nf · Posted 05-27-2019 09:15 AM

Hi,

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

How should I proceed?

Thank you in advance,

Natasha Figueiredo

Reeza · Posted 05-27-2019 11:16 AM

PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types.

It's usually easier if you include them in the same data set.

I would stack the data sets as follows:

data combined;

set included (in=A)
      excluded (in=B);
length status $12.;

status='Included';

if B then status='Excluded';

run;


*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;

Hope that helps get you started.

@natasha_nf wrote:

Hi,

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

How should I proceed?

Thank you in advance,

Natasha Figueiredo

View solution in original post

VDD · Posted 05-27-2019 09:27 AM

@natasha_nf see if this information assist you in how to prepare data for usage with proc compare.

https://www.sascrunch.com/proc-compare.html

Reeza · Posted 05-27-2019 11:16 AM

PROC COMPARE is not what is needed here. You aren't comparing line by line records, you're comparing statistical distributions and such. Usually the standard is to do t-tests and chi-square tests depending on the variable types.

It's usually easier if you include them in the same data set.

I would stack the data sets as follows:

data combined;

set included (in=A)
      excluded (in=B);
length status $12.;

status='Included';

if B then status='Excluded';

run;


*tests if age is the same across the two groups;
proc ttest data=combined;
class included;
var age;
run;

Hope that helps get you started.

@natasha_nf wrote:

Hi,

I would like to compare the sociodemographic characteristics (mean age, bmi, education distribution, etc.) between individuals included and excluded (due to aberrant/missing data) from my analysis.

I have a table "included" with around 43k individuals and a table "excluded" with around 5k individuals.

How should I proceed?

Thank you in advance,

Natasha Figueiredo

natasha_nf · Posted 05-27-2019 11:27 AM

Thank you Reeza!

Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.

data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;

Then I performed all the t and x2 tests I needed.

Reeza · Posted 05-27-2019 01:59 PM

That's the exact same thing, you just used a different word that included/excluded.

I preferred the words since that goes as labels for output 😉

Tom · Posted 05-27-2019 02:08 PM

@natasha_nf wrote:

Thank you Reeza!

Actually I've done something slightly different. I merged the original (with all included and excluded individuals) and the "excluded" tables and created a binary "included" variable.

data incexc;
merge original (in=a) excluded (in=b);
if a and not b then included='1';
else included='0';
run;

Then I performed all the t and x2 tests I needed.

How could that MERGE work? You did not include a BY variable. So if ORIGINAL has 100 observations and EXCLUDED as 50 observations then the first 50 observations will be FROM both input datasets (A and B) and the last 50 will only be from the ORIGINAL dataset. Is your ORIGINAL dataset really sorted so that all of the EXCLUDED observations are first? Are they in the same order as the observations in EXCLUDED? If not the the values of the common variables will be overwritten with the values from EXCLUDED dataset.

natasha_nf · Posted 05-28-2019 05:34 AM

Hi Tom,

I must admit I don't really see what you mean, I'm sorry.

Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.

Thank you

Tom · Posted 05-28-2019 09:11 AM

@natasha_nf wrote:

Hi Tom,

I must admit I don't really see what you mean, I'm sorry.

Anyway, I included the by ID variable and indeed, it changed the order the dataset is sorted. However, I did not affect my tests.

Thank you

To see the potential havoc of merging without a BY statement run this example and look at the the differences in the TEST1 and TEST2 datasets.

data original excluded ;
 set sashelp.class ;
 output original;
 if sex='M' then output excluded;
run;

data test1;
  merge original(in=in1) excluded(in=in2);
  excluded=in2;
run;

data test2;
  merge original(in=in1) excluded(in=in2);
  by name;
  excluded=in2;
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away