Solved: Re: Missing data

mmraja · Posted 09-09-2015 01:29 PM

Hi there,

I'm having some trouble when I'm trying to make some new datasets.

I have survey data from 2 separate cohorts which are in two separate data sets. I've pooled them into one data set (called pooled) and pooled the variables of interest into new pooled variables.

The problem that I'm having is that the same variable from the cohort1 and pooled datasets have different number of missing values.

Example: there's a variable in cohort1 dataset called netcst1 with ~2000 missing values (out of ~600,000 so not really a large percent). The same variable that has been put into the pooled dataset has 500,000+ missing values, and I can't figure out why that is.

Syntax-wise, this is what I've done:

DATA pooled;
SET cohort1 cohort2;

....

/*I've created the new pooled variables I need here*/

...

RUN;

In cohort1 (and therefore in pooled), there is a variable called netcst1 (which I've verified through a PROC CONTENTS), so these two statements give me two different answers:

(I) proc means data = cohort1 nmiss;

var netcst1;

run;

(II) proc means data = pooled nmiss;

var netcst1;

run;

Shouldn't I be getting the same number of missing values for both? What have I done wrong for (I) to give me about 2000 missing values and for (II) to give me more than 500,000 missing values?

Thanks for your help

stevyfargose · Posted 09-09-2015 03:27 PM

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
var VarA;
run;
proc means data = b n nmiss;
var VarB;
run;
data joined;
set a b;
run;
proc means data = joined n nmiss;
var _numeric_;
run;

the result:

Obs num VarA 1 2 3

1	7
2	8
3	.

Obs num VarB 1 2 3

4	6
5	.
6	3

The MEANS Procedure

Analysis Variable : VarA N N Miss

2

1

The MEANS Procedure

Analysis Variable : VarB N N Miss

2

1

The MEANS Procedure

Variable N N Miss

num

VarA

VarB

6

2

0

4

View solution in original post

ChrisWard · Posted 09-09-2015 01:44 PM

Is netcst1 on cohort2? if not how is netcst1 calculated for observations coming from cohort2?

mmraja · Posted 09-09-2015 02:22 PM

So the variable netcst1 exists in cohort1 only and then a different variable, call it netcst1a, exists only in cohort2. In the pooled data set, the variable poolednetcst brings them together.

ChrisWard · Posted 09-09-2015 02:49 PM

How many missing from cohort2?

proc means data = cohort2 nmiss;
    var netcst1a;
run;

mmraja · Posted 09-09-2015 03:15 PM

I believe 2400 or so values are missing from cohort2, which is completely reasonable considering the size of the dataset.

stevyfargose · Posted 09-09-2015 03:27 PM

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
var VarA;
run;
proc means data = b n nmiss;
var VarB;
run;
data joined;
set a b;
run;
proc means data = joined n nmiss;
var _numeric_;
run;

the result:

Obs num VarA 1 2 3

1	7
2	8
3	.

Obs num VarB 1 2 3

4	6
5	.
6	3

The MEANS Procedure

Analysis Variable : VarA N N Miss

2

1

The MEANS Procedure

Analysis Variable : VarB N N Miss

2

1

The MEANS Procedure

Variable N N Miss

num

VarA

VarB

6

2

0

4

mmraja · Posted 09-09-2015 03:43 PM

Ohhhhh that makes so much sense now. Is there a way to get rid of that problem? Maybe if I used a MERGE statement rather than a SET statement?

stevyfargose · Posted 09-09-2015 04:00 PM

You can use Merge statement if you have common variable in both data set, but than you will have to sort your data first.

Tom · Posted 09-12-2015 07:56 AM

You need to add a CLASS statement to your PROC MEANS so the cohorts are treated as separate groups.

If you do not already have a variable that defines the cohort you could create one during the step that combines the data.

data pooled ;
   length cohort indsname $50 ;
   set cohort1 cohort2 indsname=indsname ;
   cohort=indsname ;
   ...
run;

proc means ;
   class cohort ;
...

Ready to join fellow brilliant minds for the SAS Hackathon?