SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Missing data

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 11
Accepted Solution

Missing data

Hi there,

 

I'm having some trouble when I'm trying to make some new datasets.

 

I have survey data from 2 separate cohorts which are in two separate data sets. I've pooled them into one data set (called pooled) and pooled the variables of interest into new pooled variables. 

 

The problem that I'm having is that the same variable from the cohort1 and pooled datasets have different number of missing values.

Example: there's a variable in cohort1 dataset called netcst1 with ~2000 missing values (out of ~600,000 so not really a large percent). The same variable that has been put into the pooled dataset has 500,000+ missing values, and I can't figure out why that is.

 

Syntax-wise, this is what I've done:

DATA pooled;
SET cohort1 cohort2;

 

....

/*I've created the new pooled variables I need here*/

...

RUN;

 

In cohort1 (and therefore in pooled), there is a variable called netcst1 (which I've verified through a PROC CONTENTS), so these two statements give me two different answers:

(I)  proc means data = cohort1 nmiss;

              var netcst1;

     run;

(II) proc means data = pooled nmiss;

              var netcst1;

     run;

 

Shouldn't I be getting the same number of missing values for both? What have I done wrong for (I) to give me about 2000 missing values and for (II) to give me more than 500,000 missing values?

 

Thanks for your help

 


Accepted Solutions
Solution
‎09-25-2015 06:23 AM
Occasional Contributor
Posts: 12

Re: Missing data

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
  var VarA;
run;
proc means data = b n nmiss;
  var VarB;
run;
data joined;
 set a b;
 run;
 proc means data = joined n nmiss;
  var _numeric_;
run;

 

 

the result:

Obs num VarA 1 2 3

17
28
3.

Obs num VarB 1 2 3

46
5.
63

The MEANS Procedure

Analysis Variable : VarA N N Miss

21

The MEANS Procedure

Analysis Variable : VarB N N Miss

21

The MEANS Procedure

Variable N N Miss

num
VarA
VarB
6
2
2
0
4
4

View solution in original post


All Replies
SAS Employee
Posts: 12

Re: Missing data

Is netcst1 on cohort2? if not how is netcst1 calculated for observations coming from cohort2?
Occasional Contributor
Posts: 11

Re: Missing data

So the variable netcst1 exists in cohort1 only and then a different variable, call it netcst1a, exists only in cohort2. In the pooled data set, the variable poolednetcst brings them together.

SAS Employee
Posts: 12

Re: Missing data

How many missing from cohort2?

proc means data = cohort2 nmiss;
    var netcst1a;
run;
Occasional Contributor
Posts: 11

Re: Missing data

I believe 2400 or so values are missing from cohort2, which is completely reasonable considering the size of the dataset.
Solution
‎09-25-2015 06:23 AM
Occasional Contributor
Posts: 12

Re: Missing data

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
  var VarA;
run;
proc means data = b n nmiss;
  var VarB;
run;
data joined;
 set a b;
 run;
 proc means data = joined n nmiss;
  var _numeric_;
run;

 

 

the result:

Obs num VarA 1 2 3

17
28
3.

Obs num VarB 1 2 3

46
5.
63

The MEANS Procedure

Analysis Variable : VarA N N Miss

21

The MEANS Procedure

Analysis Variable : VarB N N Miss

21

The MEANS Procedure

Variable N N Miss

num
VarA
VarB
6
2
2
0
4
4
Occasional Contributor
Posts: 11

Re: Missing data

Ohhhhh that makes so much sense now. Is there a way to get rid of that problem? Maybe if I used a MERGE statement rather than a SET statement?

Occasional Contributor
Posts: 12

Re: Missing data

You can use Merge statement if you have common variable in both data set, but than you will have to sort your data first.

Super User
Super User
Posts: 6,499

Re: Missing data

You need to add a CLASS statement to your PROC MEANS so the cohorts are treated as separate groups.

If you do not already have a variable that defines the cohort you could create one during the step that combines the data.

 

data pooled ;
   length cohort indsname $50 ;
   set cohort1 cohort2 indsname=indsname ;
   cohort=indsname ;
   ...
run;

proc means ;
   class cohort ;
...
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 462 views
  • 0 likes
  • 4 in conversation