BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mmraja
Fluorite | Level 6

Hi there,

 

I'm having some trouble when I'm trying to make some new datasets.

 

I have survey data from 2 separate cohorts which are in two separate data sets. I've pooled them into one data set (called pooled) and pooled the variables of interest into new pooled variables. 

 

The problem that I'm having is that the same variable from the cohort1 and pooled datasets have different number of missing values.

Example: there's a variable in cohort1 dataset called netcst1 with ~2000 missing values (out of ~600,000 so not really a large percent). The same variable that has been put into the pooled dataset has 500,000+ missing values, and I can't figure out why that is.

 

Syntax-wise, this is what I've done:

DATA pooled;
SET cohort1 cohort2;

 

....

/*I've created the new pooled variables I need here*/

...

RUN;

 

In cohort1 (and therefore in pooled), there is a variable called netcst1 (which I've verified through a PROC CONTENTS), so these two statements give me two different answers:

(I)  proc means data = cohort1 nmiss;

              var netcst1;

     run;

(II) proc means data = pooled nmiss;

              var netcst1;

     run;

 

Shouldn't I be getting the same number of missing values for both? What have I done wrong for (I) to give me about 2000 missing values and for (II) to give me more than 500,000 missing values?

 

Thanks for your help

 

1 ACCEPTED SOLUTION

Accepted Solutions
stevyfargose
Obsidian | Level 7

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
  var VarA;
run;
proc means data = b n nmiss;
  var VarB;
run;
data joined;
 set a b;
 run;
 proc means data = joined n nmiss;
  var _numeric_;
run;

 

 

the result:

Obs num VarA 1 2 3

17
28
3.

Obs num VarB 1 2 3

46
5.
63

The MEANS Procedure

Analysis Variable : VarA N N Miss

21

The MEANS Procedure

Analysis Variable : VarB N N Miss

21

The MEANS Procedure

Variable N N Miss

num
VarA
VarB
6
2
2
0
4
4

View solution in original post

8 REPLIES 8
ChrisWard
SAS Employee
Is netcst1 on cohort2? if not how is netcst1 calculated for observations coming from cohort2?
mmraja
Fluorite | Level 6

So the variable netcst1 exists in cohort1 only and then a different variable, call it netcst1a, exists only in cohort2. In the pooled data set, the variable poolednetcst brings them together.

ChrisWard
SAS Employee

How many missing from cohort2?

proc means data = cohort2 nmiss;
    var netcst1a;
run;
mmraja
Fluorite | Level 6
I believe 2400 or so values are missing from cohort2, which is completely reasonable considering the size of the dataset.
stevyfargose
Obsidian | Level 7

Since it is a vertical join, the pro means is going to give result of total no of observation from both data set.

example:

data a;
input num VarA ;
datalines;
1 7
2 8
3 .
;
proc print data=a;
run;
data b;
input num VarB ;
datalines;
4 6
5 .
6 3
;
proc print data=b;
run;
proc means data = a n nmiss;
  var VarA;
run;
proc means data = b n nmiss;
  var VarB;
run;
data joined;
 set a b;
 run;
 proc means data = joined n nmiss;
  var _numeric_;
run;

 

 

the result:

Obs num VarA 1 2 3

17
28
3.

Obs num VarB 1 2 3

46
5.
63

The MEANS Procedure

Analysis Variable : VarA N N Miss

21

The MEANS Procedure

Analysis Variable : VarB N N Miss

21

The MEANS Procedure

Variable N N Miss

num
VarA
VarB
6
2
2
0
4
4
mmraja
Fluorite | Level 6

Ohhhhh that makes so much sense now. Is there a way to get rid of that problem? Maybe if I used a MERGE statement rather than a SET statement?

stevyfargose
Obsidian | Level 7

You can use Merge statement if you have common variable in both data set, but than you will have to sort your data first.

Tom
Super User Tom
Super User

You need to add a CLASS statement to your PROC MEANS so the cohorts are treated as separate groups.

If you do not already have a variable that defines the cohort you could create one during the step that combines the data.

 

data pooled ;
   length cohort indsname $50 ;
   set cohort1 cohort2 indsname=indsname ;
   cohort=indsname ;
   ...
run;

proc means ;
   class cohort ;
...

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 2252 views
  • 0 likes
  • 4 in conversation