Solved: subset dataset if all obs for a variable is not missing for an ID

d0816 · Posted 07-27-2018 02:48 PM

Hi,

Here is a sample dataset that I want to split into dataset A and dataset B. In dataset B, keep the PINs (IDs) for which all obs of Session are blank. If any obs of "Session" for a PIN is not blank, then keep all the obs for that PIN in dataset A. Please suggest.

dataset
ID	Level	Session
1	Z
2	X
2	Y
2	Z
3	Z
3	X
4	X	A
4	X
4	Y	B

dataset B
ID	Level	Session
1	Z
2	X
2	Y
2	Z
3	Z
3	X

dataset A
ID	Level	Session
4	X	A
4	X
4	Y	B

mkeintz · Posted 07-27-2018 03:00 PM

Merge a subset of non-blanks with the entire dataset:

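dm 'clear log';
/* Sample data; MISSOVER leaves Session blank when a line has no third value */
data have;
   infile datalines missover;
   input id Level :$1. Session :$1.;
datalines;
1 Z
2 X
2 Y
2 Z
3 Z
3 X
4 X A
4 X
4 Y B
;
run;

/* Left side: only the rows with a non-blank Session. Right side: every row. */
/* ANYNONBLANKS is 1 for a whole BY group when the ID has any non-blank row. */
/* HAVE must already be sorted by ID.                                        */
data a b;
   merge have (where=(session^=' ') in=anynonblanks)  have;
   by id;
   if anynonblanks then output a;   /* IDs with at least one non-blank Session (ID 4 in the example) */
   else output b;                   /* IDs whose Session is always blank (IDs 1-3 in the example) */
run;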

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

d0816 · Posted 07-27-2018 03:27 PM

This worked in my original dataset. Thank you so much.

novinosrin · Posted 07-27-2018 04:44 PM

Nice one, @mkeintz. How did the record 4 X ' ' (blank Session) come to be flagged as true for anynonblanks? What am I missing?

mkeintz · Posted 07-27-2018 08:38 PM

@novinosrin

There are two attributes of MERGE being utilized here.

First, values in the "right" dataset supersede those in the "left", for all common variables. In this case, for id 4 the "left" has the two non-blank observations, in this order:
  1st rec: 4 X A
  2nd rec: 4 Y B
but the "right" dataset has all the ID=4 observations, in this order:
  1st rec: 4 X A
  2nd rec: 4 X blank
  3rd rec: 4 Y B

Within a BY group, the 1st rec matches with the 1st rec, the 2nd with the 2nd, etc. So the merge results are
  1st rec: 4 X A is "superseded" by 4 X A
  2nd rec: 4 Y B is "superseded" by 4 X blank
  3rd rec: 4 Y B (see below) is "superseded" by 4 Y B

Second, when either the left or right merged dataset is shorter within a BY group, its last observation is repeatedly matched to all the "excess" observations of the dataset with the longer BY group. That's why the 3rd rec above shows "4 Y B" propagated from the left. The "in=" flag is propagated as well, so the "excess" observation goes to the same destination as all the other ID=4 obs. This is also why the 4 X blank record is flagged: anynonblanks reflects whether the ID has any non-blank records, not the Session value of the current record.
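To see both behaviours in the log, here is a minimal sketch that keeps the left-hand values under separate names so they are not overwritten by the right-hand ones. The names nonblank, trace, left_level, left_session and flag are made up for this illustration and are not part of the original solution; the data is the sample from the question, typed with spaces.

```sas
/* Sample data from the question */
data have;
   infile datalines missover;
   input id Level :$1. Session :$1.;
datalines;
1 Z
2 X
2 Y
2 Z
3 Z
3 X
4 X A
4 X
4 Y B
;
run;

/* Hypothetical helper: the non-blank subset, with its variables renamed */
/* so the left-hand values stay visible after the merge.                 */
data nonblank;
   set have;
   where session ^= ' ';
   rename Level=left_level Session=left_session;
run;

data trace;
   merge nonblank (in=anynonblanks) have;
   by id;
   flag = anynonblanks;   /* IN= variables are not written out, so keep a copy */
   put id= left_level= left_session= Level= Session= flag=;
run;
```

For ID 4 the log should show left_level/left_session as X/A, then Y/B, then Y/B again (the last left-hand observation carried onto the "excess" row), with flag=1 on all three lines. For IDs 1 to 3 the left-hand values stay blank and flag=0.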

In fact, if you merge more than two datasets, then the principle is extended: for any common variable in multiple datasets, the rightmost value prevails, assuming the variable is of the same type (numeric or character) in all datasets. If a variable is of both types, the merge fails. In the case of other attributes (length, label, format), they are inherited from the leftmost dataset, since that's the first encounter the SAS compiler has with the variable.
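A small sketch of that rule, using three made-up one-row datasets (one, two, three) that share a character variable x with different lengths and values. Nothing here comes from the original data; it only illustrates the value-versus-attribute point.

```sas
/* Hypothetical datasets: same variable x, different lengths and values */
data one;   length x $3; id = 1; x = 'one';   run;
data two;   length x $5; id = 1; x = 'two';   run;
data three; length x $8; id = 1; x = 'three'; run;

data all;
   merge one two three;   /* value of x comes from THREE, length of x from ONE */
   by id;
run;

proc contents data=all; run;   /* check the length of x */
proc print data=all; run;      /* check the value of x */
```

Here x should end up with length 3 (from the leftmost dataset) while holding the rightmost value, so it prints as thr, and the log should carry a warning about multiple lengths for x. That truncation is a good reason to keep lengths consistent across the datasets being merged.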

To summarize: what does MERGE X Y; BY ID; do when an ID group has fewer observations in (say) X than in Y? It propagates the last observation in X so that it is associated with the "excess" observations in Y, and for any variable the two datasets share, the value from Y supersedes the value from X.

novinosrin · Posted 07-27-2018 10:38 PM

@mkeintz Thank you so much for explaining at length. I really appreciate the privilege of receiving your time and knowledge.
