Removing ALL observations in by group as long as all entries within group are duplications

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

Removing ALL observations in by group as long as all entries within group are duplications

Hi guys, 

 

I am new to SAS and I am trying to remove groups if they fulfil two conditions. I currently have this data set:

ID ID_2 ID_3;

A 1 1

A 1 1

A 1 1

A 2 0

A 2 1

B 3 0

B 3 0

I am grouping by ID then by ID_2.

I want to remove ALL entries in the by groups as long as (1) there exists duplication across all three variables - I don't just want to remove the duplicates, I would like to remove the entire group AND (2) this duplication involves value '1' in ID_3 across all rows in each by group.

In other words, the outcome I want is:

ID ID_2 ID_3

A 2 0

A 2 1

B 3 0

B 3 0

I have spent at least 5 hours on this and I have tried various methods:

  • first. and last. (this does not guarantee that all observations in the by group match)

  • nodup (this method only removes the duplicates - I would like to remove even the first row of the group)

  • lag (again, the first row of the group stays which is not what I want)

I am open to using proc sql as well. Would really appreciate any input at all, thank you in advance!


Accepted Solutions
Solution
‎11-11-2016 03:55 AM
Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications

This might satisfy your requirements (at least it does with the example data):

data have;
input id $ id_2 id_3;
cards;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 3 0
B 3 0
;
run;

proc sort
  data=have
  out=int /* this is necessary to preserve have as is, can be deleted afterwards */
  noduprec
  dupout=check
;
by id id_2;
run;

data want;
merge
  have (in=a)
  check (in=b rename=(id_3=checkvar) where=(checkvar = 1))
;
by id id_2;
if a and not b;
drop checkvar;
run;

Note that have needs to be sorted by id id_2; initially.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Posts: 19,090

Re: Removing ALL observations in by group as long as all entries within group are duplications

You can use the having clause within PROC SQL to filter your data. 

 

Untested

 

proc sql;

select * 

from have

group by ID, ID2,ID3

having count(*) = 1 or max(ID3) eq 0;

quit; 

Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

Hi Reeza, 

 

Thank you for your reply. The method you suggested did not work. It did not change the original data. Also, would the condition max(ID3) eq 0 exclude those groups with a combination of 0 and 1 in ID3?

Super User
Posts: 19,090

Re: Removing ALL observations in by group as long as all entries within group are duplications

Did not work isn't helpful. How did it not work. I didn't create a dataset but you should have seen results that matched your requirements displayed. 

 

If iyou want to create a table add the following line after the proc SQL statement. This generates a new dataset called WANT with the desired output. If you have a combination of 0/1 in ID3 they are unique groups and won't be excluded. 

 

create table want as 

 

If it still 'doesn't work' explain in detail, include the code AND log, and examples of cases that didn't work if required. 

 

Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

I have tried this too but again, it didn't work.......

 

proc sort data = have;

by ID, ID2, ID3; 

run;

 

data nodups dups;

  set have;

  by ID, ID2, ID3;

  if first.ID3 and last.ID3 then output nodups;

  else output dups;

run;

 

data want invalid;
merge have (in=a) dups (in=b);
by ID ID2;
if b then output invalid;
else output want;
run;

Solution
‎11-11-2016 03:55 AM
Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications

This might satisfy your requirements (at least it does with the example data):

data have;
input id $ id_2 id_3;
cards;
A 1 1
A 1 1
A 1 1
A 2 0
A 2 1
B 3 0
B 3 0
;
run;

proc sort
  data=have
  out=int /* this is necessary to preserve have as is, can be deleted afterwards */
  noduprec
  dupout=check
;
by id id_2;
run;

data want;
merge
  have (in=a)
  check (in=b rename=(id_3=checkvar) where=(checkvar = 1))
;
by id id_2;
if a and not b;
drop checkvar;
run;

Note that have needs to be sorted by id id_2; initially.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

Thank you so much Kurt! Your method definitely worked when I ran your codes. When I tried applying it to my actual code, however, the original dataset just does not seem to change at all, extremely odd!  

Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications

I never change the original dataset, as I create a "want" dataset in the final step; if you want to overwrite your "have", use it in the data statement of the final step.

Although I recommend to create a sequence of datasets, as intermediate results and the initial state are preserved as a fall-back in case something goes wrong.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

Yes, I printed both want and have datasets but there seems to be no difference between the two. 

 

And yes, I like having a sequence of datasets, helps to track the steps. Smiley Happy

Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications


LovetoLearn wrote:

... applying it to my actual code, however,


so I think the issue lies in what is in that code. Only integrate different logical steps into one programmatic step when you can be sure that there are no unwanted interactions.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

Oh, I see where the problem lies. There are no entries in the check dataset when applied to my actual code. My actual have dataset has more variables than just the 3. Would that mean that we can't use the noduprec?

Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications

Just use a keep= list on the have dataset when doing the sort. Only take the variables that are needed when creating the intermediate datasets.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

[ Edited ]

Sorry, just to double-check, did you mean this? Say we have 2 extra variables: id_4 and id_5. 

 

proc sort 
data = have (keep = id_3 id_4 id_5);
by id id_2;
run;

proc
sort data=have out=int /* this is necessary to preserve have as is, can be deleted afterwards */ noduprec dupout=check ; by id id_2; run; data want; merge have (in=a) check (in=b rename=(id_3=checkvar) where=(checkvar = 1)) ; by id id_2; if a and not b; drop checkvar; run;

 

Super User
Posts: 7,422

Re: Removing ALL observations in by group as long as all entries within group are duplications

No, I meant that the first step should look like this:

proc sort
  data=have (keep=id id_2 id_3)
  out=int /* this is necessary to preserve have as is, can be deleted afterwards */
  noduprec
  dupout=check
;
by id id_2;
run;

Your initial have dataset should be sorted as a whole (all variables contained) by id and id_2, as that is needed for the final merge.

In your example date, this is the case.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Occasional Contributor
Posts: 9

Re: Removing ALL observations in by group as long as all entries within group are duplications

[ Edited ]

Hi Kurt, what if now the have dataset is as such?

(The new lines are in bold)

 

data have;
input id $ id_2 id_3 id_4$ id_5$ ;
cards;
A 1 1 a b
A 1 1 c d
A 1 1 e f
A 2 0 g h
A 2 0 g h
A 2 1 i j
A 2 1 i j
B 3 0 k l
B 3 0 m n
;
run;

 

The intended output dataset I want is:

data want;
input id $ id_2 id_3 id_4$ id_5$ ;
cards;
A 2 0 g h
A 2 0 g h
A 2 1 i j
A 2 1 i j
B 3 0 k l
B 3 0 m n
;
run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 17 replies
  • 490 views
  • 4 likes
  • 4 in conversation