Solved: Re: Remove duplicates

adityaa9z · Posted 07-24-2018 10:05 AM

Hi team,

I have a table HAVE with lets say two variables

X Y

1 A

2 A

3 A

4 B

5 C

6 C

7 D

In the output WANT I only want the ones which have distinct Y value

OUT

4 B

7 D

novinosrin · Posted 07-24-2018 10:10 AM

data have;
input X Y $;
cards;
1 A
2 A
3 A
4 B
5 C
6 C
7 D
;
 
data want;

set have;

by y;

if last.y and first.y;

run;

assuming your dataset is sorted by y

data want;

set have;

by y;

if last.y and first.y;

run;

View solution in original post

novinosrin · Posted 07-24-2018 10:10 AM

data have;
input X Y $;
cards;
1 A
2 A
3 A
4 B
5 C
6 C
7 D
;
 
data want;

set have;

by y;

if last.y and first.y;

run;

assuming your dataset is sorted by y

data want;

set have;

by y;

if last.y and first.y;

run;

adityaa9z · Posted 07-24-2018 02:47 PM

Thanks. This should work since my DS is sorted by Y

SuryaKiran · Posted 07-24-2018 10:14 AM

Subquery for selecting the records with count 1 and filtering only those.

data  have;
input X :$1. Y :$1.;
datalines;
1 A
2 A
3 A
4 B
5 C
6 C
7 D
;
run;

proc sql;
select * from have
where y in ( select y 
				from have
				group by y
				having count(y)=1);
quit;

Thanks,
Suryakiran

novinosrin · Posted 07-24-2018 10:16 AM

data have;
input X Y $;
cards;
1 A
2 A
3 A
4 B
5 C
6 C
7 D
;

proc sql;
create table want as
select *
from have
group by y
having count(y)=1;
quit;

Reeza · Posted 07-24-2018 10:49 AM

Look at the UNIQUERECS option in PROC SORT.

SuryaKiran · Posted 07-24-2018 11:26 AM

@Reeza I don't think there is UNIQUERECS option for proc sort like NOUNIQUERECS. But, NOUNIQUEKEY with UNIQUEOUT= would give all the unique records. Correct me if I'm wrong.

proc sort data=have NOUNIQUEKEY  UNIQUEOUT=uniq  out=dups;;
by y;
run;

Thanks,
Suryakiran

Reeza · Posted 07-24-2018 02:37 PM

@SuryaKiran you're correct.

I think this was from a few years ago but essentially explains it:

https://chemicalstatistician.wordpress.com/2015/04/10/separating-unique-and-duplicate-variables-usin...

I thought there was some new features in 9.4 but I don't always recall things correctly for sure!

hashman · Posted 07-24-2018 07:26 PM

You've already received a number of nice solutions good both if your input is sorted and not (SQL). I've decided to chime in because the task is interesting from the standpoint of logic needed when a hash table is used for unduplication - which is not quite unnatural for someone whose nick is "hashman".

Normally, this logic goes as follows:

1. If the current record key-value is not in the table, store it there and output the record.

2. Otherwise, proceed to the next record.

However, in this case, it will not work, since what you need to output is not every record where a key-value is encountered first but only the records where the key is unique throughout the entire data set. So, what will work? Here's your sample input I've taken the liberty to augment with extra three dummy variables D1-D3 (to illustrate an extra variation on the theme downstream):

data have ;                     
  input X Y:$1. (D1-D3) (:$3.) ;
  cards ;                       
1 A 1A1 1A2 1A3                 
2 A 2A1 2A2 2A3                 
3 A 3A1 3A2 3A3                 
4 B 4B1 4B2 4B3                 
5 C 5C1 5C2 5C3                 
6 C 6C1 6C2 6C3                 
7 D 7D1 7D2 7D3                 
run ;

And here's one way to use the hash object to get what we want:

data _null_ ;                                       
  dcl hash h () ;                                   
  h.definekey ("y") ;                               
  h.definedata ("x", "y", "d1", "d2", "d3", "_n_") ;
  h.definedone () ;                                 
  do until (z) ;                                    
    set have end = z ;                              
    _n_ = h.check() ne 0 ;                          
    h.replace() ;                                   
  end ;                                             
  h.output (dataset:"want(where=(_n_))") ;          
run ;

How does this logic work? Well, we still store every unique key-value of Y in the table (along with the rest of the input variables), but at the same time, we mark the hash items corresponding to the input records unique by Y as _N_=1 and the rest - with _N_=0. Note that, maybe somewhat counterintuitively, the variable _N_ ends up in the output data set WANT, as the OUTPUT method doesn't drop it automatically (and we don't want to drop it, either, since it's in the WHERE clause). But it's not a big deal since in a subsequent step where WANT is used as input, _N_ will be auto-dropped,

Now, the reason I have included the extra satellite variables in the sample input is to illustrate that in the approach above, in order to appear in the output all of them need to be included in the data portion of the hash table. Now imagine that you have not 3 but, say, 30 satellite variables. The problem that their names (with all the quotes around them) are tedious to type is rather minor since it can be easily automated programmatically. A bigger issue is that it would overburden the hash table with extra memory footprint. This issue can be circumvented with what I call using a hash index. To wit, all we need to store in the data portion in addition to _N_ is the input record ID (aka the observation number) and proceed as follows:

data want ; 
  dcl hash h () ; 
  h.definekey ("y") ; 
  h.definedata ("p", "_n_") ; 
  h.definedone () ; 
  do p = 1 by 1 until (z) ; 
    set have (keep = y) end = z ;
    _n_ = h.check() ne 0 ; 
    h.replace() ; 
  end ; 
  dcl hiter ih ("h") ; 
  do while (ih.next() = 0) ; 
    if not _n_ then continue ; 
    set have point = p ; 
    output ; 
  end ; 
run ;

Note that in the first pass, only the key variable Y is kept, and we get the rest from the same file only for those records we need (i.e. where Y is unique). Needless to say, it will work best in the situations when the input record is wide and the number of records with unique Y is relatively small.

The reason I call this latter approach "using a hash index" should be obvious: The hash table H above is, well, nothing but an index into the data set HAVE containing the search key and corresponding RID. The only differences between it and a "real" SAS index is that it (a) doesn't need an index file, (b) is memory-resident, (c) doesn't persist beyond the DATA step, and (d) is much faster.

Paul D.

novinosrin · Posted 07-24-2018 08:52 PM

@hashman Thanks *infinite repetitions. I am happy that you are chiming in here. Wish Ian Whit lock did that too, would be a spectacle

Anyways, are these topics covered at length in the book

https://communities.sas.com/t5/Community-Matters/Book-Data-Management-Solutions-Using-SAS-Hash-Table...

Just curious? plz

hashman · Posted 07-24-2018 10:51 PM

You're very welcome. And many thanks for the plug.

I sorely miss Ian's wisdom both in this space and on SAS-L (or rather on what's left of its former splendor).

Using hash tables for different unduplication scenarios is indeed covered in the book fairly extensively. However, it's impossible to account for all possible situations within the scope of one book. That's why a plan is underway by SAS Press for Don and me to cover many hash-related (and not directly hash-related but relevant to the code examples used in the book) subtopics in blog posts linked to the book on the author page.

The question raised in this thread has a curious logical twist deserving its own blog post, so I'm planning to make my answer here in one form or another into a separate blog entry there. Don has already been working on a few blog posts and I have, too. This way, the book, though having already been published, will continue to be a sort of a "living and breathing document".

Best regards

Paul D.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!