Solved: Re: removing duplicates

leahcho · Posted 11-25-2017 09:42 PM

Hi,

I have a dataset with duplicates that I want to remove.

This is the dataset I have

ID A B C

1 1 0 0

1 0 1 0

1 1 1 0

2 0 1 1

2 1 0 1

And I would like to remove duplicates to look like

ID A B C

1 1 1 0

2 1 1 1

I tried the code below

proc sort data=have nodupkey out=want;

by id a b c;

run;

The above code did not remove duplicates. Any other suggestions?

Thanks

leahcho · Posted 11-26-2017 01:42 PM

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

View solution in original post

kiranv_ · Posted 11-25-2017 10:47 PM

something like this

data have;
input
ID A B C;
datalines;
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
;
run;

proc sql;
create table want as
select ID, max(A) as A, max(B) as B, max(c) as C
from have
group by ID;
quit;

leahcho · Posted 11-26-2017 01:42 PM

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

Kurt_Bremser · Posted 11-26-2017 02:47 PM

(Just showing one variable)

After sorting by id (which proc sql will do on its own):

data want;
set have (rename=(a=_a));
by id;
retain a;
if first.id then a = .;
a = max(_a,a);
drop _a;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

NazaninSAS · Posted 08-03-2018 12:17 PM

Hi Kurt,

I want to go from table 1 to table 2:

PRI	Col2	col3	col4
1	a	a	a
1	b	b	b
2	c	c	c
3	d	d	d
4	e	e	e
5	f	f	f
5	g	g	g
6	h	h	h
7	i	i	i
7	j	j	j
7	k	k	k

PRI	Col2	col3	col4
1	b	b	b
2	c	c	c
3	d	d	d
4	e	e	e
5	g	g	g
6	h	h	h
7	j	j	j

~~please ignore this table:~~

~~PRI~~	~~Col2~~	~~col3~~	~~col4~~
1	a	a	a
1	b	b	b
2	c	c	c
3	d	d	d
4	e	e	e
5	g	g	g
6	h	h	h
7	j	j	j

basically, I want to remove duplicate PRIs, it does not matter ,at this point, what other columns have.

I wrote this:

DATA nodups;

set selfID3;

by PRI;

if (first.PRI and last.PRI) then output;

run;

but it seems that the code gives me this as output:

PRI	Col2	col3	col4
2	c	c	c
3	d	d	d
4	e	e	e
6	h	h	h

basically, it removes any duplicate records.

how can I fix my code?

Thanks,

Nazanin

Kurt_Bremser · Posted 08-03-2018 12:24 PM

Use only

if last.pri

in the condition.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

NazaninSAS · Posted 08-03-2018 01:08 PM

what would be the difference between

Last.PRI

or

First.PRI

Thanks,

Nazanin

Kurt_Bremser · Posted 08-03-2018 01:37 PM

This is covered in the documentation of the by statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

hashman · Posted 08-03-2018 07:26 PM

@Kurt_Bremser:

IF LAST got lost somewhere in the woods ;).

Paul D.

hashman · Posted 08-03-2018 08:12 PM

@leahcho:

Yes, it can be done, just with a bit more programming fuss than using SQL:

data have ;               
  input ID A B C ;        
  cards ;                 
1 1 0 0                   
1 0 1 0                   
1 1 1 0                   
2 0 1 1                   
2 1 0 1                   
run ;                     
                          
data want_by (drop = _:) ;
  do until (last.ID) ;    
    set have ;            
    by ID ;               
    array vv  A B C ;     
    array mm _m1-_m3 ;    
    do over vv ;          
      mm = mm <> vv ;     
    end ;                 
  end ;                   
  do over vv ;            
    vv = mm ;             
  end ;                   
run ;

However, as @Kurt_Bremser has pointed out, it requires HAVE to be sorted or grouped by ID. If it's not sorted, it still can be done but needs a hash table to accumulate the maxima - and still more programming fuss:

data _null_ ;                                     
  dcl hash max (hashexp:0) ;                      
  max.definekey ("ID") ;                          
  max.definedata ("ID", "A", "B", "C") ;          
  max.definedone () ;                             
  do until (last) ;                               
    set have (rename=(A=m1 B=m2 C=m3)) end = last 
    array  vv  A  B  C ;                          
    array  mm  m1 - m3 ;                          
    rc = max.find() ;                             
    do over vv ;                                  
      if rc ne 0 then vv = mm ;                   
      else vv = mm <> vv ;                        
    end ;                                         
    max.replace() ;                               
  end ;                                           
  max.output (dataset:"want") ;                   
run ;

Note that HASHEXP:0 directs the hash object to use a single (2**0=1) AVL tree. It has two effects:

(1) Keeps the table's base structure leaner (though it results in a slower search - but it's not a concern here).

(2) Results in the table being ordered by ID without coding the ORDERED:"A" argument tag. This is because with a single tree, the hash function has nothing else to point at, and the keys in an AVL tree are always in order.

HTH

Paul D.

hashman · Posted 08-03-2018 07:31 PM

@kiranv_:

Kudos to an SQL head. Methinks, though, that in this case proc MEANS is just the tool since with it, there's no need to list the variables:

proc means nway noprint data = have ; 
  class ID ;                          
  output out = want (drop = _:) max= ;
run ;

Best

Paul D.

stat_sas · Posted 11-26-2017 07:46 PM

Hi,

Data step solution:

proc stdize data=have outstat=stat(where=(_type_='SCALE')) method=maxabs;
by id;
run;

data want(drop=_type_ i);
set stat;
array v(*) a b c;
do i=1 to dim(v);
if v(i) = . then v(i)=0;
end;
run;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away