DATA Step, Macro, Functions and more

removing duplicates

Accepted Solution Solved
Reply
Contributor
Posts: 29
Accepted Solution

removing duplicates

Hi,

I have a dataset with duplicates that I want to remove.

 

This is the dataset I have

ID      A     B        C

1       1     0         0

1       0     1         0

1       1     1         0

2       0      1         1

2       1      0         1

 

 And I would like to remove duplicates to look like

ID   A      B        C

1      1     1        0

2      1     1        1

 

I tried the code below

proc sort data=have nodupkey out=want;

by id a b c;

run;

 

The above code did not remove duplicates. Any other suggestions?

 

Thanks

 


Accepted Solutions
Solution
‎11-26-2017 01:42 PM
Contributor
Posts: 29

Re: removing duplicates

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

View solution in original post


All Replies
PROC Star
Posts: 549

Re: removing duplicates

something like this

 

data have;
input
ID A B C;
datalines;
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
;
run;


proc sql;
create table want as
select ID, max(A) as A, max(B) as B, max(c) as C
from have
group by ID;
quit;

Solution
‎11-26-2017 01:42 PM
Contributor
Posts: 29

Re: removing duplicates

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

Super User
Posts: 10,616

Re: removing duplicates

(Just showing one variable)

After sorting by id (which proc sql will do on its own):

data want;
set have (rename=(a=_a));
by id;
retain a;
if first.id then a = .;
a = max(_a,a);
drop _a;
run;
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 24

Re: removing duplicates

Posted in reply to KurtBremser

Hi Kurt,

 

I want to go from table 1 to table 2:

 

PRI

Col2

col3

col4

1

a

a

a

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

f

f

f

5

g

g

g

6

h

h

h

7

i

i

i

7

j

j

j

7

k

k

k

 

PRI

Col2

col3

col4

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

g

g

g

6

h

h

h

7

j

j

j

 

 

 

please ignore this table:

PRI

Col2

col3

col4

1

a

a

a

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

g

g

g

6

h

h

h

7

j

j

j

 

basically, I want to remove duplicate PRIs, it does not matter ,at this point, what other columns have.

 

I wrote this:

DATA nodups;

set selfID3;

by PRI;

if (first.PRI and last.PRI) then output;

run;

 

but it seems that the code gives me this as output:

PRI

Col2

col3

col4

2

c

c

c

3

d

d

d

4

e

e

e

6

h

h

h

basically, it removes any duplicate records.

how can I fix my code?

 

Thanks,

 

Nazanin

Super User
Posts: 10,616

Re: removing duplicates

Posted in reply to NazaninSAS

Use only

if last.pri

in the condition.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 24

Re: removing duplicates

Posted in reply to KurtBremser

what would be the difference between

Last.PRI

or

First.PRI

 

Thanks,

 

Nazanin

Super User
Posts: 10,616

Re: removing duplicates

Posted in reply to NazaninSAS

This is covered in the documentation of the by statement.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Frequent Contributor
Posts: 112

Re: removing duplicates

Posted in reply to KurtBremser

@KurtBremser:

 

IF LAST got lost somewhere in the woods Smiley Wink.

 

Paul D.

Frequent Contributor
Posts: 112

Re: removing duplicates

@leahcho:

 

Yes, it can be done, just with a bit more programming fuss than using SQL:

 

data have ;               
  input ID A B C ;        
  cards ;                 
1 1 0 0                   
1 0 1 0                   
1 1 1 0                   
2 0 1 1                   
2 1 0 1                   
run ;                     
                          
data want_by (drop = _:) ;
  do until (last.ID) ;    
    set have ;            
    by ID ;               
    array vv  A B C ;     
    array mm _m1-_m3 ;    
    do over vv ;          
      mm = mm <> vv ;     
    end ;                 
  end ;                   
  do over vv ;            
    vv = mm ;             
  end ;                   
run ;                     

However, as @KurtBremser has pointed out, it requires HAVE to be sorted or grouped by ID. If it's not sorted, it still can be done but needs a hash table to accumulate the maxima - and still more programming fuss:

 

data _null_ ;                                     
  dcl hash max (hashexp:0) ;                      
  max.definekey ("ID") ;                          
  max.definedata ("ID", "A", "B", "C") ;          
  max.definedone () ;                             
  do until (last) ;                               
    set have (rename=(A=m1 B=m2 C=m3)) end = last 
    array  vv  A  B  C ;                          
    array  mm  m1 - m3 ;                          
    rc = max.find() ;                             
    do over vv ;                                  
      if rc ne 0 then vv = mm ;                   
      else vv = mm <> vv ;                        
    end ;                                         
    max.replace() ;                               
  end ;                                           
  max.output (dataset:"want") ;                   
run ;                                             

Note that HASHEXP:0 directs the hash object to use a single (2**0=1) AVL tree. It has two effects:

 

(1) Keeps the table's base structure leaner (though it results in a slower search - but it's not a concern here).

(2) Results in the table being ordered by ID without coding the ORDERED:"A" argument tag. This is because with a single tree, the hash function has nothing else to point at, and the keys in an AVL tree are always in order.

 

HTH

Paul D.     

Frequent Contributor
Posts: 112

Re: removing duplicates

@kiranv_:

 

Kudos to an SQL head. Methinks, though, that in this case proc MEANS is just the tool since with it, there's no need to list the variables:

 

proc means nway noprint data = have ; 
  class ID ;                          
  output out = want (drop = _:) max= ;
run ;                                 

Best

Paul D.

Trusted Advisor
Posts: 1,270

Re: removing duplicates

Hi,

 

Data step solution:

 

proc stdize data=have outstat=stat(where=(_type_='SCALE')) method=maxabs;
by id;
run;

 

data want(drop=_type_ i);
set stat;
array v(*) a b c;
do i=1 to dim(v);
if v(i) = . then v(i)=0;
end;
run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 403 views
  • 3 likes
  • 6 in conversation