BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
leahcho
Obsidian | Level 7

Hi,

I have a dataset with duplicates that I want to remove.

 

This is the dataset I have

ID      A     B        C

1       1     0         0

1       0     1         0

1       1     1         0

2       0      1         1

2       1      0         1

 

 And I would like to remove duplicates to look like

ID   A      B        C

1      1     1        0

2      1     1        1

 

I tried the code below

proc sort data=have nodupkey out=want;

by id a b c;

run;

 

The above code did not remove duplicates. Any other suggestions?

 

Thanks

 

1 ACCEPTED SOLUTION

Accepted Solutions
leahcho
Obsidian | Level 7

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

View solution in original post

11 REPLIES 11
kiranv_
Rhodochrosite | Level 12

something like this

 

data have;
input
ID A B C;
datalines;
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
;
run;


proc sql;
create table want as
select ID, max(A) as A, max(B) as B, max(c) as C
from have
group by ID;
quit;

leahcho
Obsidian | Level 7

Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.

Kurt_Bremser
Super User

(Just showing one variable)

After sorting by id (which proc sql will do on its own):

data want;
set have (rename=(a=_a));
by id;
retain a;
if first.id then a = .;
a = max(_a,a);
drop _a;
run;
NazaninSAS
Quartz | Level 8

Hi Kurt,

 

I want to go from table 1 to table 2:

 

PRI

Col2

col3

col4

1

a

a

a

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

f

f

f

5

g

g

g

6

h

h

h

7

i

i

i

7

j

j

j

7

k

k

k

 

PRI

Col2

col3

col4

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

g

g

g

6

h

h

h

7

j

j

j

 

 

 

please ignore this table:

PRI

Col2

col3

col4

1

a

a

a

1

b

b

b

2

c

c

c

3

d

d

d

4

e

e

e

5

g

g

g

6

h

h

h

7

j

j

j

 

basically, I want to remove duplicate PRIs, it does not matter ,at this point, what other columns have.

 

I wrote this:

DATA nodups;

set selfID3;

by PRI;

if (first.PRI and last.PRI) then output;

run;

 

but it seems that the code gives me this as output:

PRI

Col2

col3

col4

2

c

c

c

3

d

d

d

4

e

e

e

6

h

h

h

basically, it removes any duplicate records.

how can I fix my code?

 

Thanks,

 

Nazanin

NazaninSAS
Quartz | Level 8

what would be the difference between

Last.PRI

or

First.PRI

 

Thanks,

 

Nazanin

hashman
Ammonite | Level 13

@Kurt_Bremser:

 

IF LAST got lost somewhere in the woods ;).

 

Paul D.

hashman
Ammonite | Level 13

@leahcho:

 

Yes, it can be done, just with a bit more programming fuss than using SQL:

 

data have ;               
  input ID A B C ;        
  cards ;                 
1 1 0 0                   
1 0 1 0                   
1 1 1 0                   
2 0 1 1                   
2 1 0 1                   
run ;                     
                          
data want_by (drop = _:) ;
  do until (last.ID) ;    
    set have ;            
    by ID ;               
    array vv  A B C ;     
    array mm _m1-_m3 ;    
    do over vv ;          
      mm = mm <> vv ;     
    end ;                 
  end ;                   
  do over vv ;            
    vv = mm ;             
  end ;                   
run ;                     

However, as @Kurt_Bremser has pointed out, it requires HAVE to be sorted or grouped by ID. If it's not sorted, it still can be done but needs a hash table to accumulate the maxima - and still more programming fuss:

 

data _null_ ;                                     
  dcl hash max (hashexp:0) ;                      
  max.definekey ("ID") ;                          
  max.definedata ("ID", "A", "B", "C") ;          
  max.definedone () ;                             
  do until (last) ;                               
    set have (rename=(A=m1 B=m2 C=m3)) end = last 
    array  vv  A  B  C ;                          
    array  mm  m1 - m3 ;                          
    rc = max.find() ;                             
    do over vv ;                                  
      if rc ne 0 then vv = mm ;                   
      else vv = mm <> vv ;                        
    end ;                                         
    max.replace() ;                               
  end ;                                           
  max.output (dataset:"want") ;                   
run ;                                             

Note that HASHEXP:0 directs the hash object to use a single (2**0=1) AVL tree. It has two effects:

 

(1) Keeps the table's base structure leaner (though it results in a slower search - but it's not a concern here).

(2) Results in the table being ordered by ID without coding the ORDERED:"A" argument tag. This is because with a single tree, the hash function has nothing else to point at, and the keys in an AVL tree are always in order.

 

HTH

Paul D.     

hashman
Ammonite | Level 13

@kiranv_:

 

Kudos to an SQL head. Methinks, though, that in this case proc MEANS is just the tool since with it, there's no need to list the variables:

 

proc means nway noprint data = have ; 
  class ID ;                          
  output out = want (drop = _:) max= ;
run ;                                 

Best

Paul D.

stat_sas
Ammonite | Level 13

Hi,

 

Data step solution:

 

proc stdize data=have outstat=stat(where=(_type_='SCALE')) method=maxabs;
by id;
run;

 

data want(drop=_type_ i);
set stat;
array v(*) a b c;
do i=1 to dim(v);
if v(i) = . then v(i)=0;
end;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1310 views
  • 3 likes
  • 6 in conversation