Hi,
I have a dataset with duplicates that I want to remove.
This is the dataset I have
ID A B C
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
And I would like to remove duplicates to look like
ID A B C
1 1 1 0
2 1 1 1
I tried the code below
proc sort data=have nodupkey out=want;
by id a b c;
run;
The above code did not remove duplicates. Any other suggestions?
Thanks
Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.
something like this
data have;
input
ID A B C;
datalines;
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
;
run;
proc sql;
create table want as
select ID, max(A) as A, max(B) as B, max(c) as C
from have
group by ID;
quit;
Thanks. It worked very nicely. But is it possible to do the same with data step? I am not familiar with Proc SQL.
(Just showing one variable)
After sorting by id (which proc sql will do on its own):
data want;
set have (rename=(a=_a));
by id;
retain a;
if first.id then a = .;
a = max(_a,a);
drop _a;
run;
Hi Kurt,
I want to go from table 1 to table 2:
PRI | Col2 | col3 | col4 |
1 | a | a | a |
1 | b | b | b |
2 | c | c | c |
3 | d | d | d |
4 | e | e | e |
5 | f | f | f |
5 | g | g | g |
6 | h | h | h |
7 | i | i | i |
7 | j | j | j |
7 | k | k | k |
PRI | Col2 | col3 | col4 |
1 | b | b | b |
2 | c | c | c |
3 | d | d | d |
4 | e | e | e |
5 | g | g | g |
6 | h | h | h |
7 | j | j | j |
please ignore this table:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
basically, I want to remove duplicate PRIs, it does not matter ,at this point, what other columns have.
I wrote this:
DATA nodups;
set selfID3;
by PRI;
if (first.PRI and last.PRI) then output;
run;
but it seems that the code gives me this as output:
PRI | Col2 | col3 | col4 |
2 | c | c | c |
3 | d | d | d |
4 | e | e | e |
6 | h | h | h |
basically, it removes any duplicate records.
how can I fix my code?
Thanks,
Nazanin
what would be the difference between
Last.PRI
or
First.PRI
Thanks,
Nazanin
Yes, it can be done, just with a bit more programming fuss than using SQL:
data have ;
input ID A B C ;
cards ;
1 1 0 0
1 0 1 0
1 1 1 0
2 0 1 1
2 1 0 1
run ;
data want_by (drop = _:) ;
do until (last.ID) ;
set have ;
by ID ;
array vv A B C ;
array mm _m1-_m3 ;
do over vv ;
mm = mm <> vv ;
end ;
end ;
do over vv ;
vv = mm ;
end ;
run ;
However, as @Kurt_Bremser has pointed out, it requires HAVE to be sorted or grouped by ID. If it's not sorted, it still can be done but needs a hash table to accumulate the maxima - and still more programming fuss:
data _null_ ;
dcl hash max (hashexp:0) ;
max.definekey ("ID") ;
max.definedata ("ID", "A", "B", "C") ;
max.definedone () ;
do until (last) ;
set have (rename=(A=m1 B=m2 C=m3)) end = last
array vv A B C ;
array mm m1 - m3 ;
rc = max.find() ;
do over vv ;
if rc ne 0 then vv = mm ;
else vv = mm <> vv ;
end ;
max.replace() ;
end ;
max.output (dataset:"want") ;
run ;
Note that HASHEXP:0 directs the hash object to use a single (2**0=1) AVL tree. It has two effects:
(1) Keeps the table's base structure leaner (though it results in a slower search - but it's not a concern here).
(2) Results in the table being ordered by ID without coding the ORDERED:"A" argument tag. This is because with a single tree, the hash function has nothing else to point at, and the keys in an AVL tree are always in order.
HTH
Paul D.
Kudos to an SQL head. Methinks, though, that in this case proc MEANS is just the tool since with it, there's no need to list the variables:
proc means nway noprint data = have ;
class ID ;
output out = want (drop = _:) max= ;
run ;
Best
Paul D.
Hi,
Data step solution:
proc stdize data=have outstat=stat(where=(_type_='SCALE')) method=maxabs;
by id;
run;
data want(drop=_type_ i);
set stat;
array v(*) a b c;
do i=1 to dim(v);
if v(i) = . then v(i)=0;
end;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.