Hi,
I'm trying to convert proc sort nodupkey to proc sql with a counter of distinct rows defined by 2 variables.
Dataset:
Have
key mode time x y z
1 int 1 a b c
1 pho 2 d a b
2 int 1 c v a
3 pho 1 a b c
3 pho 2 s d e
Want
key mode time x y z
1 int 1 a b c
2 int 1 c v a
3 pho 1 a b c
For proc sort, I first sorted the dataset by key, mode and time
proc sort data = have; by key mode time; run;
Then to remove duplicates, I would like to take the earliest time, which has already been taken care of by the previous proc sort:
proc sort data = have out = want nodupkey; by key mode; run;
I need to convert this procedure to proc sql with a counter that counts the distinct combo of key and mode and produce the same output
This is what I have, but is doesn't generate same obs number as proc sort:
proc sql;
CREATE TABLE want AS
SELECT *, COUNT(DISTINCT(key||mode)) AS counter FROM want GROUP BY key, mode;
quit;
The following will produce the same number of obs as proc sort, but it will only let me keep key and mode:
proc sql;
create table want as
select key, mode, count(distinct key||mode) as counter from (select distinct * from have) group by key,mode; quit;
Grateful for any help!
The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.
The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:
data have;
input key mode $ time (x y z) ($);
cards;
1 int 1 a b c
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run;
proc sql;
create table need as
select *, monotonic() as count
from have
group by key, mode, time
;
create table want (drop=count) as
select *
from need
group by key, mode
having count=min(count)
;
quit;
Art, CEO, AnalystFinder.com
Your proc sort and your first sql code gave me the same results but based on your want this is what I have. Let me know if this works for you
The raw dataset has over 50 variables. Therefore, I cannot select by variable names, I have to use select *. After I change the code into:
proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key, mode
having time = min(time)
;
quit;
Or
proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;
The unduped outcome has more observation than proc sort.
The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.
The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:
data have;
input key mode $ time (x y z) ($);
cards;
1 int 1 a b c
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run;
proc sql;
create table need as
select *, monotonic() as count
from have
group by key, mode, time
;
create table want (drop=count) as
select *
from need
group by key, mode
having count=min(count)
;
quit;
Art, CEO, AnalystFinder.com
That\s why I created file NEED .. so that you could see which records had the minimum count for each group.
Like I said, the monotonic() function isn't documented, so users have only guessed at how it works (see: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752 ).
However, yes, I think the group statement would cause the function to work as you want it to. But, as it isn't documented, there are no guarantees.
Art, CEO, AnalystFinder.com
@lydiawawa: Proc SQL is not exactly a tool suitable for record unduplication, especially in the situations where key ties have to be resolved based on which tied record comes first. That is why a sequential subkey must be added to the input, as @art297 has shown in his post using the MONOTONIC() function, if such a situation should occur. In the input below, this kind of circumstance is represented by by record #1 tied by (key,mode,time) with record #2 added to the sample input I've pilfered from @art297.
But all of the above, plus adding a counter of the unique (key,mod) values, can be easily handled in a single step using the hash object:
data have ;
input key mode $ time (x y z) ($) ;
cards;
1 int 1 a b c
1 int 1 a1 b1 c1
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run ;
data _null_ ;
if _n_ = 1 then do ;
dcl hash h (ordered:"a") ;
h.definekey ("key", "mode") ;
h.definedata ("key", "mode", "time", "x", "y", "z", "count") ;
h.definedone () ;
end ;
set have end = lr ;
_t = time ;
if h.find() ne 0 then count = 1 ;
else count + 1 ;
if _t < time then time = _t ;
h.replace() ;
if lr then h.output (dataset:"want") ;
run ;
Between the records 1 and 2, which cannot be deduped based on the time variable, record 1 is selected because it comes physically first. The output then looks as follows:
key mode time x y z count ------------------------------------------- 1 int 1 a b c 2 1 pho 1 d a b 2 2 int 1 c v a 1 3 pho 1 a b c 3
Kind regards
Paul D.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.