- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I'm trying to convert proc sort nodupkey to proc sql with a counter of distinct rows defined by 2 variables.
Dataset:
Have
key mode time x y z
1 int 1 a b c
1 pho 2 d a b
2 int 1 c v a
3 pho 1 a b c
3 pho 2 s d e
Want
key mode time x y z
1 int 1 a b c
2 int 1 c v a
3 pho 1 a b c
For proc sort, I first sorted the dataset by key, mode and time
proc sort data = have; by key mode time; run;
Then to remove duplicates, I would like to take the earliest time, which has already been taken care of by the previous proc sort:
proc sort data = have out = want nodupkey; by key mode; run;
I need to convert this procedure to proc sql with a counter that counts the distinct combo of key and mode and produce the same output
This is what I have, but is doesn't generate same obs number as proc sort:
proc sql;
CREATE TABLE want AS
SELECT *, COUNT(DISTINCT(key||mode)) AS counter FROM want GROUP BY key, mode;
quit;
The following will produce the same number of obs as proc sort, but it will only let me keep key and mode:
proc sql;
create table want as
select key, mode, count(distinct key||mode) as counter from (select distinct * from have) group by key,mode; quit;
Grateful for any help!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.
The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:
data have;
input key mode $ time (x y z) ($);
cards;
1 int 1 a b c
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run;
proc sql;
create table need as
select *, monotonic() as count
from have
group by key, mode, time
;
create table want (drop=count) as
select *
from need
group by key, mode
having count=min(count)
;
quit;
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Your proc sort and your first sql code gave me the same results but based on your want this is what I have. Let me know if this works for you
data hAVE;
input key$ mode $ time x $ y$ z $;
datalines;
1 pho 2 d a b
2 int 1 c v a
3 pho 1 a b c
3 pho 2 s d e
;
run;
create table want as
select
key,
mode,
time,
x,
y,
z,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The raw dataset has over 50 variables. Therefore, I cannot select by variable names, I have to use select *. After I change the code into:
proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key, mode
having time = min(time)
;
quit;
Or
proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;
The unduped outcome has more observation than proc sort.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.
The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:
data have;
input key mode $ time (x y z) ($);
cards;
1 int 1 a b c
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run;
proc sql;
create table need as
select *, monotonic() as count
from have
group by key, mode, time
;
create table want (drop=count) as
select *
from need
group by key, mode
having count=min(count)
;
quit;
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That\s why I created file NEED .. so that you could see which records had the minimum count for each group.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Like I said, the monotonic() function isn't documented, so users have only guessed at how it works (see: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752 ).
However, yes, I think the group statement would cause the function to work as you want it to. But, as it isn't documented, there are no guarantees.
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@lydiawawa: Proc SQL is not exactly a tool suitable for record unduplication, especially in the situations where key ties have to be resolved based on which tied record comes first. That is why a sequential subkey must be added to the input, as @art297 has shown in his post using the MONOTONIC() function, if such a situation should occur. In the input below, this kind of circumstance is represented by by record #1 tied by (key,mode,time) with record #2 added to the sample input I've pilfered from @art297.
But all of the above, plus adding a counter of the unique (key,mod) values, can be easily handled in a single step using the hash object:
data have ;
input key mode $ time (x y z) ($) ;
cards;
1 int 1 a b c
1 int 1 a1 b1 c1
1 pho 1 d a b
1 pho 2 dd aa bb
2 int 1 c v a
3 pho 1 a b c
3 pho 1 aa bb cc
3 pho 2 aaa bbb ccc
;
run ;
data _null_ ;
if _n_ = 1 then do ;
dcl hash h (ordered:"a") ;
h.definekey ("key", "mode") ;
h.definedata ("key", "mode", "time", "x", "y", "z", "count") ;
h.definedone () ;
end ;
set have end = lr ;
_t = time ;
if h.find() ne 0 then count = 1 ;
else count + 1 ;
if _t < time then time = _t ;
h.replace() ;
if lr then h.output (dataset:"want") ;
run ;
Between the records 1 and 2, which cannot be deduped based on the time variable, record 1 is selected because it comes physically first. The output then looks as follows:
key mode time x y z count ------------------------------------------- 1 int 1 a b c 2 1 pho 1 d a b 2 2 int 1 c v a 1 3 pho 1 a b c 3
Kind regards
Paul D.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content