Solved: Re: Convert proc sort nodupkey to proc sql

lydiawawa · Posted 10-30-2019 06:02 PM

Hi,

I'm trying to convert proc sort nodupkey to proc sql with a counter of distinct rows defined by 2 variables.

Dataset:

Have

key mode time x y z

1 int 1 a b c

1 pho 2 d a b

2 int 1 c v a

3 pho 1 a b c

3 pho 2 s d e

Want

key mode time x y z

1 int 1 a b c

2 int 1 c v a

3 pho 1 a b c

For proc sort, I first sorted the dataset by key, mode and time

proc sort data = have; by key mode time; run;

Then to remove duplicates, I would like to take the earliest time, which has already been taken care of by the previous proc sort:

proc sort data = have out = want nodupkey; by key mode; run;

I need to convert this procedure to proc sql with a counter that counts the distinct combo of key and mode and produce the same output

This is what I have, but is doesn't generate same obs number as proc sort:

  proc sql;
  CREATE TABLE want AS
  SELECT *, COUNT(DISTINCT(key||mode)) AS counter FROM want GROUP BY key, mode;
quit;

The following will produce the same number of obs as proc sort, but it will only let me keep key and mode:

    proc sql;
  create table want as
  select key, mode, count(distinct key||mode) as counter from (select distinct * from have) group by key,mode; quit;

Grateful for any help!

art297 · Posted 10-30-2019 08:31 PM

The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.

The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:

data have;
  input key  mode $  time  (x  y z) ($);
  cards;
1       int        1    a  b  c
1       pho      1    d   a  b
1       pho      2    dd   aa  bb
2       int        1    c   v  a
3       pho      1   a   b  c
3       pho      1   aa   bb  cc
3       pho      2   aaa  bbb   ccc
;
run;

proc sql;
 create table need as
   select *, monotonic() as count
     from have
       group by key, mode, time
 ;
 create table want (drop=count) as
   select *
     from need
       group by key, mode
         having count=min(count)
 ;
quit;

Art, CEO, AnalystFinder.com

View solution in original post

NewSASPerson · Posted 10-30-2019 07:12 PM

Your proc sort and your first sql code gave me the same results but based on your want this is what I have. Let me know if this works for you

data hAVE;
input key$ mode $ time x $ y$ z $;
datalines;

1       int        1    a b c
1       pho      2    d   a b
2       int        1    c   v a
3       pho      1   a   b c
3       pho      2   s d   e

;
run;

proc sql;
create table want as
select
key,
mode,
time,
x,
y,
z,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;

lydiawawa · Posted 10-30-2019 08:26 PM

The raw dataset has over 50 variables. Therefore, I cannot select by variable names, I have to use select *. After I change the code into:

proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key, mode
having time = min(time)
;

quit;

Or

proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;

The unduped outcome has more observation than proc sort.

art297 · Posted 10-30-2019 08:31 PM

The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.

The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:

data have;
  input key  mode $  time  (x  y z) ($);
  cards;
1       int        1    a  b  c
1       pho      1    d   a  b
1       pho      2    dd   aa  bb
2       int        1    c   v  a
3       pho      1   a   b  c
3       pho      1   aa   bb  cc
3       pho      2   aaa  bbb   ccc
;
run;

proc sql;
 create table need as
   select *, monotonic() as count
     from have
       group by key, mode, time
 ;
 create table want (drop=count) as
   select *
     from need
       group by key, mode
         having count=min(count)
 ;
quit;

Art, CEO, AnalystFinder.com

lydiawawa · Posted 10-30-2019 08:38 PM

wow that worked like a magic.

lydiawawa · Posted 10-30-2019 08:44 PM

One more question: how do I know the unduped records take the minimum/earliest time in every key and mode group?

art297 · Posted 10-30-2019 08:55 PM

That\s why I created file NEED .. so that you could see which records had the minimum count for each group.

lydiawawa · Posted 10-30-2019 08:58 PM

Seems like monotonic() is very similar to _N_ and I think the order of time is probably taken care of by the group by statement?

art297 · Posted 10-31-2019 11:46 AM

Like I said, the monotonic() function isn't documented, so users have only guessed at how it works (see: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752 ).

However, yes, I think the group statement would cause the function to work as you want it to. But, as it isn't documented, there are no guarantees.

Art, CEO, AnalystFinder.com

hashman · Posted 10-31-2019 03:02 PM

@lydiawawa: Proc SQL is not exactly a tool suitable for record unduplication, especially in the situations where key ties have to be resolved based on which tied record comes first. That is why a sequential subkey must be added to the input, as @art297 has shown in his post using the MONOTONIC() function, if such a situation should occur. In the input below, this kind of circumstance is represented by by record #1 tied by (key,mode,time) with record #2 added to the sample input I've pilfered from @art297.

But all of the above, plus adding a counter of the unique (key,mod) values, can be easily handled in a single step using the hash object:

data have ;                                                                                                                             
  input key mode $ time (x y z) ($) ;                                                                                                   
  cards;                                                                                                                                
1  int   1  a    b    c                                                                                                                 
1  int   1  a1   b1   c1                                                                                                                
1  pho   1  d    a    b                                                                                                                 
1  pho   2  dd   aa   bb                                                                                                                
2  int   1  c    v    a                                                                                                                 
3  pho   1  a    b    c                                                                                                                 
3  pho   1  aa   bb   cc                                                                                                                
3  pho   2  aaa  bbb  ccc                                                                                                               
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (ordered:"a") ;                                                                                                          
    h.definekey ("key", "mode") ;                                                                                                       
    h.definedata ("key", "mode", "time", "x", "y", "z", "count") ;                                                                      
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have end = lr ;                                                                                                                   
  _t = time ;                                                                                                                           
  if h.find() ne 0 then count = 1 ;                                                                                                     
  else                  count + 1 ;                                                                                                     
  if _t < time then time = _t ;                                                                                                         
  h.replace() ;                                                                                                                         
  if lr then h.output (dataset:"want") ;                                                                                                
run ;

Between the records 1 and 2, which cannot be deduped based on the time variable, record 1 is selected because it comes physically first. The output then looks as follows:

key    mode    time    x    y    z    count                                                                                             
-------------------------------------------                                                                                             
 1     int       1     a    b    c      2                                                                                               
 1     pho       1     d    a    b      2                                                                                               
 2     int       1     c    v    a      1                                                                                               
 3     pho       1     a    b    c      3

Kind regards

Paul D.

lydiawawa · Posted 10-31-2019 03:09 PM

wow you guys are like saints. Thank you so much for helping me understand the ambiguities.

SAS Innovate 2025: Call for Content

Classroom Training Available!