BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lydiawawa
Lapis Lazuli | Level 10

Hi,

 

I'm trying to convert proc sort nodupkey to proc sql with a counter of distinct rows defined by 2 variables.

 

Dataset:

Have

key  mode  time  x  y z

1       int        1    a  b  c

1       pho      2    d   a  b

2       int        1    c   v  a

3       pho      1   a   b  c

3       pho      2   s  d   e

 

Want

key  mode  time  x  y z

1       int        1    a  b  c

2       int        1    c   v  a

3       pho      1   a   b  c

 

 

For proc sort, I first sorted the dataset by key, mode and time

proc sort data = have; by key mode time; run;

Then to remove duplicates, I would like to take the earliest time, which has already been taken care of by the previous proc sort:

proc sort data = have out = want nodupkey; by key mode; run;

I need to convert this procedure to proc sql with a counter that counts the distinct combo of key and mode and produce the same output

 

This is what I have, but is doesn't generate same obs number as proc sort:

  proc sql;
  CREATE TABLE want AS
  SELECT *, COUNT(DISTINCT(key||mode)) AS counter FROM want GROUP BY key, mode;
quit;

The following will produce the same number of obs as proc sort, but it will only let me keep key and mode:

 

    proc sql;
  create table want as
  select key, mode, count(distinct key||mode) as counter from (select distinct * from have) group by key,mode; quit;

Grateful for any help!

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.

 

The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:

data have;
  input key  mode $  time  (x  y z) ($);
  cards;
1       int        1    a  b  c
1       pho      1    d   a  b
1       pho      2    dd   aa  bb
2       int        1    c   v  a
3       pho      1   a   b  c
3       pho      1   aa   bb  cc
3       pho      2   aaa  bbb   ccc
;
run;

proc sql;
 create table need as
   select *, monotonic() as count
     from have
       group by key, mode, time
 ;
 create table want (drop=count) as
   select *
     from need
       group by key, mode
         having count=min(count)
 ;
quit;

Art, CEO, AnalystFinder.com

 

View solution in original post

10 REPLIES 10
NewSASPerson
Quartz | Level 8

Your proc sort and your first sql code gave me the same results but based on your want this is what I have. Let me know if this works for you

 


data hAVE;
input key$  mode $  time  x $  y$ z $;
datalines;
1       int        1    a  b  c
1       pho      2    d   a  b
2       int        1    c   v  a
3       pho      1   a   b  c
3       pho      2   s  d   e
 
;
run;
proc sql;
create table want as
select
key,
mode,
time,
x,
y,
z,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;
lydiawawa
Lapis Lazuli | Level 10

The raw dataset has over 50 variables. Therefore, I cannot select by variable names, I have to use select *. After I change the code into:

 

proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key, mode
having time = min(time)
;
quit;

Or

 

proc sql;
create table want as
select
*,
count(*) as Counter
from have
group by key
having time = min(time)
;
quit;

 

 

The unduped outcome has more observation than proc sort.

art297
Opal | Level 21

The WANT you indicated in your initial post didn't match the one you would get from the two proc sorts you showed.

 

The following will, but uses an undocumented function, namely monotonic(), so I wouldn't suggest using it if this is production code. Also, while proc sql appears to read data sequentially, by definition there is no guarantee that it will always work that way:

data have;
  input key  mode $  time  (x  y z) ($);
  cards;
1       int        1    a  b  c
1       pho      1    d   a  b
1       pho      2    dd   aa  bb
2       int        1    c   v  a
3       pho      1   a   b  c
3       pho      1   aa   bb  cc
3       pho      2   aaa  bbb   ccc
;
run;

proc sql;
 create table need as
   select *, monotonic() as count
     from have
       group by key, mode, time
 ;
 create table want (drop=count) as
   select *
     from need
       group by key, mode
         having count=min(count)
 ;
quit;

Art, CEO, AnalystFinder.com

 

lydiawawa
Lapis Lazuli | Level 10
wow that worked like a magic.
lydiawawa
Lapis Lazuli | Level 10
One more question: how do I know the unduped records take the minimum/earliest time in every key and mode group?
art297
Opal | Level 21

That\s why I created file NEED .. so that you could see which records had the minimum count for each group.

 

lydiawawa
Lapis Lazuli | Level 10
Seems like monotonic() is very similar to _N_ and I think the order of time is probably taken care of by the group by statement?
art297
Opal | Level 21

Like I said, the monotonic() function isn't documented, so users have only guessed at how it works (see: https://communities.sas.com/t5/SAS-Communities-Library/MONOTONIC-function-in-PROC-SQL/ta-p/475752 ).

 

However, yes, I think the group statement would cause the function to work as you want it to. But, as it isn't documented, there are no guarantees.

 

Art, CEO, AnalystFinder.com

 

hashman
Ammonite | Level 13

@lydiawawa: Proc SQL is not exactly a tool suitable for record unduplication, especially in the situations where key ties have to be resolved based on which tied record comes first. That is why a sequential subkey must be added to the input, as @art297 has shown in his post using the MONOTONIC() function, if such a situation should occur. In the input below, this kind of circumstance is represented by by record #1 tied by (key,mode,time) with record #2 added to the sample input I've pilfered from @art297.

 

But all of the above, plus adding a counter of the unique (key,mod) values, can be easily handled in a single step using the hash object:

data have ;                                                                                                                             
  input key mode $ time (x y z) ($) ;                                                                                                   
  cards;                                                                                                                                
1  int   1  a    b    c                                                                                                                 
1  int   1  a1   b1   c1                                                                                                                
1  pho   1  d    a    b                                                                                                                 
1  pho   2  dd   aa   bb                                                                                                                
2  int   1  c    v    a                                                                                                                 
3  pho   1  a    b    c                                                                                                                 
3  pho   1  aa   bb   cc                                                                                                                
3  pho   2  aaa  bbb  ccc                                                                                                               
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data _null_ ;                                                                                                                           
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h (ordered:"a") ;                                                                                                          
    h.definekey ("key", "mode") ;                                                                                                       
    h.definedata ("key", "mode", "time", "x", "y", "z", "count") ;                                                                      
    h.definedone () ;                                                                                                                   
  end ;                                                                                                                                 
  set have end = lr ;                                                                                                                   
  _t = time ;                                                                                                                           
  if h.find() ne 0 then count = 1 ;                                                                                                     
  else                  count + 1 ;                                                                                                     
  if _t < time then time = _t ;                                                                                                         
  h.replace() ;                                                                                                                         
  if lr then h.output (dataset:"want") ;                                                                                                
run ;                  

Between the records 1 and 2, which cannot be deduped based on the time variable, record 1 is selected because it comes physically first. The output then looks as follows:

key    mode    time    x    y    z    count                                                                                             
-------------------------------------------                                                                                             
 1     int       1     a    b    c      2                                                                                               
 1     pho       1     d    a    b      2                                                                                               
 2     int       1     c    v    a      1                                                                                               
 3     pho       1     a    b    c      3 

Kind regards

Paul D.         

lydiawawa
Lapis Lazuli | Level 10
wow you guys are like saints. Thank you so much for helping me understand the ambiguities.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 6119 views
  • 2 likes
  • 4 in conversation