topic What's the better way to flag highest and lowest observations using proc sql in SAS Procedures

What's the better way to flag highest and lowest observations using proc sql

kisumsam — Sat, 31 Dec 2016 19:00:02 GMT

Hello,

I'm trying to find a way to use proc sql to flag highest or lowest observations similar to the way we do it in data step:

proc sort data=sashelp.shoes out=shoes;
by region product sales;
run;

data shoes2;
set shoes;
by region product sales;
if first.product then flag = 1;
else if last.product then flag = 2;
run;

I have come up with this solution:

proc sql;
select *, min(sales) as min, max(sales) as max,
case 
	when sales = calculated min then 1
	when sales = calculated max then 2
	else .
end as flag
from sashelp.shoes
group by region, product
order by region, product, sales;
quit;

It does the job. I'm just wondering if there is a more efficient way to do these kind of flagging?

The proc sql code I have above requires reemerging and I'm not sure if this is the best way to do this task.

Can anyone offer any suggestions? thanks.

Re: What's the better way to flag highest and lowest observations using proc sql

LaurieF — Sat, 31 Dec 2016 20:39:31 GMT

This is one of those areas, as you are obviously aware, where a data step is superior: data steps think globally but acts locally; SQL only acts globally.

By extending your code to use a subjoin (or creating a summarised view) and effectively doing a manual re-merge, the optimiser will probably do a better job. By using the magic keyword and value of 103, I've forced it to do a hash join which, on a really big dataset may make things run faster.

proc sql magic=103;
create table firstlast as
   select shoes.*,
          sum.min, 
          sum.max,
          case
             when sales = sum.min 
                then 1
             when sales = sum.max
                then 2
             else .
             end as flag
     from sashelp.shoes as shoes
    inner join (select region,
                       product,
                       min(sales) as min,
                       max(sales) as max
                  from sashelp.shoes
                 group by region,
                          product) as sum
        on shoes.region = sum.region
       and shoes.product = sum.product
     order by shoes.region,
              shoes.product,
              shoes.sales;
quit;

Re: What's the better way to flag highest and lowest observations using proc sql

kisumsam — Sat, 31 Dec 2016 21:11:31 GMT

Thanks LaurieF. I tried using a similar join as well but the code seems to be overly complicated while not reducing much of the CPU/Real time.

Good to know data step is superior is this case. Thanks!

Re: What's the better way to flag highest and lowest observations using proc sql

LaurieF — Sat, 31 Dec 2016 21:15:34 GMT

Yeah - I thought as much. SAS SQL just doesn't have the facility to do it. I think you can do something similar in Oracle SQL, but it still requires a fair amount of code.

Re: What's the better way to flag highest and lowest observations using proc sql

Reeza — Sun, 01 Jan 2017 00:54:23 GMT

If you have pass through access to your database, the latest versions of SQL (ORACLE/DB2) have either cursor logic or WINDOW/PIVOT technology that helps with this somewhat. It's still more code and I don't find it as clear as a data step.

Re: What's the better way to flag highest and lowest observations using proc sql

mkeintz — Sun, 01 Jan 2017 22:49:23 GMT

If it's a big flle and you use the data step approach, I would PROC SUMMARY instead of PROC SORT - much less input/output, ... and it preserves the original record order. Then use a hash lookup to set flag values:

proc summary data=sashelp.shoes (keep=region product sales) noprint nway;
  class region product;
  output out=need (drop=_type_ _freq_) min(sales)=minsale max(sales)=maxsale;
run;

data shoes2 (drop=rc minsale maxsale);
  set sashelp.shoes end=eod;
  flag=.;
  if _n_=1 then do;
    if 0 then set need;
    declare hash h (dataset:'need' );
      h.definekey('region','product');
      h.definedata('minsale','maxsale');
      h.definedone();
  end;
  rc=h.find();
  if sales=minsale then flag=1;
  else if sales=maxsale then flag=2;
run;

Re: What's the better way to flag highest and lowest observations using proc sql

LaurieF — Mon, 02 Jan 2017 02:18:34 GMT

Oh very good - I do like a good hash table lookup.

Bear in mind for this solution, and the SQL method: if two products have the same sales value, the flag will be set incorrectly. The only easy way to be absolutely sure is to use first/last.

And also note: the original point was to find out if there was an SQL method for first/last processing, not whether there was another way of doing it.

Re: What's the better way to flag highest and lowest observations using proc sql

LaurieF — Mon, 02 Jan 2017 21:12:48 GMT

I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).

proc sql undo_policy=none magic=103;
create table shoes as
   select *
     from sashelp.shoes
    order by region,
             product,
             sales;
create table shoes as
   select *,
          monotonic() as identifier
     from shoes;
create table shoes(drop=identifier) as
   select shoes.*,
          case
             when shoes.identifier = sum.min_id 
                then 1
             when shoes.identifier = sum.max_id
                then 2
             else .
             end as flag length=3
     from shoes
    inner join (select region,
                       product,
                       min(identifier) as min_id,
                       max(identifier) as max_id
                  from shoes
                 group by region,
                          product) as sum
       on shoes.region = sum.region
      and shoes.product = sum.product;
quit;

Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.

Re: What's the better way to flag highest and lowest observations using proc sql

kisumsam — Mon, 02 Jan 2017 23:48:38 GMT

Thanks all for the wonderful suggestions! The hash table looks good but as LaurieF said it might not get the correct results. Looks like the first/last data step method is indeed the best.

Re: What's the better way to flag highest and lowest observations using proc sql

mkeintz — Tue, 03 Jan 2017 05:37:19 GMT

Is the first./last. method as submitted REALLY the correct result? Do you mean you want to ignore ties at the minimum or maximum, and flag only one record for each? If so, what rule should be used to select which of the tied records gets flagged? The usual PROC SORT (with default option EQUALS) result would preserve original order among ties. So the first MIN and last MAX would be flagged. Is that the desired behavior?

Also, to be more general, you might want two flag variables instead of one. That would allow you to accomoate cases in which there is only one record in a group (i.e it's both min and max). The first./last. program assigns a flag=1 and never assigns a flag=2 for such a case.

Re: What's the better way to flag highest and lowest observations using proc sql

kisumsam — Tue, 03 Jan 2017 20:17:55 GMT

Thanks mkeintz. Yes, I was aware that the program isn't ideal when there is only 1 observation in the BY group. I just wanted to find out if there is a more efficient way to achieve the same goal with proc sql.

Re: What's the better way to flag highest and lowest observations using proc sql

mkeintz — Tue, 03 Jan 2017 22:39:24 GMT

@LaurieF

I'm confused. I do not understand why it is that "if two products have the same sales value, the flag will be set incorrectly". The hash object proposal generates independent min's and max's for each product, regardless of whether they are identically distributed. Aside from the issue of whether to flag ties at max or at min WITHIN a region/product, I don't see how it would be different than the PROC-SORT-FOLLOWED-BY FIRST./LAST. technique.