Solved: What's the better way to flag highest and lowest observations using pr...

kisumsam · Posted 12-31-2016 02:00 PM

Hello,

I'm trying to find a way to use proc sql to flag highest or lowest observations similar to the way we do it in data step:

proc sort data=sashelp.shoes out=shoes;
by region product sales;
run;

data shoes2;
set shoes;
by region product sales;
if first.product then flag = 1;
else if last.product then flag = 2;
run;

I have come up with this solution:

proc sql;
select *, min(sales) as min, max(sales) as max,
case 
	when sales = calculated min then 1
	when sales = calculated max then 2
	else .
end as flag
from sashelp.shoes
group by region, product
order by region, product, sales;
quit;

It does the job. I'm just wondering if there is a more efficient way to do these kind of flagging?

The proc sql code I have above requires reemerging and I'm not sure if this is the best way to do this task.

Can anyone offer any suggestions? thanks.

LaurieF · Posted 01-02-2017 04:12 PM

I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).

proc sql undo_policy=none magic=103;
create table shoes as
   select *
     from sashelp.shoes
    order by region,
             product,
             sales;
create table shoes as
   select *,
          monotonic() as identifier
     from shoes;
create table shoes(drop=identifier) as
   select shoes.*,
          case
             when shoes.identifier = sum.min_id 
                then 1
             when shoes.identifier = sum.max_id
                then 2
             else .
             end as flag length=3
     from shoes
    inner join (select region,
                       product,
                       min(identifier) as min_id,
                       max(identifier) as max_id
                  from shoes
                 group by region,
                          product) as sum
       on shoes.region = sum.region
      and shoes.product = sum.product;
quit;

Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.

View solution in original post

LaurieF · Posted 12-31-2016 03:38 PM

This is one of those areas, as you are obviously aware, where a data step is superior: data steps think globally but acts locally; SQL only acts globally.

By extending your code to use a subjoin (or creating a summarised view) and effectively doing a manual re-merge, the optimiser will probably do a better job. By using the magic keyword and value of 103, I've forced it to do a hash join which, on a really big dataset may make things run faster.

proc sql magic=103;
create table firstlast as
   select shoes.*,
          sum.min, 
          sum.max,
          case
             when sales = sum.min 
                then 1
             when sales = sum.max
                then 2
             else .
             end as flag
     from sashelp.shoes as shoes
    inner join (select region,
                       product,
                       min(sales) as min,
                       max(sales) as max
                  from sashelp.shoes
                 group by region,
                          product) as sum
        on shoes.region = sum.region
       and shoes.product = sum.product
     order by shoes.region,
              shoes.product,
              shoes.sales;
quit;

kisumsam · Posted 12-31-2016 04:11 PM

Thanks LaurieF. I tried using a similar join as well but the code seems to be overly complicated while not reducing much of the CPU/Real time.

Good to know data step is superior is this case. Thanks!

LaurieF · Posted 12-31-2016 04:15 PM

Yeah - I thought as much. SAS SQL just doesn't have the facility to do it. I think you can do something similar in Oracle SQL, but it still requires a fair amount of code.

Reeza · Posted 12-31-2016 07:54 PM

If you have pass through access to your database, the latest versions of SQL (ORACLE/DB2) have either cursor logic or WINDOW/PIVOT technology that helps with this somewhat. It's still more code and I don't find it as clear as a data step.

mkeintz · Posted 01-01-2017 05:49 PM

If it's a big flle and you use the data step approach, I would PROC SUMMARY instead of PROC SORT - much less input/output, ... and it preserves the original record order. Then use a hash lookup to set flag values:

proc summary data=sashelp.shoes (keep=region product sales) noprint nway;
  class region product;
  output out=need (drop=_type_ _freq_) min(sales)=minsale max(sales)=maxsale;
run;

data shoes2 (drop=rc minsale maxsale);
  set sashelp.shoes end=eod;
  flag=.;
  if _n_=1 then do;
    if 0 then set need;
    declare hash h (dataset:'need' );
      h.definekey('region','product');
      h.definedata('minsale','maxsale');
      h.definedone();
  end;
  rc=h.find();
  if sales=minsale then flag=1;
  else if sales=maxsale then flag=2;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

LaurieF · Posted 01-01-2017 06:29 PM

Oh very good - I do like a good hash table lookup.

Bear in mind for this solution, and the SQL method: if two products have the same sales value, the flag will be set incorrectly. The only easy way to be absolutely sure is to use first/last.

And also note: the original point was to find out if there was an SQL method for first/last processing, not whether there was another way of doing it.

LaurieF · Posted 01-02-2017 04:12 PM

I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).

proc sql undo_policy=none magic=103;
create table shoes as
   select *
     from sashelp.shoes
    order by region,
             product,
             sales;
create table shoes as
   select *,
          monotonic() as identifier
     from shoes;
create table shoes(drop=identifier) as
   select shoes.*,
          case
             when shoes.identifier = sum.min_id 
                then 1
             when shoes.identifier = sum.max_id
                then 2
             else .
             end as flag length=3
     from shoes
    inner join (select region,
                       product,
                       min(identifier) as min_id,
                       max(identifier) as max_id
                  from shoes
                 group by region,
                          product) as sum
       on shoes.region = sum.region
      and shoes.product = sum.product;
quit;

Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.

kisumsam · Posted 01-02-2017 06:48 PM

Thanks all for the wonderful suggestions! The hash table looks good but as LaurieF said it might not get the correct results. Looks like the first/last data step method is indeed the best.

mkeintz · Posted 01-03-2017 12:37 AM

Is the first./last. method as submitted REALLY the correct result? Do you mean you want to ignore ties at the minimum or maximum, and flag only one record for each? If so, what rule should be used to select which of the tied records gets flagged? The usual PROC SORT (with default option EQUALS) result would preserve original order among ties. So the first MIN and last MAX would be flagged. Is that the desired behavior?

Also, to be more general, you might want two flag variables instead of one. That would allow you to accomoate cases in which there is only one record in a group (i.e it's both min and max). The first./last. program assigns a flag=1 and never assigns a flag=2 for such a case.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

kisumsam · Posted 01-03-2017 03:17 PM

Thanks mkeintz. Yes, I was aware that the program isn't ideal when there is only 1 observation in the BY group. I just wanted to find out if there is a more efficient way to achieve the same goal with proc sql.

mkeintz · Posted 01-03-2017 05:39 PM

@LaurieF

I'm confused. I do not understand why it is that "if two products have the same sales value, the flag will be set incorrectly". The hash object proposal generates independent min's and max's for each product, regardless of whether they are identically distributed. Aside from the issue of whether to flag ties at max or at min WITHIN a region/product, I don't see how it would be different than the PROC-SORT-FOLLOWED-BY FIRST./LAST. technique.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

Re: What's the better way to flag highest and lowest observations using proc sql

SAS Innovate 2025: Call for Content

Classroom Training Available!