BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kisumsam
Quartz | Level 8

Hello,

 

I'm trying to find a way to use proc sql to flag highest or lowest observations similar to the way we do it in data step:

 

proc sort data=sashelp.shoes out=shoes;
by region product sales;
run;

data shoes2;
set shoes;
by region product sales;
if first.product then flag = 1;
else if last.product then flag = 2;
run;

I have come up with this solution:

 

proc sql;
select *, min(sales) as min, max(sales) as max,
case 
	when sales = calculated min then 1
	when sales = calculated max then 2
	else .
end as flag
from sashelp.shoes
group by region, product
order by region, product, sales;
quit;

It does the job. I'm just wondering if there is a more efficient way to do these kind of flagging?

 

The proc sql code I have above requires reemerging and I'm not sure if this is the best way to do this task.

 

Can anyone offer any suggestions? thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
LaurieF
Barite | Level 11

I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).

 

proc sql undo_policy=none magic=103;
create table shoes as
   select *
     from sashelp.shoes
    order by region,
             product,
             sales;
create table shoes as
   select *,
          monotonic() as identifier
     from shoes;
create table shoes(drop=identifier) as
   select shoes.*,
          case
             when shoes.identifier = sum.min_id 
                then 1
             when shoes.identifier = sum.max_id
                then 2
             else .
             end as flag length=3
     from shoes
    inner join (select region,
                       product,
                       min(identifier) as min_id,
                       max(identifier) as max_id
                  from shoes
                 group by region,
                          product) as sum
       on shoes.region = sum.region
      and shoes.product = sum.product;
quit;

Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.

View solution in original post

11 REPLIES 11
LaurieF
Barite | Level 11

This is one of those areas, as you are obviously aware, where a data step is superior: data steps think globally but acts locally; SQL only acts globally.

 

By extending your code to use a subjoin (or creating a summarised view) and effectively doing a manual re-merge, the optimiser will probably do a better job. By using the magic keyword and value of 103, I've forced it to do a hash join which, on a really big dataset may make things run faster.

 

proc sql magic=103;
create table firstlast as
   select shoes.*,
          sum.min, 
          sum.max,
          case
             when sales = sum.min 
                then 1
             when sales = sum.max
                then 2
             else .
             end as flag
     from sashelp.shoes as shoes
    inner join (select region,
                       product,
                       min(sales) as min,
                       max(sales) as max
                  from sashelp.shoes
                 group by region,
                          product) as sum
        on shoes.region = sum.region
       and shoes.product = sum.product
     order by shoes.region,
              shoes.product,
              shoes.sales;
quit;

 

kisumsam
Quartz | Level 8

Thanks LaurieF. I tried using a similar join as well but the code seems to be overly complicated while not reducing much of the CPU/Real time. 

 

Good to know data step is superior is this case. Thanks!

LaurieF
Barite | Level 11

Yeah - I thought as much. SAS SQL just doesn't have the facility to do it. I think you can do something similar in Oracle SQL, but it still requires a fair amount of code. 

Reeza
Super User

If you have pass through access to your database, the latest versions of SQL (ORACLE/DB2) have either cursor logic or WINDOW/PIVOT technology that helps with this somewhat. It's still more code and I don't find it as clear as a data step. 

mkeintz
PROC Star

If it's a big flle and you use the data step approach, I would PROC SUMMARY instead of PROC SORT - much less input/output, ... and it preserves the original record order.  Then use a hash lookup to set flag values:

 

proc summary data=sashelp.shoes (keep=region product sales) noprint nway;
  class region product;
  output out=need (drop=_type_ _freq_) min(sales)=minsale max(sales)=maxsale;
run;

data shoes2 (drop=rc minsale maxsale);
  set sashelp.shoes end=eod;
  flag=.;
  if _n_=1 then do;
    if 0 then set need;
    declare hash h (dataset:'need' );
      h.definekey('region','product');
      h.definedata('minsale','maxsale');
      h.definedone();
  end;
  rc=h.find();
  if sales=minsale then flag=1;
  else if sales=maxsale then flag=2;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
LaurieF
Barite | Level 11

Oh very good - I do like a good hash table lookup.

 

 

Bear in mind for this solution, and the SQL method: if two products have the same sales value, the flag will be set incorrectly. The only easy way to be absolutely sure is to use first/last.

 

And also note: the original point was to find out if there was an SQL method for first/last processing, not whether there was another way of doing it.

LaurieF
Barite | Level 11

I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).

 

proc sql undo_policy=none magic=103;
create table shoes as
   select *
     from sashelp.shoes
    order by region,
             product,
             sales;
create table shoes as
   select *,
          monotonic() as identifier
     from shoes;
create table shoes(drop=identifier) as
   select shoes.*,
          case
             when shoes.identifier = sum.min_id 
                then 1
             when shoes.identifier = sum.max_id
                then 2
             else .
             end as flag length=3
     from shoes
    inner join (select region,
                       product,
                       min(identifier) as min_id,
                       max(identifier) as max_id
                  from shoes
                 group by region,
                          product) as sum
       on shoes.region = sum.region
      and shoes.product = sum.product;
quit;

Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.

kisumsam
Quartz | Level 8

Thanks all for the wonderful suggestions! The hash table looks good but as LaurieF said it might not get the correct results. Looks like the first/last data step method is indeed the best.

 

 

mkeintz
PROC Star

Is the first./last. method as submitted REALLY the correct result?  Do you mean you want to ignore ties at the minimum or maximum, and flag only one record for each?  If so, what rule should be used to select which of the tied records gets flagged?  The usual  PROC SORT (with default option EQUALS) result would preserve original order among ties.  So the first MIN and last MAX would be flagged.  Is that the desired behavior?

 

Also, to be more general, you might want two flag variables instead of one.  That would allow you to accomoate cases in which there is only one record in a group  (i.e it's both min and max).  The first./last. program assigns a flag=1 and never assigns a flag=2 for such a case. 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
kisumsam
Quartz | Level 8

Thanks mkeintz. Yes, I was aware that the program isn't ideal when there is only 1 observation in the BY group. I just wanted to find out if there is a more efficient way to achieve the same goal with proc sql.

 

 

mkeintz
PROC Star

@LaurieF

 

I'm confused.  I do not understand why it is that "if two products have the same sales value, the flag will be set incorrectly".  The hash object proposal generates independent min's and max's for each product, regardless of whether they are identically distributed.  Aside from the issue of whether to flag ties at max or at min WITHIN a region/product, I don't see how it would be different than the PROC-SORT-FOLLOWED-BY FIRST./LAST. technique.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 6822 views
  • 10 likes
  • 4 in conversation