Hello,
I'm trying to find a way to use proc sql to flag highest or lowest observations similar to the way we do it in data step:
proc sort data=sashelp.shoes out=shoes;
by region product sales;
run;
data shoes2;
set shoes;
by region product sales;
if first.product then flag = 1;
else if last.product then flag = 2;
run;
I have come up with this solution:
proc sql;
select *, min(sales) as min, max(sales) as max,
case
when sales = calculated min then 1
when sales = calculated max then 2
else .
end as flag
from sashelp.shoes
group by region, product
order by region, product, sales;
quit;
It does the job. I'm just wondering if there is a more efficient way to do these kind of flagging?
The proc sql code I have above requires reemerging and I'm not sure if this is the best way to do this task.
Can anyone offer any suggestions? thanks.
I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).
proc sql undo_policy=none magic=103;
create table shoes as
select *
from sashelp.shoes
order by region,
product,
sales;
create table shoes as
select *,
monotonic() as identifier
from shoes;
create table shoes(drop=identifier) as
select shoes.*,
case
when shoes.identifier = sum.min_id
then 1
when shoes.identifier = sum.max_id
then 2
else .
end as flag length=3
from shoes
inner join (select region,
product,
min(identifier) as min_id,
max(identifier) as max_id
from shoes
group by region,
product) as sum
on shoes.region = sum.region
and shoes.product = sum.product;
quit;
Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.
This is one of those areas, as you are obviously aware, where a data step is superior: data steps think globally but acts locally; SQL only acts globally.
By extending your code to use a subjoin (or creating a summarised view) and effectively doing a manual re-merge, the optimiser will probably do a better job. By using the magic keyword and value of 103, I've forced it to do a hash join which, on a really big dataset may make things run faster.
proc sql magic=103;
create table firstlast as
select shoes.*,
sum.min,
sum.max,
case
when sales = sum.min
then 1
when sales = sum.max
then 2
else .
end as flag
from sashelp.shoes as shoes
inner join (select region,
product,
min(sales) as min,
max(sales) as max
from sashelp.shoes
group by region,
product) as sum
on shoes.region = sum.region
and shoes.product = sum.product
order by shoes.region,
shoes.product,
shoes.sales;
quit;
Thanks LaurieF. I tried using a similar join as well but the code seems to be overly complicated while not reducing much of the CPU/Real time.
Good to know data step is superior is this case. Thanks!
Yeah - I thought as much. SAS SQL just doesn't have the facility to do it. I think you can do something similar in Oracle SQL, but it still requires a fair amount of code.
If you have pass through access to your database, the latest versions of SQL (ORACLE/DB2) have either cursor logic or WINDOW/PIVOT technology that helps with this somewhat. It's still more code and I don't find it as clear as a data step.
If it's a big flle and you use the data step approach, I would PROC SUMMARY instead of PROC SORT - much less input/output, ... and it preserves the original record order. Then use a hash lookup to set flag values:
proc summary data=sashelp.shoes (keep=region product sales) noprint nway;
class region product;
output out=need (drop=_type_ _freq_) min(sales)=minsale max(sales)=maxsale;
run;
data shoes2 (drop=rc minsale maxsale);
set sashelp.shoes end=eod;
flag=.;
if _n_=1 then do;
if 0 then set need;
declare hash h (dataset:'need' );
h.definekey('region','product');
h.definedata('minsale','maxsale');
h.definedone();
end;
rc=h.find();
if sales=minsale then flag=1;
else if sales=maxsale then flag=2;
run;
Oh very good - I do like a good hash table lookup.
Bear in mind for this solution, and the SQL method: if two products have the same sales value, the flag will be set incorrectly. The only easy way to be absolutely sure is to use first/last.
And also note: the original point was to find out if there was an SQL method for first/last processing, not whether there was another way of doing it.
I'd like to revise my original suggestion, in light of the possibility of there being multiple sales values for a region/product, by using monotonic() (after a sort!).
proc sql undo_policy=none magic=103;
create table shoes as
select *
from sashelp.shoes
order by region,
product,
sales;
create table shoes as
select *,
monotonic() as identifier
from shoes;
create table shoes(drop=identifier) as
select shoes.*,
case
when shoes.identifier = sum.min_id
then 1
when shoes.identifier = sum.max_id
then 2
else .
end as flag length=3
from shoes
inner join (select region,
product,
min(identifier) as min_id,
max(identifier) as max_id
from shoes
group by region,
product) as sum
on shoes.region = sum.region
and shoes.product = sum.product;
quit;
Again, this requires multiple leaps through hoops, and is ultimately unsatisfactory. Data steps FTW, as the young things say.
Thanks all for the wonderful suggestions! The hash table looks good but as LaurieF said it might not get the correct results. Looks like the first/last data step method is indeed the best.
Is the first./last. method as submitted REALLY the correct result? Do you mean you want to ignore ties at the minimum or maximum, and flag only one record for each? If so, what rule should be used to select which of the tied records gets flagged? The usual PROC SORT (with default option EQUALS) result would preserve original order among ties. So the first MIN and last MAX would be flagged. Is that the desired behavior?
Also, to be more general, you might want two flag variables instead of one. That would allow you to accomoate cases in which there is only one record in a group (i.e it's both min and max). The first./last. program assigns a flag=1 and never assigns a flag=2 for such a case.
Thanks mkeintz. Yes, I was aware that the program isn't ideal when there is only 1 observation in the BY group. I just wanted to find out if there is a more efficient way to achieve the same goal with proc sql.
I'm confused. I do not understand why it is that "if two products have the same sales value, the flag will be set incorrectly". The hash object proposal generates independent min's and max's for each product, regardless of whether they are identically distributed. Aside from the issue of whether to flag ties at max or at min WITHIN a region/product, I don't see how it would be different than the PROC-SORT-FOLLOWED-BY FIRST./LAST. technique.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.