Hello,
I have an interesting scenario where I have to delete/include the records in a dataset based on certain conditions.
Category Count
Asian 12
American 3
European 25
Unkown 31
All 71
Out of this dataset if any of categories except Unkown and All have a count less than 10, then I want the output dataset to have only Unkown and All. But if all the categories except Unkown and All have count of more than 10, then all the categories should be in the output dataset. But because I am going against the database, I am not entirely sure how many different categories will show up besides All and Unkown.
I have a coding mental block with reading the dataset more than once in the same datastep. Just wondering if there is an easy way with either Datastep/Proc Sql.
PROC SQL;
CREATE TABLE DEMOGRAPHICS_OP AS
SELECT CASE
WHEN CATEGORY IN ('ALL','UNKNOWN') THEN CATEGORY
WHEN CATEGORY NOT IN ('ALL','UNKNOWN') AND COUNT < 10 THEN ' '
ELSE CATEGORY
END AS CATEGORY
,COUNT
FROM DEMOGRAPHICS_BASE
;
QUIT;
Thanks,
saspert.
Would something like the following sufice?:
data conditional want;
set have end=lastone;
retain failed;
if Category in ('Unkown','All') then output want;
else do;
if not failed then do;
if count lt 10 then failed=1;
else output conditional;
end;
end;
if lastone and not failed then do;
call execute ('proc append base=want data=conditional;run;');
end;
run;
saspert, this would give what you wanted.
data cat;
input category $ count;
cards;
Asian 12
American 3
European 25
Unknown 31
All 71
;
run;
proc sql;
create table tmp as
select count(*) as totcnt,sum(case when count>10 then 1 else 0 end) as catcnt from cat where category not in ('Unknown','All');
create table result as
select * from cat where category in ('Unknown','All') or (select totcnt from tmp) = (select catcnt from tmp);
quit;
proc print data = tmp;
proc print data = result;
run;
FWIW, both suggestions result in the same output (well, at least if the actual spelling of unknown whatever it really is in the data is used), but the datastep approach takes one-half the time, particularly when the conditions aren't met (as in the example). Otherwise, they take about the same time to run.
Art
Thanks for the observation, Art. It is consistent with my experience that datastep is about always quicker than proc sql, especially when dealing with larger number of records in a dataset, with multiple tables.
Unfortunately It is not suitable for Cartesian Product.
SQL is designed based on Cartesian Product, sql would have some optimize.
Ksharp
Hi Art297 and Buckeye,
I tried both your solutions and they both work for me. Personally, I prefer the PROC SQL method because I am more comfortable with that. Also, the database values are summarized before this solution is used. So, I decided to go with the Proc SQL method. I should however learn the nuances of the data step/retain methods.
Thanks,
saspert.
You just need to use a summary function in SQL to determine if all of the variables have counts larger than ten. You can do it with or without another grouping variable in the dataset. This code takes advantage of the fact that SAS evaluates logical operations as 1 (true) or 0 (false).
data have ; length group $5 ; input Category $ Count; group = 'SMALL' ; output; group = 'LARGE' ; count=10+count; output; cards; Asian 12 American 3 European 25 Unknown 9 All 71 run; proc sql noprint ; create table want as select group , category , count from have group by group having min( (count > 10) or category = 'Unknown' ) = 1 or category in ('All' 'Unknown') order by group ,category ; quit;
Obs group Category Count 1 LARGE All 81 2 LARGE American 13 3 LARGE Asian 22 4 LARGE European 35 5 LARGE Unknown 19 6 SMALL All 71 7 SMALL Unknown 9
this is a candidate for a simple 2-pass solution.
First pass makes the decision which is retained to be applied in the second pass.
Data result ;
Set demog( in= firstpass ) demog ;
Retain decision 0;
If firstpass then do ;
If count<10 then do ;
If not lowcase( category ) in( 'unknown', 'all' )
Then decision = 1 ;
End ;
Delete ;
End;
If lowcase( category ) in ('all','unknown') or decision = 0;
run ;
It is good to know that aggregate function is not needed when grouping. This is different from SQL statements in languages rather than SAS.
Not sure what you discovered. Proc SQL does require the use of a summary function in order for a group statement to apply.
It appears it is not required in SAS proc sql as illustrated in Tom's solution below:
select group
, category
, count
from have
group by group
Aggregate functions, such as min(), max(), sum(), are not required for, in this case, category and count. Hope this clarifies.
The program does use the MIN() aggregate function in the HAVING clause.
But it is not required that an aggregate statistic be include in the selected variables.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.