07-14-2011 10:41 AM
I have an interesting scenario where I have to delete/include the records in a dataset based on certain conditions.
Out of this dataset if any of categories except Unkown and All have a count less than 10, then I want the output dataset to have only Unkown and All. But if all the categories except Unkown and All have count of more than 10, then all the categories should be in the output dataset. But because I am going against the database, I am not entirely sure how many different categories will show up besides All and Unkown.
I have a coding mental block with reading the dataset more than once in the same datastep. Just wondering if there is an easy way with either Datastep/Proc Sql.
CREATE TABLE DEMOGRAPHICS_OP AS
WHEN CATEGORY IN ('ALL','UNKNOWN') THEN CATEGORY
WHEN CATEGORY NOT IN ('ALL','UNKNOWN') AND COUNT < 10 THEN ' '
END AS CATEGORY
07-14-2011 11:41 AM
Would something like the following sufice?:
data conditional want;
set have end=lastone;
if Category in ('Unkown','All') then output want;
if not failed then do;
if count lt 10 then failed=1;
else output conditional;
if lastone and not failed then do;
call execute ('proc append base=want data=conditional;run;');
07-15-2011 09:29 AM
saspert, this would give what you wanted.
input category $ count;
create table tmp as
select count(*) as totcnt,sum(case when count>10 then 1 else 0 end) as catcnt from cat where category not in ('Unknown','All');
create table result as
select * from cat where category in ('Unknown','All') or (select totcnt from tmp) = (select catcnt from tmp);
proc print data = tmp;
proc print data = result;
07-15-2011 09:41 AM
FWIW, both suggestions result in the same output (well, at least if the actual spelling of unknown whatever it really is in the data is used), but the datastep approach takes one-half the time, particularly when the conditions aren't met (as in the example). Otherwise, they take about the same time to run.
07-15-2011 10:12 AM
Thanks for the observation, Art. It is consistent with my experience that datastep is about always quicker than proc sql, especially when dealing with larger number of records in a dataset, with multiple tables.
07-18-2011 05:02 PM
Hi Art297 and Buckeye,
I tried both your solutions and they both work for me. Personally, I prefer the PROC SQL method because I am more comfortable with that. Also, the database values are summarized before this solution is used. So, I decided to go with the Proc SQL method. I should however learn the nuances of the data step/retain methods.
07-19-2011 01:04 AM
You just need to use a summary function in SQL to determine if all of the variables have counts larger than ten. You can do it with or without another grouping variable in the dataset. This code takes advantage of the fact that SAS evaluates logical operations as 1 (true) or 0 (false).
data have ; length group $5 ; input Category $ Count; group = 'SMALL' ; output; group = 'LARGE' ; count=10+count; output; cards; Asian 12 American 3 European 25 Unknown 9 All 71 run; proc sql noprint ; create table want as select group , category , count from have group by group having min( (count > 10) or category = 'Unknown' ) = 1 or category in ('All' 'Unknown') order by group ,category ; quit;
Obs group Category Count 1 LARGE All 81 2 LARGE American 13 3 LARGE Asian 22 4 LARGE European 35 5 LARGE Unknown 19 6 SMALL All 71 7 SMALL Unknown 9
07-19-2011 04:02 AM
this is a candidate for a simple 2-pass solution.
First pass makes the decision which is retained to be applied in the second pass.
Data result ;
Set demog( in= firstpass ) demog ;
Retain decision 0;
If firstpass then do ;
If count<10 then do ;
If not lowcase( category ) in( 'unknown', 'all' )
Then decision = 1 ;
If lowcase( category ) in ('all','unknown') or decision = 0;
07-19-2011 10:05 AM
It appears it is not required in SAS proc sql as illustrated in Tom's solution below:
group by group
Aggregate functions, such as min(), max(), sum(), are not required for, in this case, category and count. Hope this clarifies.
07-19-2011 10:11 AM
The program does use the MIN() aggregate function in the HAVING clause.
But it is not required that an aggregate statistic be include in the selected variables.