Conditional Data Step/Proc Sql

saspert · Posted 07-14-2011 10:41 AM

Hello,

I have an interesting scenario where I have to delete/include the records in a dataset based on certain conditions.

Category Count

Asian 12

American 3

European 25

Unkown 31

All 71

Out of this dataset if any of categories except Unkown and All have a count less than 10, then I want the output dataset to have only Unkown and All. But if all the categories except Unkown and All have count of more than 10, then all the categories should be in the output dataset. But because I am going against the database, I am not entirely sure how many different categories will show up besides All and Unkown.

I have a coding mental block with reading the dataset more than once in the same datastep. Just wondering if there is an easy way with either Datastep/Proc Sql.

PROC SQL;

CREATE TABLE DEMOGRAPHICS_OP AS

SELECT CASE

WHEN CATEGORY IN ('ALL','UNKNOWN') THEN CATEGORY

WHEN CATEGORY NOT IN ('ALL','UNKNOWN') AND COUNT < 10 THEN ' '

ELSE CATEGORY

END AS CATEGORY

,COUNT

FROM DEMOGRAPHICS_BASE

;

QUIT;

Thanks,

saspert.

art297 · Posted 07-14-2011 11:41 AM

Would something like the following sufice?:

data conditional want;

set have end=lastone;

retain failed;

if Category in ('Unkown','All') then output want;

else do;

if not failed then do;

if count lt 10 then failed=1;

else output conditional;

end;

if lastone and not failed then do;

call execute ('proc append base=want data=conditional;run;');

end;

run;

buckeye · Posted 07-15-2011 09:29 AM

saspert, this would give what you wanted.

data cat;
     input category $ count;
   cards;
Asian          12
American      3
European      25
Unknown      31
All               71
;
run;

proc sql;
create table tmp as
select count(*) as totcnt,sum(case when count>10 then 1 else 0 end) as catcnt from cat where category not in ('Unknown','All');

create table result as
select * from cat where category in ('Unknown','All') or (select totcnt from tmp) = (select catcnt from tmp);
quit;

proc print data = tmp;
proc print data = result;
run;

art297 · Posted 07-15-2011 09:41 AM

FWIW, both suggestions result in the same output (well, at least if the actual spelling of unknown whatever it really is in the data is used), but the datastep approach takes one-half the time, particularly when the conditions aren't met (as in the example). Otherwise, they take about the same time to run.

Art

buckeye · Posted 07-15-2011 10:12 AM

Thanks for the observation, Art. It is consistent with my experience that datastep is about always quicker than proc sql, especially when dealing with larger number of records in a dataset, with multiple tables.

Ksharp · Posted 07-18-2011 02:22 AM

Unfortunately It is not suitable for Cartesian Product.

SQL is designed based on Cartesian Product, sql would have some optimize.

Ksharp

saspert · Posted 07-18-2011 05:02 PM

Hi Art297 and Buckeye,

I tried both your solutions and they both work for me. Personally, I prefer the PROC SQL method because I am more comfortable with that. Also, the database values are summarized before this solution is used. So, I decided to go with the Proc SQL method. I should however learn the nuances of the data step/retain methods.

Thanks,

saspert.

Tom · Posted 07-19-2011 01:04 AM

You just need to use a summary function in SQL to determine if all of the variables have counts larger than ten. You can do it with or without another grouping variable in the dataset. This code takes advantage of the fact that SAS evaluates logical operations as 1 (true) or 0 (false).

data have ;
  length group $5 ;
  input Category $ Count;
  group = 'SMALL' ; output;
  group = 'LARGE' ; count=10+count; output;
cards;
Asian    12
American  3
European 25
Unknown   9
All      71
run;
 
proc sql noprint ;
  create table want as 
    select group
         , category
         , count
      from have
      group by group
      having min( (count > 10) or category = 'Unknown' ) = 1 
          or category in ('All' 'Unknown')
      order by group ,category
  ;
quit;

Obs group Category Count
 1  LARGE All        81
 2  LARGE American   13
 3  LARGE Asian      22
 4  LARGE European   35
 5  LARGE Unknown    19
 6  SMALL All        71
 7  SMALL Unknown     9

Peter_C · Posted 07-19-2011 04:02 AM

this is a candidate for a simple 2-pass solution.

First pass makes the decision which is retained to be applied in the second pass.

Data result ;

Set demog( in= firstpass ) demog ;

Retain decision 0;

If firstpass then do ;

If count<10 then do ;

If not lowcase( category ) in( 'unknown', 'all' )

Then decision = 1 ;

End ;

Delete ;

End;

If lowcase( category ) in ('all','unknown') or decision = 0;

run ;

buckeye · Posted 07-19-2011 09:22 AM

It is good to know that aggregate function is not needed when grouping. This is different from SQL statements in languages rather than SAS.

art297 · Posted 07-19-2011 09:31 AM

Not sure what you discovered. Proc SQL does require the use of a summary function in order for a group statement to apply.

buckeye · Posted 07-19-2011 10:05 AM

It appears it is not required in SAS proc sql as illustrated in Tom's solution below:

select group

, category

, count

from have

group by group

Aggregate functions, such as min(), max(), sum(), are not required for, in this case, category and count. Hope this clarifies.

Tom · Posted 07-19-2011 10:11 AM

The program does use the MIN() aggregate function in the HAVING clause.

But it is not required that an aggregate statistic be include in the selected variables.