Solved: By Group occurances

mlogan · Posted 02-22-2017 08:03 AM

Hi All,

I have the following table. How do I select all ID's of By Group where 'Math' or 'Bio' occured only after 01 Jan 2000.

data have;
input Grade $ ST_ID $ Course $ Score date;
informat date ddmmyy8.;
format date date9.;
cards; 
Grade10 101 Eng 20 01/01/82
Grade11 105 Bio 25 25/11/99
Grade11 105 Math 18 23/09/00
Grade11 105 Eng 15 11/12/01
Grade11 105 Math 15 14/12/05
Grade11 105 Phy 15 21/03/08
Grade11 102 Eng 15 13/07/02
Grade11 102 Bio 20 22/11/07
Grade11 102 Phy 16 19/04/88
Grade11 102 Math 21 27/08/99
Grade11 104 Arts 13 18/06/00
Grade11 104 Che 12 14/07/09
Grade11 104 Eng 18 09/03/08
Grade11 104 Bio 21 25/11/13
Grade12 103 Bio 10 13/12/99
Grade12 103 Math 20 25/05/01
Grade12 103 Eng 16 04/04/06
Grade12 103 Che 15 09/11/02
Grade12 103 Phy 17 16/03/15
;
RUN;

Expected Output:

Grade11 104 Arts 13 18Jun2000
Grade11 104 Che 12 14Jul2009
Grade11 104 Eng 18 09Mar2008
Grade11 104 Bio 21 25Nov2013

collinelliot · Posted 02-22-2017 10:28 AM

I'm guessing at least some of the other solutions here are fine, but here's my approach:

proc sql;
    CREATE TABLE want AS
    SELECT * FROM have
    GROUP BY st_id
    HAVING max((course in ('Bio', 'Math')) and (date > '01JAN2000'd)) = 1 AND
           max((course in ('Bio', 'Math')) and (date <= '01JAN2000'd)) = 0 
    ORDER BY grade, st_id;
quit;

View solution in original post

RW9 · Posted 02-22-2017 08:07 AM

Something like:

proc sql;
  create table WANT as 
  select  *
  from    HAVE
  where GRADE in (select distinct GRADE 
                  from   HAVE 
                  where  COURSE in ("Math","Eng")
                    and  SCORE > "01JAN2000"d);
quit;

mkeintz · Posted 02-22-2017 09:00 AM

@RW9

The op's request was "where 'Math' or 'Bio' occured only after 01 Jan 2000".

If it's not ambiguously put, and means exactly what it says, then you probably want to exclude by group's in which any 'Math' or 'Bio" occurred on or before 01jan2000 - maybe through a limit on the minimum score for 'Math' or 'Bio' within a group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

RW9 · Posted 02-22-2017 09:13 AM

Yep, I got that. My code's sub-query creates a list of ID's only where they have a record for either Math or Eng and the date being after 01Jan2000. Then it selects the data with ID in that list of IDs. Seems to match what the OP asked?

mkeintz · Posted 02-22-2017 10:01 AM

My point is that if a by-group had a 'Math' or 'Bio' BOTH before and after the cut date, it is not clear from the OP's description that such a by group is wanted.

I.e. in the data set below do you want grade='Math before and After'? The phrase "where 'Math' or 'Bio' occured only after 01 Jan 2000" suggests an intention to exclude that by group (assuming that grade is the "by group" variable). That's why I suggested a within-group minimum criterion.

data have;

grade='Math before and after';

do score='01jan1990'd, '01jan2001'd;

do course='Math','Hist'; output; end;

end;

grade='Math before only';

score='01jan1990'd;

do course='Math','Hist'; output; end;

grade='Math After Only';

score='01jan2001'd;

do course='Math','Hist'; output; end;

format score date9.;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser · Posted 02-22-2017 10:08 AM

@RW9 wrote:

Yep, I got that. My code's sub-query creates a list of ID's only where they have a record for either Math or Eng and the date being after 01Jan2000. Then it selects the data with ID in that list of IDs. Seems to match what the OP asked?

Actually, when I ran your code with the slight change from "score" to "date":

data have;
input Grade $ ST_ID $ Course $ Score date;
informat date ddmmyy8.;
format date date9.;
cards; 
Grade10 101 Eng 20 01/01/82
Grade11 105 Bio 25 25/11/99
Grade11 105 Math 18 23/09/00
Grade11 105 Eng 15 11/12/01
Grade11 105 Math 15 14/12/05
Grade11 105 Phy 15 21/03/08
Grade11 102 Eng 15 13/07/02
Grade11 102 Bio 20 22/11/07
Grade11 102 Phy 16 19/04/88
Grade11 102 Math 21 27/08/99
Grade11 104 Arts 13 18/06/00
Grade11 104 Che 12 14/07/09
Grade11 104 Eng 18 09/03/08
Grade11 104 Bio 21 25/11/13
Grade12 103 Bio 10 13/12/99
Grade12 103 Math 20 25/05/01
Grade12 103 Eng 16 04/04/06
Grade12 103 Che 15 09/11/02
Grade12 103 Phy 17 16/03/15
;
run;

proc sql;
  create table WANT as 
  select  *
  from    HAVE
  where GRADE in (select distinct GRADE 
                  from   HAVE 
                  where  COURSE in ("Math","Eng")
                    and  date > "01JAN2000"d);
quit;

I got 18 observations. So the where condition and the variables used for matching need to be adapted. Since this got quite complicated quite quickly (at least for me), I chose the data step approach.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 02-22-2017 09:15 AM

I think I see clearer now. My suggestion is to create a lookup table for groups and merge back with that:

proc sort data=have;
by grade st_id;
run;

data lookup (keep=grade st_id);
set have;
by grade st_id ;
retain flag;
if first.st_id
then flag = .;
if course in ('Math','Bio')
then do;
  if flag lt 1
  then if date lt '01jan2000'd
    then flag = 1;
    else flag = 0;
end;
if last.st_id and missing(flag) or flag then output;
run;

data want;
merge
  have (in=a)
  lookup (in=b)
;
by grade st_id;
if a and not b;
run;

proc print data=want noobs;
run;

The result looks like the expected one:

 Grade     ST_ID    Course    Score         date

Grade11     104      Arts       13     18JUN2000
Grade11     104      Che        12     14JUL2009
Grade11     104      Eng        18     09MAR2008
Grade11     104      Bio        21     25NOV2013

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 02-22-2017 08:27 AM

What constitutes a by group (which variables)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

art297 · Posted 02-22-2017 09:45 AM

Good case for using a double dow:

data want (drop=keep);
  do until (last.ST_ID);
    set have;
    by ST_ID notsorted;
    if first.st_id then keep=0;
    if keep ne -1 then do;
      if strip(Course) in ('Bio', 'Math') and date lt '01jan2000'd then keep=-1;
      else if strip(Course) in ('Bio', 'Math') and date gt '01jan2000'd then keep=1;
    end;
  end;
  do until (last.ST_ID);
    set have;
    by ST_ID notsorted;
    if keep eq 1 then output;
  end;
run;

Interesting that the in operator wouldn't work as expected unless I stripped the Course variable.

Art, CEO, AnalystFinder.com

collinelliot · Posted 02-22-2017 10:28 AM

I'm guessing at least some of the other solutions here are fine, but here's my approach:

proc sql;
    CREATE TABLE want AS
    SELECT * FROM have
    GROUP BY st_id
    HAVING max((course in ('Bio', 'Math')) and (date > '01JAN2000'd)) = 1 AND
           max((course in ('Bio', 'Math')) and (date <= '01JAN2000'd)) = 0 
    ORDER BY grade, st_id;
quit;

Ksharp · Posted 02-22-2017 10:46 PM

Mine like Arthur's code.

data have;
input Grade $ ST_ID $ Course $ Score date;
informat date ddmmyy8.;
format date date9.;
cards; 
Grade10 101 Eng 20 01/01/82
Grade11 105 Bio 25 25/11/99
Grade11 105 Math 18 23/09/00
Grade11 105 Eng 15 11/12/01
Grade11 105 Math 15 14/12/05
Grade11 105 Phy 15 21/03/08
Grade11 102 Eng 15 13/07/02
Grade11 102 Bio 20 22/11/07
Grade11 102 Phy 16 19/04/88
Grade11 102 Math 21 27/08/99
Grade11 104 Arts 13 18/06/00
Grade11 104 Che 12 14/07/09
Grade11 104 Eng 18 09/03/08
Grade11 104 Bio 21 25/11/13
Grade12 103 Bio 10 13/12/99
Grade12 103 Math 20 25/05/01
Grade12 103 Eng 16 04/04/06
Grade12 103 Che 15 09/11/02
Grade12 103 Phy 17 16/03/15
;
run;
data want;
has_key=0;after_date=1;
  do until (last.ST_ID);
    set have;
    by grade ST_ID notsorted;
    if strip(Course) in ('Bio', 'Math') then do;
     has_key=1;
     if date lt '01jan2000'd then after_date=0;
    end;
  end;
  do until (last.ST_ID);
    set have;
    by grade ST_ID notsorted;
    if has_key and after_date then output;
  end;
drop has_key after_date;
run;

mkeintz · Posted 02-22-2017 11:07 PM

If a proc sort is acceptable, then you can have a VERY simple program:

proc sort data=have;
  by grade st_id;
run;

data want;
  merge have (in=in1 where=(course in ('Math','Bio') and date<'01jan2000'd))
        have (in=in2 where=(course in ('Math','Bio') and date>='01jan2000'd))
        have ;
  by grade st_id;
  if in2>in1;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

Re: By Group occurances

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away