Subset data using 'by'

Occasional Contributor
Posts: 7
Group     Year
A     2011
A     2011
A     2011
A     2011
A     2010
A     2010
A     2010
B     2011
B     2011
B     2011
B     2010
B     2010
B     2009
B     2009
.     .
.     .
.     .

Data Table4;
  set Table3;
  by group;
  if year in (2009, 2010, 2011);
run;

My goal is to end up with a dataset containing only the groups that have observations from 2009 and 2010 and 2011 (not just 2010 and 2011). Each observation has only one year, but a group can have multiple years. In the complete data set there are roughly 2000 groups, only 500 of which have all three years. I've tried all the permutations I can think of, e.g.:

Data Table4;
  set Table3;
  by group;
  if (year = 2009) && (year= 2010) && (year= 2011);   /* This returns nothing when it SHOULD return a data set with all of group B */
run;

I've scoured the web and every other resource available, but nothing has worked as I need it. I've also tried a solution in PROC SQL, but it's just as clunky. Thank you for any advice.


Accepted Solutions
Solution
‎01-24-2012 01:23 PM
New Contributor
Posts: 4

Subset data using 'by'

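If you want to do it with SQL in one step, here is a shortcut ("cheat") that only counts distinct years, so it is safe only when the data contain no years other than 2009, 2010 and 2011:

proc sql;
  create table want as
  select * from have
  group by group
  having count(distinct year)=3
  ;
quit;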

Or, for a more general solution:

proc sql;
  create table want as
  select * from have
  group by group
  having sum(year=2009)*sum(year=2010)*sum(year=2011)>0
  ;
quit;

Kindly Regards,

Haikuo
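To see how the two HAVING clauses differ, here is a small sketch on made-up data. Group C and the table names want_count and want_each are assumptions added for illustration and are not part of the original question. C has three distinct years, but one of them is 2012.

```sas
/* Hypothetical sample: B has all three target years, C has 2010-2012 */
data have;
  input group $ year;
  datalines;
A 2011
A 2010
B 2011
B 2010
B 2009
C 2012
C 2011
C 2010
;
run;

proc sql;
  /* Counts ANY three distinct years: keeps B and also C */
  create table want_count as
  select * from have
  group by group
  having count(distinct year) = 3;

  /* Needs at least one row for each of 2009, 2010 and 2011: keeps only B */
  create table want_each as
  select * from have
  group by group
  having sum(year=2009)*sum(year=2010)*sum(year=2011) > 0;
quit;
```

Each comparison such as year=2009 evaluates to 1 or 0, so each SUM counts the rows for that year within the group, and the product is zero as soon as one of the three years is missing.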



All Replies
Super Contributor
Posts: 1,636

Re: Subset data using 'by'

Do you want something like this?

data have;
  input group $ year;
  cards;
A     2011
A     2011
A     2011
A     2011
A     2010
A     2010
A     2010
B     2011
B     2011
B     2011
B     2010
B     2010
B     2009
B     2009
;
run;

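/* Keep one row per group and year, restricted to the three target years */
proc sort data=have out=temp (where=(year in (2009,2010,2011))) nodupkey;
  by group year;
run;

/* Count the rows within each group; a count of 3 means all three years are present */
data temp;
  set temp;
  by group;
  count + (-first.group*count) + 1;   /* restarts at 1 on the first row of each group */
  if count=3;
run;

/* Pull every original row for the groups that qualified */
proc sql;
  create table want as select * from have
    where group in (select group from temp)
    order by group, year;
quit;

proc print data=want;
run;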

Obs    group    year
1       B      2009
2       B      2009
3       B      2010
4       B      2010
5       B      2011
6       B      2011
7       B      2011

Linlin

New Contributor
Posts: 4

Subset data using 'by'

Although the SQL approach is more natural for this problem, here is one possible DATA step solution:

data have;
  input group $ year;
  cards;
A     2011
A     2011
A     2011
A     2011
A     2010
A     2010
A     2010
B     2011
B     2011
B     2011
B     2010
B     2010
B     2009
B     2009
;
run;

data want (drop=_:);
  retain _y _c;
  /* First pass: count the distinct target years in the current group */
  do until (last.group);
    set have;
    by group descending year;
    if first.group then do;
      _y=year;
      if _y in (2009,2010,2011) then _c=1;
    end;
    if _y ne year and year in (2009,2010,2011) then do;
      _y=year;
      _c+1;
    end;
  end;
  /* Second pass: re-read the same group and output it only if all three years were found */
  do until (last.group);
    set have;
    by group descending year;
    if _c=3 then output;
  end;
  _c=0;
run;

Kindly Regards,

Haikuo

Super User
Posts: 17,907

Subset data using 'by'

I think you're looking for OR rather than AND, because a single observation can't be 2009, 2010 and 2011 all at once, but it can be any one of them.
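A minimal sketch of the difference, on a cut-down version of the sample data (the table names and_filter and or_filter are made up for illustration). Note that the OR/IN version filters row by row, so on its own it still keeps group A; it does not check that a group has all three years.

```sas
data have;
  input group $ year;
  datalines;
A 2011
A 2010
B 2011
B 2010
B 2009
;
run;

/* AND: no single row can equal three different years, so nothing is kept */
data and_filter;
  set have;
  if year = 2009 and year = 2010 and year = 2011;
run;

/* OR, written with IN: keeps every row whose year is one of the three,
   which includes the rows of group A */
data or_filter;
  set have;
  if year in (2009, 2010, 2011);
run;
```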

Super User
Posts: 6,502

Re: Subset data using 'by'

You can use two DOW loops.

data want ;
  y2009=0;
  y2010=0;
  y2011=0;
  do until (last.group);
    set have (keep=group year);
    by group;
    if year=2009 then y2009=1;
    if year=2010 then y2010=1;
    if year=2011 then y2011=1;
  end;
  do until (last.group);
    set have;
    by group;
    if y2009 and y2010 and y2011 then output;
  end;
run;

Valued Guide
Posts: 765

Re: Subset data using 'by'

hi ... another double DOW idea ...

data want (drop=years);
  length years $200;
  do until (last.group);
    set have;
    by group;
    if ^find(years,cat(year)) and year in (2009:2011) then years=catx(',',years,year);
  end;
  do until (last.group);
    set have;
    by group;
    if length(years) eq 14 then output;
  end;
run;

If the data only contain years 2009 through 2011, the test in the first loop can be shortened to ...

if ^find(years,cat(year)) then years=catx(',',years,year);

Occasional Contributor
Posts: 7

Subset data using 'by'

I've used a PROC SQL statement similar to Haikuo's solution.

However, I was very impressed by everyone's answers. Thank you so much for the responses. I plan on using this forum again if I am stumped in the future!

Cheers!

☑ This topic is solved.
