Solved: Re: select rows with condition

zihdonv19 · Posted 12-01-2023 10:25 AM

I have data below

data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
;
proc sort data=have; by id visit_date ;run;

I want to select rows for the id, whose id have non-missing "type" for each "visit_year" of 2007-2011.

So, for the dataset above, I would expect I will get row 1-5.

Patrick · Posted 12-03-2023 03:24 AM

To re-formulate your requirement:
I want to select rows for the id, whose id have NEVER a missing "type" for "visit_year" between 2007 to 2011 AND for which there is at least one row per year.

I assume there could also be multiple visit dates (rows) for an ID in a single year.

Below code basically using what @Kurt_Bremser proposed with some tweaks to check the type variable and ensuring that there can't be an out of array subscript condition. I've also amended your sample data slightly to make it a bit more representative for how I assume your real data looks like (=multiple rows per year for an id - of which one row could have a missing type).

data have;
  infile datalines dsd truncover;
  input id  type visit_date :yymmdd10.;
  format visit_date yymmdd10.;
  datalines;
1,1,2007-01-01
1,1,2007-05-01
1,2,2008-01-02
1,2,2009-01-03
1,3,2010-01-04
1,5,2011-01-05
9,1,2007-01-01
9,1,2007-05-01
9,.,2007-06-01
9,2,2008-01-02
9,2,2009-01-03
9,3,2010-01-04
9,5,2011-01-05
2,1,2008-01-01
2,2,2008-01-02
2,2,2009-01-03
2,3,2010-01-04
2,5,2011-01-05
2,5,2012-01-05
3,2,2007-01-09
3,.,2008-01-10
3,2,2009-01-11
3,.,2010-01-12
3,.,2011-01-12
4,.,2007-01-10
4,3,2008-01-11
4,1,2009-01-12
5,2,2007-01-11
5,2,2008-01-12
5,2,2009-01-10
5,2,2010-01-11
6,2,2008-01-12
6,1,2009-01-10
6,.,2010-01-11
;

proc sort data=have;
  by id;
run;

data want;
  array _y{2007:2011};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id;
    if 2007<=year(visit_date)<=2011 then
      do;
        if missing(type) then _y{year(visit_date)} = 0;
        else _y{year(visit_date)} = _y{year(visit_date)} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id;
    if sum(of _y:) = 5 then output;
  end;
  drop _y:;
run;

proc print data=want;
run;

View solution in original post

A_Kh · Posted 12-01-2023 11:44 AM

If you need to see only rows 1-5 in the output, please explain further showing the expected output data as well.
What I see from your data is each visit (visit_date) is unique per id. If you select only non missing TYPE (where TYPE ^=. ) then you get more observation than 5.

zihdonv19 · Posted 12-01-2023 01:12 PM

Hi!
The reason I expected only the first 5 rows in the output is that for id 1, their "type" is not missing for year 2007, 2008, 2009, 2010 and 2011.

A_Kh · Posted 12-01-2023 02:10 PM

Assuming that your data, when 2007 and 2011 years present, will have all years of 2007 -2011 per ID:

proc sql;
	create table want as
		select*
			from have
		         where type ne . and id in (select id from have 
										        group by id having min(year(visit_date)) eq 2007 and max(year(visit_date)) eq 2011);
quit;

Kurt_Bremser · Posted 12-01-2023 03:51 PM

data want;
array _y{2007:2011};
do until (last.id);
  set have;
  by id;
  _y{year(visit_date)} = 1;
end;
do until (last.id);
  set have;
  by id;
  if sum(of _y:) = 5 then output;
end;
drop _y:;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 12-02-2023 05:14 AM

data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
;
proc sql;
create table want as
select *
 from have
  where type is not missing and year(visit_date) between 2007 and 2011
   group by id
    having count(distinct year(visit_date))=5;
quit;

mkeintz · Posted 12-02-2023 10:18 PM

This code reads each ID twice, as in @Kurt_Bremser 's example. The first pass reads only non-missing TYPE values and builds a profile of years encountered. But the subsetting IF allows output only during second pass, if the profile is all 1's.

data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
6 	9 	2013-01-11
run;

data want;
  set have (where=(missing(type)=0) in=firstpass)
      have (in=secondpass);
  by id;

  array years {2007:2011} _temporary_;
  if first.id then call missing(of years{*});

  if firstpass then years{year(visit_date)}=1;

  if secondpass=1 and sum(of years{*})=5;
run;

The code assumes that all the data are in the desired years (2007-2011). If earlier or later visit_dates are possible then change the "where" option for the firstpass as in:

  set have (where=(missing(type)=0 and '01jan2007'd<=visit_date<='31dec2011'd) in=firstpass)
      have (in=secondpass);

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 12-03-2023 03:24 AM

To re-formulate your requirement:
I want to select rows for the id, whose id have NEVER a missing "type" for "visit_year" between 2007 to 2011 AND for which there is at least one row per year.

I assume there could also be multiple visit dates (rows) for an ID in a single year.

Below code basically using what @Kurt_Bremser proposed with some tweaks to check the type variable and ensuring that there can't be an out of array subscript condition. I've also amended your sample data slightly to make it a bit more representative for how I assume your real data looks like (=multiple rows per year for an id - of which one row could have a missing type).

data have;
  infile datalines dsd truncover;
  input id  type visit_date :yymmdd10.;
  format visit_date yymmdd10.;
  datalines;
1,1,2007-01-01
1,1,2007-05-01
1,2,2008-01-02
1,2,2009-01-03
1,3,2010-01-04
1,5,2011-01-05
9,1,2007-01-01
9,1,2007-05-01
9,.,2007-06-01
9,2,2008-01-02
9,2,2009-01-03
9,3,2010-01-04
9,5,2011-01-05
2,1,2008-01-01
2,2,2008-01-02
2,2,2009-01-03
2,3,2010-01-04
2,5,2011-01-05
2,5,2012-01-05
3,2,2007-01-09
3,.,2008-01-10
3,2,2009-01-11
3,.,2010-01-12
3,.,2011-01-12
4,.,2007-01-10
4,3,2008-01-11
4,1,2009-01-12
5,2,2007-01-11
5,2,2008-01-12
5,2,2009-01-10
5,2,2010-01-11
6,2,2008-01-12
6,1,2009-01-10
6,.,2010-01-11
;

proc sort data=have;
  by id;
run;

data want;
  array _y{2007:2011};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id;
    if 2007<=year(visit_date)<=2011 then
      do;
        if missing(type) then _y{year(visit_date)} = 0;
        else _y{year(visit_date)} = _y{year(visit_date)} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id;
    if sum(of _y:) = 5 then output;
  end;
  drop _y:;
run;

proc print data=want;
run;

zihdonv19 · Posted 12-06-2023 04:02 AM

Hi Patrick,

Indeed, you have a valid point! I didn't explain myself clearly before I asked this question, but you made me rethink what kind of data I really have.

May I ask a follow-up question? What if I want the year range to be the first year of visit to 4 years after the first year of visit, instead of a fixed range 2007-2011? For example, for id1 the year range I want to look at should be 2007-2011, id2 2008-2012, id3 2007-2011, id4 2007-2011 etc.

Patrick · Posted 12-06-2023 06:15 AM

Code like below should work. I've now also added parameter &n_years (%let n_years=5) which lets you select in a single place how many years you want to include into the selection window.

proc sort data=have;
  by id visit_date;
run;

%let n_years=5;
data want;
  array _y{&n_years};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id visit_date;

    visit_year=year(visit_date);
    if first.id then _first_year=visit_year;

    _ind=visit_year - _first_year +1;

    if 1<=_ind<=&n_years then
      do;
        if missing(type) then _y{_ind} = 0;
        else _y{_ind} = _y{_ind} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id visit_date;
    if sum(of _y:) = &n_years then output;
  end;
  drop _:;
run;

proc print data=want;
run;

zihdonv19 · Posted 12-06-2023 09:06 AM

thanks! this is magic 🙂

Classroom Training Available!