BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
zihdonv19
Quartz | Level 8

I have data below

data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
;
proc sort data=have; by id visit_date ;run;

I want to select rows for the id, whose id have non-missing "type" for each "visit_year" of 2007-2011.

 

So, for the dataset above, I would expect I will get row 1-5.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

To re-formulate your requirement:
I want to select rows for the id, whose id have NEVER a missing "type" for "visit_year" between 2007 to 2011 AND for which there is at least one row per year.

 

I assume there could also be multiple visit dates (rows) for an ID in a single year. 

 

Below code basically using what @Kurt_Bremser proposed with some tweaks to check the type variable and ensuring that there can't be an out of array subscript condition. I've also amended your sample data slightly to make it a bit more representative for how I assume your real data looks like (=multiple rows per year for an id - of which one row could have a missing type).

data have;
  infile datalines dsd truncover;
  input id  type visit_date :yymmdd10.;
  format visit_date yymmdd10.;
  datalines;
1,1,2007-01-01
1,1,2007-05-01
1,2,2008-01-02
1,2,2009-01-03
1,3,2010-01-04
1,5,2011-01-05
9,1,2007-01-01
9,1,2007-05-01
9,.,2007-06-01
9,2,2008-01-02
9,2,2009-01-03
9,3,2010-01-04
9,5,2011-01-05
2,1,2008-01-01
2,2,2008-01-02
2,2,2009-01-03
2,3,2010-01-04
2,5,2011-01-05
2,5,2012-01-05
3,2,2007-01-09
3,.,2008-01-10
3,2,2009-01-11
3,.,2010-01-12
3,.,2011-01-12
4,.,2007-01-10
4,3,2008-01-11
4,1,2009-01-12
5,2,2007-01-11
5,2,2008-01-12
5,2,2009-01-10
5,2,2010-01-11
6,2,2008-01-12
6,1,2009-01-10
6,.,2010-01-11
;

proc sort data=have;
  by id;
run;

data want;
  array _y{2007:2011};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id;
    if 2007<=year(visit_date)<=2011 then
      do;
        if missing(type) then _y{year(visit_date)} = 0;
        else _y{year(visit_date)} = _y{year(visit_date)} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id;
    if sum(of _y:) = 5 then output;
  end;
  drop _y:;
run;

proc print data=want;
run;

 

View solution in original post

10 REPLIES 10
A_Kh
Barite | Level 11

If you need to see only rows 1-5 in the output, please explain further showing the expected output data as well. 
What I see from your data is each visit (visit_date) is unique per id. If you select only non missing TYPE (where TYPE ^=. ) then you get more observation than 5. 

zihdonv19
Quartz | Level 8
Hi!
The reason I expected only the first 5 rows in the output is that for id 1, their "type" is not missing for year 2007, 2008, 2009, 2010 and 2011.
A_Kh
Barite | Level 11

Assuming that your data, when 2007 and 2011 years present, will have all years of 2007 -2011 per ID:

proc sql;
	create table want as
		select*
			from have
		         where type ne . and id in (select id from have 
										        group by id having min(year(visit_date)) eq 2007 and max(year(visit_date)) eq 2011);
quit; 
Kurt_Bremser
Super User
data want;
array _y{2007:2011};
do until (last.id);
  set have;
  by id;
  _y{year(visit_date)} = 1;
end;
do until (last.id);
  set have;
  by id;
  if sum(of _y:) = 5 then output;
end;
drop _y:;
run;
Ksharp
Super User
data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
;
proc sql;
create table want as
select *
 from have
  where type is not missing and year(visit_date) between 2007 and 2011
   group by id
    having count(distinct year(visit_date))=5;
quit;
mkeintz
PROC Star

This code reads each ID twice, as in @Kurt_Bremser 's example.  The first pass reads only non-missing TYPE values and builds a profile of years encountered.  But the subsetting IF allows output only during second pass, if the profile is all 1's.

 

data have;
   input id 	type	visit_date :yymmdd10.;
   format visit_date yymmdd10.;
datalines;
1 	1 	2007-01-01
1 	2 	2008-01-02
1 	2 	2009-01-03
1 	3 	2010-01-04
1 	5 	2011-01-05
2 	1 	2008-01-01
2 	2 	2008-01-02
2 	2 	2009-01-03
2 	3 	2010-01-04
2 	5 	2011-01-05
3 	2 	2007-01-09
3 	. 	2008-01-10
3 	2 	2009-01-11
3 	. 	2010-01-12
4 	. 	2007-01-10
4 	3 	2008-01-11
4 	1 	2009-01-12
5 	2 	2007-01-11
5 	2 	2008-01-12
5 	2 	2009-01-10
5 	2 	2010-01-11
6 	2 	2008-01-12
6 	1 	2009-01-10
6 	. 	2010-01-11
6 	9 	2013-01-11
run;

data want;
  set have (where=(missing(type)=0) in=firstpass)
      have (in=secondpass);
  by id;

  array years {2007:2011} _temporary_;
  if first.id then call missing(of years{*});

  if firstpass then years{year(visit_date)}=1;

  if secondpass=1 and sum(of years{*})=5;
run;

The code assumes that all the data are in the desired years (2007-2011).  If earlier or later visit_dates are possible then change the "where" option for the firstpass as in:

 

  set have (where=(missing(type)=0 and '01jan2007'd<=visit_date<='31dec2011'd) in=firstpass)
      have (in=secondpass);

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

To re-formulate your requirement:
I want to select rows for the id, whose id have NEVER a missing "type" for "visit_year" between 2007 to 2011 AND for which there is at least one row per year.

 

I assume there could also be multiple visit dates (rows) for an ID in a single year. 

 

Below code basically using what @Kurt_Bremser proposed with some tweaks to check the type variable and ensuring that there can't be an out of array subscript condition. I've also amended your sample data slightly to make it a bit more representative for how I assume your real data looks like (=multiple rows per year for an id - of which one row could have a missing type).

data have;
  infile datalines dsd truncover;
  input id  type visit_date :yymmdd10.;
  format visit_date yymmdd10.;
  datalines;
1,1,2007-01-01
1,1,2007-05-01
1,2,2008-01-02
1,2,2009-01-03
1,3,2010-01-04
1,5,2011-01-05
9,1,2007-01-01
9,1,2007-05-01
9,.,2007-06-01
9,2,2008-01-02
9,2,2009-01-03
9,3,2010-01-04
9,5,2011-01-05
2,1,2008-01-01
2,2,2008-01-02
2,2,2009-01-03
2,3,2010-01-04
2,5,2011-01-05
2,5,2012-01-05
3,2,2007-01-09
3,.,2008-01-10
3,2,2009-01-11
3,.,2010-01-12
3,.,2011-01-12
4,.,2007-01-10
4,3,2008-01-11
4,1,2009-01-12
5,2,2007-01-11
5,2,2008-01-12
5,2,2009-01-10
5,2,2010-01-11
6,2,2008-01-12
6,1,2009-01-10
6,.,2010-01-11
;

proc sort data=have;
  by id;
run;

data want;
  array _y{2007:2011};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id;
    if 2007<=year(visit_date)<=2011 then
      do;
        if missing(type) then _y{year(visit_date)} = 0;
        else _y{year(visit_date)} = _y{year(visit_date)} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id;
    if sum(of _y:) = 5 then output;
  end;
  drop _y:;
run;

proc print data=want;
run;

 

zihdonv19
Quartz | Level 8

Hi Patrick,

 

Indeed, you have a valid point! I didn't explain myself clearly before I asked this question, but you made me rethink what kind of data I really have.

 

May I ask a follow-up question? What if I want the year range to be the first year of visit to 4 years after the first year of visit, instead of a fixed range 2007-2011? For example, for id1 the year range I want to look at should be 2007-2011, id2 2008-2012, id3 2007-2011, id4 2007-2011 etc.

Patrick
Opal | Level 21

Code like below should work. I've now also added parameter &n_years (%let n_years=5) which lets you select in a single place how many years you want to include into the selection window.

proc sort data=have;
  by id visit_date;
run;

%let n_years=5;
data want;
  array _y{&n_years};
  call missing(of _y[*]);
  do until (last.id);
    set have;
    by id visit_date;

    visit_year=year(visit_date);
    if first.id then _first_year=visit_year;

    _ind=visit_year - _first_year +1;

    if 1<=_ind<=&n_years then
      do;
        if missing(type) then _y{_ind} = 0;
        else _y{_ind} = _y{_ind} ne 0;
      end;
  end;
  do until (last.id);
    set have;
    by id visit_date;
    if sum(of _y:) = &n_years then output;
  end;
  drop _:;
run;

proc print data=want;
run;

 

Patrick_0-1701863833971.png

 

 

zihdonv19
Quartz | Level 8

thanks! this is magic 🙂

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3249 views
  • 4 likes
  • 6 in conversation