BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Alice_SAS
Calcite | Level 5

I currently have two tables have1 and have2, and I want to determine if the YEAR corresponding to each NAME in have2 is within the time range of have1 [START END]. If marked as 1, otherwise it is 0. Any part of the YEAR can be within the time range.

I hope to match through the common CODE of two tables and obtain Want.

Thank you for your help!

1 ACCEPTED SOLUTION

Accepted Solutions
andreas_lds
Jade | Level 19

Depending on the size of have1 using a hash object could solve the problem:

data work.want;
   set work.have2;

   length flag more 8;

   if _n_ = 1 then do;
      if 0 then set work.have1;

      declare hash h(dataset: 'work.have1', multidata: 'yes');
      h.defineKey('name', 'code');
      h.defineData('start', 'end');
      h.defineDone();

      call missing(start, end);
   end;

   flag = 0;

   if h.find() = 0 then do;
      flag = year(start) <= year <= year(end);
      h.has_next(result: more);

      do while (more and not flag);
         rc = h.find_next();
         flag = (year(start) <= year <= year(end)) or flag;
         h.has_next(result: more);
      end;
   end;

   drop start end rc more;
run;

View solution in original post

10 REPLIES 10
Alice_SAS
Calcite | Level 5

Here is the data:

 

data have1;
input start: YYMMDD8. end: YYMMDD8. name:$1. code;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
20050712 20060531 A 83265
20070531 20080531 A 83265
20100624 20130812 B 79080
20090620 20110912 B 79080
20040526 20051109 C 35423
20051109 20100831 C 35423
20100831 20110930 C 35423
20140401 20160418 D 69105
20170419 20180424 E 53272
;
run;


data have2;
input year name:$1. code;
datalines;
2011 A 83265
2012 A 83265
2007 B 79080
2008 B 79080
2009 B 79080
2010 B 79080
2012 D 69105
2013 D 69105
2014 D 69105
2015 D 69105
2016 D 69105
;
run;

data want;
input year name:$1. code exist;
datalines;
2011 A 83265 0
2012 A 83265 0
2007 B 79080 0
2008 B 79080 0
2009 B 79080 1
2010 B 79080 1
2012 D 69105 0
2013 D 69105 0
2014 D 69105 1
2015 D 69105 1
2016 D 69105 1
;
run;

Alice_SAS
Calcite | Level 5

Yes. But they are in different situations, especially for have1. Each NAME and CODE corresponds to different time intervals and needs to be concatenated for judgment. The previous code cannot be implemented in this situation. 

PaigeMiller
Diamond | Level 26

In the other thread, you mentioned that you have really huge data sets, and that the solution presented didn't work because you ran out of memory. Is that the case here as well?

--
Paige Miller
Alice_SAS
Calcite | Level 5

Yes. HAVE2 has over 60000 rows and 20 columns, while HAVE1 has 1.5 million rows and 10 columns.

s_lassen
Meteorite | Level 14

Here is a solution that uses indexed lookup - the index on the HAVE1 dataset should probably be created in the program that creates the table, not in the lookup program:


proc sql;
  create index idx on have1(name,code);
quit;

data want;
  set have2;
  exist=0;
  do name=name,' '; /* initialize for next read by calling with non-existing key, as keys repeat in HAVE2 */
    do until(exist);
      set have1 key=idx;
      if _iorc_ then do;
        _error_=0;
        leave;
        end;
      exist=year(start)<=year<=year(end);
      end;
      if name ne ' ' then output;
    end;
  keep year name code exist;
run;

The good thing about this approach, when HAVE2 is the smaller table, is that we only access the records in HAVE1 that are actually relevant.

s_lassen
Meteorite | Level 14

Here is another possibility, which requires sorting the big HAVE1 table. We can then create a view (or a table, but a view is probably faster in this case) containing the possible years in the correct order (this is of course a very good solution if your actual data, unlike your demonstration data, is actually sorted already):


proc sort data=have1;
  by name code start;
run;

data have1_years/view=have1_years;
  do until(last.code);
    set have1; 
    by name code;
    do year=max(year+1,year(start)) to year(end);
      output;
      end;
    end;
  keep name code year;
run;
 

data want;
  merge have1_years(in=in1) have2(in=in2);
  by name code year;
  if in2;
  exist=in1;
run;

The solution can also be done extracting NAME, CODE and YEAR from HAVE1, and then sort the output data afterwards. This may be a better solution if you have many columns in HAVE1:

data have1_years;
  set have1; 
  do year=year(start) to year(end);
    output;
    end;
  keep name code year;
run;

Proc sort nodupkey;
  by name code year;
run;
ballardw
Super User

@Alice_SAS wrote:

Yes. But they are in different situations, especially for have1. Each NAME and CODE corresponds to different time intervals and needs to be concatenated for judgment. The previous code cannot be implemented in this situation. 


So just where is your statement of how intervals need to be concatenated and what rule(s) are applied for "judgement". I don't see anything about either action in the original problem description.

I currently have two tables have1 and have2, and I want to determine if the YEAR corresponding to each NAME in have2 is within the time range of have1 [START END]. If marked as 1, otherwise it is 0. Any part of the YEAR can be within the time range.

I hope to match through the common CODE of two tables and obtain Want.

Thank you for your help!

Or in the post with the example data.

If a previous solution does not work, you should post a reference to that "solution" and describe what has changed that makes it no longer suitable.

 

andreas_lds
Jade | Level 19

Depending on the size of have1 using a hash object could solve the problem:

data work.want;
   set work.have2;

   length flag more 8;

   if _n_ = 1 then do;
      if 0 then set work.have1;

      declare hash h(dataset: 'work.have1', multidata: 'yes');
      h.defineKey('name', 'code');
      h.defineData('start', 'end');
      h.defineDone();

      call missing(start, end);
   end;

   flag = 0;

   if h.find() = 0 then do;
      flag = year(start) <= year <= year(end);
      h.has_next(result: more);

      do while (more and not flag);
         rc = h.find_next();
         flag = (year(start) <= year <= year(end)) or flag;
         h.has_next(result: more);
      end;
   end;

   drop start end rc more;
run;
Tom
Super User Tom
Super User

What does it mean for a year to be within a time period?

Do you mean that the whole year is covered by the time period?

Period Start -------- End
Year          J---D
start <= mdy(1,1,year) and mdy(12,31,year) <= end

Or does only part of the year need to be covered?

Period       Start -------- End
Year       J-----D
Year                       J-----D
start <= mdy(12,31,year) and mdy(1,1,year) <= end

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1117 views
  • 2 likes
  • 6 in conversation