Solved: Re: Choosing the Earliest and Latest Dates

finans_sas · Posted 09-25-2012 10:51 AM

Dear SAS Community Members,

I have the dataset below:

id term project start finish

1 10 200 20070118 20081228

1 10 197 20070324 20080624

1 10 201 20070104 20081203

1 11 200 20080101 20091124

1 11 201 20080104 20091203

2 .....

2

.

433

What I want to do is to pick the earliest start and latest finish date for each id-term pair:

id term start finish

1 10 20070104 20081228

1 11 20080101 20091203

2 ......

2

.

433

I am not quite sure how to choose these dates. I will really appreciate your help.

Thank you

Reeza · Posted 09-25-2012 10:58 AM

If your dates are valid sas dates this would work.

proc sql;
 create table want as
 select id, term, 
   min(start) as start, 
   max(finish) as finish
from have
 group by id, term;
quit;

Or from @Tom, here's an approach using PROC SUMMARY:

proc summary data=have nway ;
  class id term ;
  var start finish;
  output out=want min(start)=start max(finish)=finish;
run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

View solution in original post

Reeza · Posted 09-25-2012 10:58 AM

If your dates are valid sas dates this would work.

proc sql;
 create table want as
 select id, term, 
   min(start) as start, 
   max(finish) as finish
from have
 group by id, term;
quit;

Or from @Tom, here's an approach using PROC SUMMARY:

proc summary data=have nway ;
  class id term ;
  var start finish;
  output out=want min(start)=start max(finish)=finish;
run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

FriedEgg · Posted 09-25-2012 11:01 AM

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1 10 200 20070118 20081228

1 10 197 20070324 20080624

1 10 201 20070104 20081203

1 11 200 20080101 20091124

1 11 201 20080104 20091203

;

run;

proc sql;

create table want as

select id,

term,

min(start) as start format=date9.,

max(finish) as finish format=date9.

from have

group by id, term;

quit;

1	10	04JAN2007	28DEC2008
1	11	01JAN2008	03DEC2009

Haikuo · Posted 09-25-2012 11:49 AM

Here is another Data step approach (raw data stolen from Egg's post):

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1 10 200 20070118 20081228

1 10 197 20070324 20080624

1 10 201 20070104 20081203

1 11 200 20080101 20091124

1 11 201 20080104 20091203

;

data want (rename=(_s=start _f=finish)) ;

do until (last.term);

set have;

by id term;

_s=min(_s,start);

_f=max(_f,finish);

end;

drop start finish project;

format _s _f date9.;

run;

proc print;run;

Haikuo

endofline · Posted 06-23-2017 03:59 PM

Any thought as to how you'd handle with if there was gaps inbetween the dates and wanted to flag for that?

finans_sas · Posted 09-25-2012 01:21 PM

Thank you so much for your kind help. Both methods work great! I actually have a related question.

Suppose I now have the following two datasets (unique id-term pairs):

set1

id term start finish dummy

1 10 20030108 20050103 1

1 11 20050106 20061224 1

...

set2

obsid id datex amount

1 1 20040624 $500

2 1 20021212 $100

3 1 20050105 $214

4 1 20060303 $450

I would like to conditionally merge these datasets such that the merged dataset will look like the following (ids should be equal to ids and datex should be between start and finish):

obsid id datex amount term dummy

1 1 20040624 $500 10 1

2 1 20021212 $100 . .

3 1 20050105 $214 . .

4 1 20060303 $450 11 1

I actually created a cartesian product between these two sets and imposed that ids should be the same and datex should be between start and finish. Nevertheless, I need to deal with plus 100,000,000 observations. I was just curious if it was possible to merge these two datasets without creating a cartesian product. Would a data step solution be plausible?

I tried the code below, but it did not work (I am sure the syntax is awfully wrong).

data merge

merge set2 set1;

by id;

if start le datex le finish;

end;

Haikuo · Posted 09-25-2012 08:34 PM

I suggest that you initiate a new thread, as not many people will go through an 'answered' question.

So back to your topic, from what I can see, similarly you still can have two approaches:

data h1;

input id term (start finish) (:yymmdd8.) dummy;

cards;

1 10 20030108 20050103 1

1 11 20050106 20061224 1

;

data h2;

input obsid id datex :yymmdd8. amount: dollar5.0;

format datex yymmdd10. amount dollar5.0;

cards;

1 1 20040624 $500

2 1 20021212 $100

3 1 20050105 $214

4 1 20060303 $450

;

/*SQL*/

proc sql;

create table want as

select a.*, b.term, b.dummy

from h2 a

left join

h1 b

on a.id=b.id and start <= datex <= finish

order by obsid;

quit;

/*data step:hash()*/

data want;

if _n_=1 then do;

if 0 then set h1;

dcl hash h1(dataset:'h1', multidata:'y');

h1.definekey('id');

h1.definedata(all:'y');

h1.definedone();

end;

set h2;

out=0;

rc=h1.find();

do while (rc=0);

if start <= datex <= finish then do;out=1;output;leave;end;

rc=h1.find_next();

end;

if out=0 then do; call missing(term,dummy);output;end;

drop start finish out rc;

run;

Haikuo

TomKari · Posted 09-26-2012 01:43 PM

I suggest you follow

"Match case control large data base"

in SAS Enterprise Data Management & Integration and SAS Procedures (2 threads, same question), as the problem is identical. I suspect the solution to one will also solve the other.

TomKari · Posted 09-26-2012 05:19 PM

Two further questions:

Do any of the data ranges for a given ID overlap, or are the date ranges always exclusive of each other.

Is is possible for more than one record from set2 to match to a date range in set1, or will it always be zero or one record?

I take back my earlier comment about this being similar to "Match case...". The problem there is a very large combinatorial result, whereas I think your case is less difficult.

Tom

Ksharp · Posted 09-25-2012 10:41 PM

One hundred million observations? I want know how big your first dataset((unique id-term pairs) was ?

data h1;
input id term   (start  finish) (:yymmdd8.) dummy;
cards;
1 10     20030108 20050103     1
1 11     20050106 20061224     1
;
run;

data h2;
input obsid id     datex :yymmdd8.     amount: dollar5.0;
format datex yymmdd10. amount dollar5.0;
cards;
1        1  20040624     $500
2        1  20021212     $100
3        1  20050105     $214
4        1  20060303     $450
;
run;

data temp;
 set h1;
 do datex=start to finish ;
  output;
 end;
 drop start finish;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set temp;
  declare hash ha(hashexp:20,dataset:'temp');
   ha.definekey('id','datex');
   ha.definedata('term','dummy');
   ha.definedone();
 end;
set h2;
call missing(term , dummy);
rc=ha.find();
drop rc;
run;

Ksharp

finans_sas · Posted 09-26-2012 09:11 PM

Thank you Ksharp for your further help.I am amazed at how powerful SAS is in terms of providing numerous solutions to the same problem. One of my datasets has about 723 unique ids. The other one has about 150,000 observations with repeating ids across time.

Tom · Posted 09-26-2012 12:20 AM

This is a good problem to solve with PROC SUMMARY.

proc summary data=have nway ;

class id term ;

var start finish;

output out=want min(start)=start max(finish)=finish;

run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

finans_sas · Posted 09-26-2012 09:12 PM

Thank you Tom, this is really a neat solution. My data is already sorted on id term. I also thank Hai.kuo and TomKari for helping me out with my second question. I did the matching already and it works great.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away