Help using Base SAS procedures

Choosing the Earliest and Latest Dates

Accepted Solution Solved
Reply
Contributor
Posts: 55
Accepted Solution

Choosing the Earliest and Latest Dates

Dear SAS Community Members,

I have the dataset below:

id    term    project        start                  finish

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

2     .....

2

2

.

.

.

433

What I want to do is to pick the earliest start and latest finish date for each id-term pair:

id    term           start                  finish

1     10              20070104         20081228

1     11              20080101         20091203

2     ......

2

2

.

.

.

433

I am not quite sure how to choose these dates. I will really appreciate your help.

Thank you


Accepted Solutions
Solution
‎09-25-2012 10:58 AM
Super User
Posts: 17,784

Re: Choosing the Earliest and Latest Dates

If your dates are valid sas dates this would work.

proc sql;

create table want as

select id, term, min(start) as start, max(finish) as finish

from have

group by id, term;

quit;

View solution in original post


All Replies
Solution
‎09-25-2012 10:58 AM
Super User
Posts: 17,784

Re: Choosing the Earliest and Latest Dates

If your dates are valid sas dates this would work.

proc sql;

create table want as

select id, term, min(start) as start, max(finish) as finish

from have

group by id, term;

quit;

Trusted Advisor
Posts: 1,300

Re: Choosing the Earliest and Latest Dates

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

;

run;

proc sql;

create table want as

       select id,

              term,

              min(start) as start format=date9.,

              max(finish) as finish format=date9.

         from have

     group by id, term;

quit;

       

11004JAN200728DEC2008
11101JAN200803DEC2009
Respected Advisor
Posts: 3,124

Re: Choosing the Earliest and Latest Dates

Here is another Data step approach (raw data stolen from Egg's post):

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

;

data want (rename=(_s=start _f=finish)) ;

do until (last.term);

  set have;

    by id term;

      _s=min(_s,start);

      _f=max(_f,finish);

end;

drop start finish project;

format _s _f date9.;

run;

       

proc print;run;

Haikuo

Contributor
Posts: 27

Re: Choosing the Earliest and Latest Dates

Any thought as to how you'd handle with if there was gaps inbetween the dates and wanted to flag for that? 

Contributor
Posts: 55

Re: Choosing the Earliest and Latest Dates

Thank you so much for your kind help. Both methods work great! I actually have a related question.

Suppose I now have the following two datasets (unique id-term pairs):

set1

id term    start           finish       dummy

1 10     20030108 20050103     1

1 11     20050106 20061224     1

...

set2

obsid id     datex         amount

1        1  20040624     $500

2        1  20021212     $100

3        1  20050105     $214

4        1  20060303     $450

I would like to conditionally merge these datasets such that the merged dataset will look like the following (ids should be equal to ids and datex should be between start and finish):

obsid id     datex         amount  term  dummy

1        1  20040624     $500     10     1

2        1  20021212     $100      .       .

3        1  20050105     $214      .       .

4        1  20060303     $450      11    1

I actually created a cartesian product between these two sets and imposed that ids should be the same and datex should be between start and finish. Nevertheless, I need to deal with plus 100,000,000 observations. I was just curious if it was possible to merge these two datasets without creating a cartesian product. Would a data step solution be plausible?

I tried the code below, but it did not work (I am sure the syntax is awfully wrong).

data merge

merge set2 set1;

by id;

if start le datex le finish;

end;

Respected Advisor
Posts: 3,124

Re: Choosing the Earliest and Latest Dates

I suggest that you initiate a new thread, as not many people will go through an 'answered' question.

So back to your topic, from what I can see, similarly you still can have two approaches:

data h1;

input id term   (start  finish) (:yymmdd8.) dummy;

cards;

1 10     20030108 20050103     1

1 11     20050106 20061224     1

;

data h2;

input obsid id     datex :yymmdd8.     amount: dollar5.0;

format datex yymmdd10. amount dollar5.0;

cards;

1        1  20040624     $500

2        1  20021212     $100

3        1  20050105     $214

4        1  20060303     $450

;

/*SQL*/

proc sql;

create table want as

      select a.*, b.term, b.dummy

           from h2 a

           left join

               h1 b

                  on a.id=b.id and start <= datex <= finish

order by obsid;

quit;

/*data step:hash()*/

data want;

  if _n_=1 then do;

     if 0 then set h1;

     dcl hash h1(dataset:'h1', multidata:'y');

     h1.definekey('id');

     h1.definedata(all:'y');

     h1.definedone();

  end;

  set h2;

    out=0;

    rc=h1.find();

    do while (rc=0);

       if start <= datex <= finish then do;out=1;output;leave;end;

       rc=h1.find_next();

    end;

    if out=0 then do; call missing(term,dummy);output;end;

    drop start finish out rc;

run;

Haikuo

PROC Star
Posts: 1,090

Re: Choosing the Earliest and Latest Dates

I suggest you follow

"Match case control large data base"

in SAS Enterprise Data Management & Integration and SAS Procedures (2 threads, same question), as the problem is identical. I suspect the solution to one will also solve the other.

PROC Star
Posts: 1,090

Re: Choosing the Earliest and Latest Dates

Two further questions:

Do any of the data ranges for a given ID overlap, or are the date ranges always exclusive of each other.

Is is possible for more than one record from set2 to match to a date range in set1, or will it always be zero or one record?

I take back my earlier comment about this being similar to "Match case...". The problem there is a very large combinatorial result, whereas I think your case is less difficult.

Tom

Super User
Posts: 9,676

Re: Choosing the Earliest and Latest Dates

One hundred million observations? I want know how big your first dataset((unique id-term pairs) was ?

data h1;
input id term   (start  finish) (:yymmdd8.) dummy;
cards;
1 10     20030108 20050103     1
1 11     20050106 20061224     1
;
run;

data h2;
input obsid id     datex :yymmdd8.     amount: dollar5.0;
format datex yymmdd10. amount dollar5.0;
cards;
1        1  20040624     $500
2        1  20021212     $100
3        1  20050105     $214
4        1  20060303     $450
;
run;

data temp;
 set h1;
 do datex=start to finish ;
  output;
 end;
 drop start finish;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set temp;
  declare hash ha(hashexp:20,dataset:'temp');
   ha.definekey('id','datex');
   ha.definedata('term','dummy');
   ha.definedone();
 end;
set h2;
call missing(term , dummy);
rc=ha.find();
drop rc;
run;




Ksharp

Contributor
Posts: 55

Re: Choosing the Earliest and Latest Dates

Thank you Ksharp for your further help.I am amazed at how powerful SAS is in terms of providing numerous solutions to the same problem. One of my datasets has about 723 unique ids. The other one has about 150,000 observations with repeating ids across time.

Super User
Super User
Posts: 6,499

Re: Choosing the Earliest and Latest Dates

This is a good problem to solve with PROC SUMMARY.

proc summary data=have nway ;

  class id term ;

  var start finish;

  output out=want min(start)=start max(finish)=finish;

run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

Contributor
Posts: 55

Re: Choosing the Earliest and Latest Dates

Thank you Tom, this is really a neat solution. My data is already sorted on id term. I also thank Hai.kuo and TomKari for helping me out with my second question. I did the matching already and it works great.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 7357 views
  • 10 likes
  • 8 in conversation