BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
finans_sas
Quartz | Level 8

Dear SAS Community Members,

I have the dataset below:

id    term    project        start                  finish

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

2     .....

2

2

.

.

.

433

What I want to do is to pick the earliest start and latest finish date for each id-term pair:

id    term           start                  finish

1     10              20070104         20081228

1     11              20080101         20091203

2     ......

2

2

.

.

.

433

I am not quite sure how to choose these dates. I will really appreciate your help.

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

If your dates are valid sas dates this would work.

 

proc sql;
 create table want as
 select id, term, 
   min(start) as start, 
   max(finish) as finish
from have
 group by id, term;
quit;

 


Or from @Tom, here's an approach using PROC SUMMARY:

proc summary data=have nway ;
  class id term ;
  var start finish;
  output out=want min(start)=start max(finish)=finish;
run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

View solution in original post

12 REPLIES 12
Reeza
Super User

If your dates are valid sas dates this would work.

 

proc sql;
 create table want as
 select id, term, 
   min(start) as start, 
   max(finish) as finish
from have
 group by id, term;
quit;

 


Or from @Tom, here's an approach using PROC SUMMARY:

proc summary data=have nway ;
  class id term ;
  var start finish;
  output out=want min(start)=start max(finish)=finish;
run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

FriedEgg
SAS Employee

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

;

run;

proc sql;

create table want as

       select id,

              term,

              min(start) as start format=date9.,

              max(finish) as finish format=date9.

         from have

     group by id, term;

quit;

       

11004JAN200728DEC2008
11101JAN200803DEC2009
Haikuo
Onyx | Level 15

Here is another Data step approach (raw data stolen from Egg's post):

data have;

input id term project (start finish) (: yymmdd8.);

cards;

1     10        200          20070118         20081228

1     10        197          20070324         20080624

1     10        201          20070104         20081203

1     11        200          20080101         20091124

1     11        201          20080104         20091203

;

data want (rename=(_s=start _f=finish)) ;

do until (last.term);

  set have;

    by id term;

      _s=min(_s,start);

      _f=max(_f,finish);

end;

drop start finish project;

format _s _f date9.;

run;

       

proc print;run;

Haikuo

endofline
Obsidian | Level 7

Any thought as to how you'd handle with if there was gaps inbetween the dates and wanted to flag for that? 

finans_sas
Quartz | Level 8

Thank you so much for your kind help. Both methods work great! I actually have a related question.

Suppose I now have the following two datasets (unique id-term pairs):

set1

id term    start           finish       dummy

1 10     20030108 20050103     1

1 11     20050106 20061224     1

...

set2

obsid id     datex         amount

1        1  20040624     $500

2        1  20021212     $100

3        1  20050105     $214

4        1  20060303     $450

I would like to conditionally merge these datasets such that the merged dataset will look like the following (ids should be equal to ids and datex should be between start and finish):

obsid id     datex         amount  term  dummy

1        1  20040624     $500     10     1

2        1  20021212     $100      .       .

3        1  20050105     $214      .       .

4        1  20060303     $450      11    1

I actually created a cartesian product between these two sets and imposed that ids should be the same and datex should be between start and finish. Nevertheless, I need to deal with plus 100,000,000 observations. I was just curious if it was possible to merge these two datasets without creating a cartesian product. Would a data step solution be plausible?

I tried the code below, but it did not work (I am sure the syntax is awfully wrong).

data merge

merge set2 set1;

by id;

if start le datex le finish;

end;

Haikuo
Onyx | Level 15

I suggest that you initiate a new thread, as not many people will go through an 'answered' question.

So back to your topic, from what I can see, similarly you still can have two approaches:

data h1;

input id term   (start  finish) (:yymmdd8.) dummy;

cards;

1 10     20030108 20050103     1

1 11     20050106 20061224     1

;

data h2;

input obsid id     datex :yymmdd8.     amount: dollar5.0;

format datex yymmdd10. amount dollar5.0;

cards;

1        1  20040624     $500

2        1  20021212     $100

3        1  20050105     $214

4        1  20060303     $450

;

/*SQL*/

proc sql;

create table want as

      select a.*, b.term, b.dummy

           from h2 a

           left join

               h1 b

                  on a.id=b.id and start <= datex <= finish

order by obsid;

quit;

/*data step:hash()*/

data want;

  if _n_=1 then do;

     if 0 then set h1;

     dcl hash h1(dataset:'h1', multidata:'y');

     h1.definekey('id');

     h1.definedata(all:'y');

     h1.definedone();

  end;

  set h2;

    out=0;

    rc=h1.find();

    do while (rc=0);

       if start <= datex <= finish then do;out=1;output;leave;end;

       rc=h1.find_next();

    end;

    if out=0 then do; call missing(term,dummy);output;end;

    drop start finish out rc;

run;

Haikuo

TomKari
Onyx | Level 15

I suggest you follow

"Match case control large data base"

in SAS Enterprise Data Management & Integration and SAS Procedures (2 threads, same question), as the problem is identical. I suspect the solution to one will also solve the other.

TomKari
Onyx | Level 15

Two further questions:

Do any of the data ranges for a given ID overlap, or are the date ranges always exclusive of each other.

Is is possible for more than one record from set2 to match to a date range in set1, or will it always be zero or one record?

I take back my earlier comment about this being similar to "Match case...". The problem there is a very large combinatorial result, whereas I think your case is less difficult.

Tom

Ksharp
Super User

One hundred million observations? I want know how big your first dataset((unique id-term pairs) was ?

data h1;
input id term   (start  finish) (:yymmdd8.) dummy;
cards;
1 10     20030108 20050103     1
1 11     20050106 20061224     1
;
run;

data h2;
input obsid id     datex :yymmdd8.     amount: dollar5.0;
format datex yymmdd10. amount dollar5.0;
cards;
1        1  20040624     $500
2        1  20021212     $100
3        1  20050105     $214
4        1  20060303     $450
;
run;

data temp;
 set h1;
 do datex=start to finish ;
  output;
 end;
 drop start finish;
run;
data want;
 if _n_ eq 1 then do;
  if 0 then set temp;
  declare hash ha(hashexp:20,dataset:'temp');
   ha.definekey('id','datex');
   ha.definedata('term','dummy');
   ha.definedone();
 end;
set h2;
call missing(term , dummy);
rc=ha.find();
drop rc;
run;




Ksharp

finans_sas
Quartz | Level 8

Thank you Ksharp for your further help.I am amazed at how powerful SAS is in terms of providing numerous solutions to the same problem. One of my datasets has about 723 unique ids. The other one has about 150,000 observations with repeating ids across time.

Tom
Super User Tom
Super User

This is a good problem to solve with PROC SUMMARY.

proc summary data=have nway ;

  class id term ;

  var start finish;

  output out=want min(start)=start max(finish)=finish;

run;

If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.

finans_sas
Quartz | Level 8

Thank you Tom, this is really a neat solution. My data is already sorted on id term. I also thank Hai.kuo and TomKari for helping me out with my second question. I did the matching already and it works great.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 32976 views
  • 11 likes
  • 8 in conversation