Dear SAS Community Members,
I have the dataset below:
id term project start finish
1 10 200 20070118 20081228
1 10 197 20070324 20080624
1 10 201 20070104 20081203
1 11 200 20080101 20091124
1 11 201 20080104 20091203
2 .....
2
2
.
.
.
433
What I want to do is to pick the earliest start and latest finish date for each id-term pair:
id term start finish
1 10 20070104 20081228
1 11 20080101 20091203
2 ......
2
2
.
.
.
433
I am not quite sure how to choose these dates. I will really appreciate your help.
Thank you
If your dates are valid sas dates this would work.
proc sql;
create table want as
select id, term,
min(start) as start,
max(finish) as finish
from have
group by id, term;
quit;
Or from @Tom, here's an approach using PROC SUMMARY:
proc summary data=have nway ;
class id term ;
var start finish;
output out=want min(start)=start max(finish)=finish;
run;
If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.
If your dates are valid sas dates this would work.
proc sql;
create table want as
select id, term,
min(start) as start,
max(finish) as finish
from have
group by id, term;
quit;
Or from @Tom, here's an approach using PROC SUMMARY:
proc summary data=have nway ;
class id term ;
var start finish;
output out=want min(start)=start max(finish)=finish;
run;
If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.
data have;
input id term project (start finish) (: yymmdd8.);
cards;
1 10 200 20070118 20081228
1 10 197 20070324 20080624
1 10 201 20070104 20081203
1 11 200 20080101 20091124
1 11 201 20080104 20091203
;
run;
proc sql;
create table want as
select id,
term,
min(start) as start format=date9.,
max(finish) as finish format=date9.
from have
group by id, term;
quit;
1 | 10 | 04JAN2007 | 28DEC2008 |
1 | 11 | 01JAN2008 | 03DEC2009 |
Here is another Data step approach (raw data stolen from Egg's post):
data have;
input id term project (start finish) (: yymmdd8.);
cards;
1 10 200 20070118 20081228
1 10 197 20070324 20080624
1 10 201 20070104 20081203
1 11 200 20080101 20091124
1 11 201 20080104 20091203
;
data want (rename=(_s=start _f=finish)) ;
do until (last.term);
set have;
by id term;
_s=min(_s,start);
_f=max(_f,finish);
end;
drop start finish project;
format _s _f date9.;
run;
proc print;run;
Haikuo
Any thought as to how you'd handle with if there was gaps inbetween the dates and wanted to flag for that?
Thank you so much for your kind help. Both methods work great! I actually have a related question.
Suppose I now have the following two datasets (unique id-term pairs):
set1
id term start finish dummy
1 10 20030108 20050103 1
1 11 20050106 20061224 1
...
set2
obsid id datex amount
1 1 20040624 $500
2 1 20021212 $100
3 1 20050105 $214
4 1 20060303 $450
I would like to conditionally merge these datasets such that the merged dataset will look like the following (ids should be equal to ids and datex should be between start and finish):
obsid id datex amount term dummy
1 1 20040624 $500 10 1
2 1 20021212 $100 . .
3 1 20050105 $214 . .
4 1 20060303 $450 11 1
I actually created a cartesian product between these two sets and imposed that ids should be the same and datex should be between start and finish. Nevertheless, I need to deal with plus 100,000,000 observations. I was just curious if it was possible to merge these two datasets without creating a cartesian product. Would a data step solution be plausible?
I tried the code below, but it did not work (I am sure the syntax is awfully wrong).
data merge
merge set2 set1;
by id;
if start le datex le finish;
end;
I suggest that you initiate a new thread, as not many people will go through an 'answered' question.
So back to your topic, from what I can see, similarly you still can have two approaches:
data h1;
input id term (start finish) (:yymmdd8.) dummy;
cards;
1 10 20030108 20050103 1
1 11 20050106 20061224 1
;
data h2;
input obsid id datex :yymmdd8. amount: dollar5.0;
format datex yymmdd10. amount dollar5.0;
cards;
1 1 20040624 $500
2 1 20021212 $100
3 1 20050105 $214
4 1 20060303 $450
;
/*SQL*/
proc sql;
create table want as
select a.*, b.term, b.dummy
from h2 a
left join
h1 b
on a.id=b.id and start <= datex <= finish
order by obsid;
quit;
/*data step:hash()*/
data want;
if _n_=1 then do;
if 0 then set h1;
dcl hash h1(dataset:'h1', multidata:'y');
h1.definekey('id');
h1.definedata(all:'y');
h1.definedone();
end;
set h2;
out=0;
rc=h1.find();
do while (rc=0);
if start <= datex <= finish then do;out=1;output;leave;end;
rc=h1.find_next();
end;
if out=0 then do; call missing(term,dummy);output;end;
drop start finish out rc;
run;
Haikuo
I suggest you follow
"Match case control large data base"
in SAS Enterprise Data Management & Integration and SAS Procedures (2 threads, same question), as the problem is identical. I suspect the solution to one will also solve the other.
Two further questions:
Do any of the data ranges for a given ID overlap, or are the date ranges always exclusive of each other.
Is is possible for more than one record from set2 to match to a date range in set1, or will it always be zero or one record?
I take back my earlier comment about this being similar to "Match case...". The problem there is a very large combinatorial result, whereas I think your case is less difficult.
Tom
One hundred million observations? I want know how big your first dataset((unique id-term pairs) was ?
data h1; input id term (start finish) (:yymmdd8.) dummy; cards; 1 10 20030108 20050103 1 1 11 20050106 20061224 1 ; run; data h2; input obsid id datex :yymmdd8. amount: dollar5.0; format datex yymmdd10. amount dollar5.0; cards; 1 1 20040624 $500 2 1 20021212 $100 3 1 20050105 $214 4 1 20060303 $450 ; run; data temp; set h1; do datex=start to finish ; output; end; drop start finish; run; data want; if _n_ eq 1 then do; if 0 then set temp; declare hash ha(hashexp:20,dataset:'temp'); ha.definekey('id','datex'); ha.definedata('term','dummy'); ha.definedone(); end; set h2; call missing(term , dummy); rc=ha.find(); drop rc; run;
Ksharp
Thank you Ksharp for your further help.I am amazed at how powerful SAS is in terms of providing numerous solutions to the same problem. One of my datasets has about 723 unique ids. The other one has about 150,000 observations with repeating ids across time.
This is a good problem to solve with PROC SUMMARY.
proc summary data=have nway ;
class id term ;
var start finish;
output out=want min(start)=start max(finish)=finish;
run;
If the data is already sorted by ID TERM then your query will probably run faster and use less memory if you change the CLASS statement to a BY statement.
Thank you Tom, this is really a neat solution. My data is already sorted on id term. I also thank Hai.kuo and TomKari for helping me out with my second question. I did the matching already and it works great.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.