- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Dear All,
I have dataset db1 including variables: id, start_date, end_date, and V1.
data db1;
input id start_date end_date V1;
datalines;
101 20130126 20130814 19
101 20130729 20130829 77
102 20130201 20130401 14
103 20130217 20130225 11
104 20130102 20131119 47
;
I also have dataset db2 including variables: id, date, and V2.
data db2;
input id date V2;
datalines;
101 20130719 26
103 20130224 55
105 20131107 10
;
I would like to create a variable FLAG in the dataset db1 such that FLAG = 1 if db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date. That is, I would like to obtain the following
data want;
input id start_date end_date V1 flag;
datalines;
101 20130126 20130814 19 1
101 20130729 20130829 77 .
102 20130201 20130401 14 .
103 20130217 20130225 11 1
104 20130102 20131119 47 .
;
My attempt involves PROC SQL and I'm able to isolate the desired records. However, I'm not able to attach a flag in the original dataset db1
proc sql;
create table want
as select db1.*
from db1, db2
where db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db2.end_date;
quit;
Any help would be highly appreciated.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1)Hash Table data db1; input id start_date end_date V1; datalines; 101 20130126 20130814 19 101 20130729 20130829 77 102 20130201 20130401 14 103 20130217 20130225 11 104 20130102 20131119 47 ; data db2; input id date V2; datalines; 101 20130719 26 103 20130224 55 105 20131107 10 ; run; data want; if _n_ = 1 then do; if 0 then set db2; declare hash ha(dataset:'db2',hashexp:20); ha.definekey('id'); ha.definedata('date'); ha.definedone(); end; set db1; if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. ); run; 2)Or if dataset have already been sorted , Merge statement data w; merge db1(in=ina) db2; by id; flag=ifn( start_date le date le end_date,1,. ); if ina; run;
Xia Keshan
Message was edited by: xia keshan
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sql;
create table want as
select *, case exists (select * from db2 where id=a.id and date between a.start_date and a.end_date) when 1 then 1 else . end as flag
from db1 a
;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
proc sql;
create table WANT as
select DB1.*,
case when DB2.ID is not null then 1 else . end as FLAG
from DB1 DB1
left join DB1 DB2
on DB1.ID=DB2.ID
and DB1.START_DATE <= DB2.DATE <= DB1.END_DATE;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
proc sql;
select db1.*,case when db2.date >= db1.start_date and db2.date <= db1.end_date then 1
end as flag
from db1 left join db2
on db1.id = db2.id;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
, , @stat@sas, thank you for your replies.
I was wondering whether it is possible to gain a bit in efficiency because my dataset db1 has >30M records and db2 around 0.3 M. It's going to take days in this way (still running...)
Maybe knowing db1 is sorted on id and date and db2 is sorted on id and start_date, SAS could stop checking the condition for remaining records in db2 when db2.id > db1.id.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not thinking of an easy way. Whichever way you look at it your going to have to join the two datasets at some point. There are many tips and tricks on here to speed up things, e.g. break your 30m records down into smaller datasets (maybe on id or something). You can use indexes, get distinct list of ids which match and only do the check on those etc. However at the end of the 30m records is going to take a while regardless. Maybe also evaluate if adding a flag to the big dataset is the best way forward.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
proc sql;
create table want
as select db1.*,
case
when db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date then 1
else .
end as flag
from db1 left join db2 on db1.id = db2.id;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
1)Hash Table data db1; input id start_date end_date V1; datalines; 101 20130126 20130814 19 101 20130729 20130829 77 102 20130201 20130401 14 103 20130217 20130225 11 104 20130102 20131119 47 ; data db2; input id date V2; datalines; 101 20130719 26 103 20130224 55 105 20131107 10 ; run; data want; if _n_ = 1 then do; if 0 then set db2; declare hash ha(dataset:'db2',hashexp:20); ha.definekey('id'); ha.definedata('date'); ha.definedone(); end; set db1; if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. ); run; 2)Or if dataset have already been sorted , Merge statement data w; merge db1(in=ina) db2; by id; flag=ifn( start_date le date le end_date,1,. ); if ina; run;
Xia Keshan
Message was edited by: xia keshan
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Dear , your solutions are super fast and efficient. Thank you for your precious help.
May I ask you one more thing? My dataset db2 doesn't have unique id as in the toy example. That is, suppose I have the following (just added the 2nd row of db2, so that db2 doesn't have unique ids now):
data db1;
input id start_date end_date V1;
datalines;
101 20130126 20130814 19
101 20130729 20130829 77
102 20130201 20130401 14
103 20130217 20130225 11
104 20130102 20131119 47
;
data db2;
input id date V2;
datalines;
101 20130719 26
101 20130730 31
103 20130224 55
105 20131107 10
;
run;
This is the result I get right now: the flag is correctly assigned to the 1st record of db1, but not to the 2nd row of db1. That is,
data want;
input id start_date end_date V1 flag;
datalines;
101 20130126 20130814 19 1
101 20130729 20130829 77 .
102 20130201 20130401 14 .
103 20130217 20130225 11 1
104 20130102 20131119 47 .
;
I would like to obtain the following:
data want;
input id start_date end_date V1 flag;
datalines;
101 20130126 20130814 19 1
101 20130729 20130829 77 1
102 20130201 20130401 14 .
103 20130217 20130225 11 1
104 20130102 20131119 47 .
;
Please note the flag 1 in the second row this time.
Thank you for your help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A little tweak on 's code should do:
data want;
if _n_ = 1 then do;
if 0 then set db2;
declare hash ha(dataset:'db2',hashexp:20, multidata:'y');
ha.definekey('id');
ha.definedata('date');
ha.definedone();
end;
set db1;
rc=ha.find();
if rc ne 0 then flag=.;
else
do while (rc=0);
flag=ifn( start_date le date le end_date,1,flag);
rc=ha.find_next();
end;
drop rc;
run;
Haikuo
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Mark,
HaiKuo has already done it for me .
data db1; input id start_date end_date V1; datalines; 101 20130126 20130814 19 101 20130729 20130829 77 102 20130201 20130401 14 103 20130217 20130225 11 104 20130102 20131119 47 ; data db2; input id date V2; datalines; 101 20130719 26 101 20130730 31 103 20130224 55 105 20131107 10 ; run; data want; if _n_ = 1 then do; if 0 then set db2; declare hash ha(dataset:'db2',hashexp:20,multidata:'y'); ha.definekey('id'); ha.definedata('date'); ha.definedone(); end; set db1; rc=ha.find(); do while(rc=0); if start_date le date le end_date then flag=1; rc=ha.find_next(); end; run;
Xia Keshan
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you guys for all your help.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If 'id' in db2 is unique, and both tables are presorted by 'id', then the following will give you a boost on efficiency:
data db1;
input id start_date end_date V1;
datalines;
101 20130126 20130814 19
101 20130729 20130829 77
102 20130201 20130401 14
103 20130217 20130225 11
104 20130102 20131119 47
;
data db2;
input id date V2;
datalines;
101 20130719 26
103 20130224 55
105 20131107 10
;
data want;
merge db1 db2(drop=v2);
by id;
if START_DATE <= DATE <= END_DATE then flag=1;
run;
Well, if 'id' is not unique in db2, then we may need to resort to Hash object.
Good Luck,
Haikuo