BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mark_ph
Calcite | Level 5

Dear All,

I have dataset db1 including variables: id, start_date, end_date, and V1.

data db1;

     input id    start_date     end_date     V1;

datalines;

101     20130126     20130814     19     

101     20130729     20130829     77

102     20130201     20130401     14

103     20130217     20130225     11

104     20130102     20131119     47

;

I also have dataset db2 including variables: id, date, and V2.


data db2;

     input id     date     V2;

datalines;

101     20130719     26     

103     20130224     55

105     20131107     10

;


I would like to create a variable FLAG in the dataset db1 such that FLAG = 1 if db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date. That is, I would like to obtain the following


data want;

     input id    start_date     end_date     V1     flag;

datalines;

101     20130126     20130814     19     1     

101     20130729     20130829     77     .

102     20130201     20130401     14     .    

103     20130217     20130225     11     1

104     20130102     20131119     47     .

;

My attempt involves PROC SQL and I'm able to isolate the desired records. However, I'm not able to attach a flag in the original dataset db1


proc sql;

  create table want

  as select db1.*

  from db1, db2

  where db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db2.end_date;

quit;

Any help would be highly appreciated.


1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
1)Hash Table




data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26     
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20);
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. );
run;











2)Or if dataset have already been sorted , Merge statement 




data w;
 merge db1(in=ina) db2;
 by id;
flag=ifn( start_date le date le end_date,1,. );
if ina;
run;

Xia Keshan

Message was edited by: xia keshan

View solution in original post

13 REPLIES 13
Haikuo
Onyx | Level 15

proc sql;

create table want as

  select *, case exists (select * from db2 where id=a.id and date between a.start_date and a.end_date) when 1 then 1 else . end as flag

  from db1 a

  ;

quit;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Hi,

proc sql;

     create table WANT as

     select     DB1.*,

                   case when DB2.ID is not null then 1 else . end as FLAG

     from       DB1 DB1

     left join   DB1 DB2    

     on          DB1.ID=DB2.ID

     and        DB1.START_DATE <= DB2.DATE <= DB1.END_DATE;

quit;

stat_sas
Ammonite | Level 13

proc sql;

select db1.*,case when db2.date >= db1.start_date and db2.date <= db1.end_date then 1

end as flag

  from db1 left join db2

  on db1.id = db2.id;

quit;

mark_ph
Calcite | Level 5

, , @stat@sas, thank you for your replies.

I was wondering whether it is possible to gain a bit in efficiency because my dataset db1 has >30M records and db2 around 0.3 M. It's going to take days in this way (still running...)

Maybe knowing db1 is sorted on id and date and db2 is sorted on id and start_date, SAS could stop checking the condition for remaining records in db2 when db2.id > db1.id.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I'm not thinking of an easy way.  Whichever way you look at it your going to have to join the two datasets at some point.  There are many tips and tricks on here to speed up things, e.g. break your 30m records down into smaller datasets (maybe on id or something).  You can use indexes, get distinct list of ids which match and only do the check on those etc.  However at the end of the 30m records is going to take a while regardless.  Maybe also evaluate if adding a flag to the big dataset is the best way forward.

Loko
Barite | Level 11

Hello,

proc sql;

  create table want

  as select db1.*,

  case

  when db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date then 1

  else .

  end as flag

  from db1 left join db2 on db1.id = db2.id;

quit;

Ksharp
Super User
1)Hash Table




data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26     
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20);
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. );
run;











2)Or if dataset have already been sorted , Merge statement 




data w;
 merge db1(in=ina) db2;
 by id;
flag=ifn( start_date le date le end_date,1,. );
if ina;
run;

Xia Keshan

Message was edited by: xia keshan

mark_ph
Calcite | Level 5

Dear , your solutions are super fast and efficient. Thank you for your precious help.

May I ask you one more thing? My dataset db2 doesn't have unique id as in the toy example. That is, suppose I have the following (just added the 2nd row of db2, so that db2 doesn't have unique ids now):

data db1;

     input id    start_date     end_date     V1;

datalines;

101     20130126     20130814     19   

101     20130729     20130829     77

102     20130201     20130401     14

103     20130217     20130225     11

104     20130102     20131119     47

;

data db2;

     input id     date     V2;

datalines;

101     20130719     26

101     20130730     31

103     20130224     55

105     20131107     10

;

run;

This is the result I get right now: the flag is correctly assigned to the 1st record of db1, but not to the 2nd row of db1. That is,


data want;

     input id    start_date     end_date     V1     flag;

datalines;

101     20130126     20130814     19     1    

101     20130729     20130829     77     .

102     20130201     20130401     14     .   

103     20130217     20130225     11     1

104     20130102     20131119     47     .

;


I would like to obtain the following:

data want;

     input id    start_date     end_date     V1     flag;

datalines;

101     20130126     20130814     19     1    

101     20130729     20130829     77     1

102     20130201     20130401     14     .   

103     20130217     20130225     11     1

104     20130102     20131119     47     .

;


Please note the flag 1 in the second row this time.


Thank you for your help.

Haikuo
Onyx | Level 15

A little tweak on 's code should do:

data want;

if _n_ = 1 then do;

if 0 then set db2;

declare hash ha(dataset:'db2',hashexp:20, multidata:'y');

ha.definekey('id');

ha.definedata('date');

ha.definedone();

end;

set db1;

rc=ha.find();

if rc ne 0 then flag=.;

else

do  while (rc=0);

flag=ifn( start_date le date le end_date,1,flag);

rc=ha.find_next();

end;

drop rc;

run;

Haikuo

mark_ph
Calcite | Level 5

Dear , thank you so much for your help!!

Ksharp
Super User

Hi Mark,

HaiKuo has already done it for me .

data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26
101     20130730     31
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20,multidata:'y');
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
rc=ha.find();
do while(rc=0);
  if start_date le date le end_date then flag=1;
  rc=ha.find_next();
end;
run;



Xia Keshan

mark_ph
Calcite | Level 5

Thank you guys for all your help.

Haikuo
Onyx | Level 15

If 'id' in db2 is unique, and both tables are presorted by 'id', then the following will give you a boost on efficiency:

data db1;

input id    start_date     end_date     V1;

datalines;

101     20130126     20130814     19

101     20130729     20130829     77

102     20130201     20130401     14

103     20130217     20130225     11

104     20130102     20131119     47

;

data db2;

input id     date     V2;

datalines;

101     20130719     26

103     20130224     55

105     20131107     10

;

data want;

merge db1 db2(drop=v2);

by id;

     if START_DATE <= DATE <= END_DATE then flag=1;

run;

Well, if 'id' is not unique in db2, then we may need to resort to Hash object.

Good Luck,

Haikuo

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 23110 views
  • 6 likes
  • 6 in conversation