Solved: Re: Adding flag column when condition is met

mark_ph · Posted 09-25-2014 09:43 AM

Dear All,

I have dataset db1 including variables: id, start_date, end_date, and V1.

data db1;

input id start_date end_date V1;

datalines;

101 20130126 20130814 19

101 20130729 20130829 77

102 20130201 20130401 14

103 20130217 20130225 11

104 20130102 20131119 47

;

I also have dataset db2 including variables: id, date, and V2.

data db2;

input id date V2;

datalines;

101 20130719 26

103 20130224 55

105 20131107 10

;

I would like to create a variable FLAG in the dataset db1 such that FLAG = 1 if db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date. That is, I would like to obtain the following

data want;

input id start_date end_date V1 flag;

datalines;

101 20130126 20130814 19 1

101 20130729 20130829 77 .

102 20130201 20130401 14 .

103 20130217 20130225 11 1

104 20130102 20131119 47 .

;

My attempt involves PROC SQL and I'm able to isolate the desired records. However, I'm not able to attach a flag in the original dataset db1

proc sql;

create table want

as select db1.*

from db1, db2

where db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db2.end_date;

quit;

Any help would be highly appreciated.

Ksharp · Posted 09-26-2014 09:23 AM

1)Hash Table




data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26     
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20);
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. );
run;











2)Or if dataset have already been sorted , Merge statement 




data w;
 merge db1(in=ina) db2;
 by id;
flag=ifn( start_date le date le end_date,1,. );
if ina;
run;

Xia Keshan

Message was edited by: xia keshan

View solution in original post

Haikuo · Posted 09-25-2014 09:52 AM

proc sql;

create table want as

select *, case exists (select * from db2 where id=a.id and date between a.start_date and a.end_date) when 1 then 1 else . end as flag

from db1 a

;

quit;

RW9 · Posted 09-25-2014 09:52 AM

Hi,

proc sql;

create table WANT as

select DB1.*,

case when DB2.ID is not null then 1 else . end as FLAG

from DB1 DB1

left join DB1 DB2

on DB1.ID=DB2.ID

and DB1.START_DATE <= DB2.DATE <= DB1.END_DATE;

quit;

stat_sas · Posted 09-25-2014 09:53 AM

proc sql;

select db1.*,case when db2.date >= db1.start_date and db2.date <= db1.end_date then 1

end as flag

from db1 left join db2

on db1.id = db2.id;

quit;

mark_ph · Posted 09-26-2014 08:01 AM

, , @stat@sas, thank you for your replies.

I was wondering whether it is possible to gain a bit in efficiency because my dataset db1 has >30M records and db2 around 0.3 M. It's going to take days in this way (still running...)

Maybe knowing db1 is sorted on id and date and db2 is sorted on id and start_date, SAS could stop checking the condition for remaining records in db2 when db2.id > db1.id.

RW9 · Posted 09-26-2014 08:20 AM

I'm not thinking of an easy way. Whichever way you look at it your going to have to join the two datasets at some point. There are many tips and tricks on here to speed up things, e.g. break your 30m records down into smaller datasets (maybe on id or something). You can use indexes, get distinct list of ids which match and only do the check on those etc. However at the end of the 30m records is going to take a while regardless. Maybe also evaluate if adding a flag to the big dataset is the best way forward.

Loko · Posted 09-26-2014 08:46 AM

Hello,

proc sql;

create table want

as select db1.*,

case

when db1.id = db2.id and db2.date >= db1.start_date and db2.date <= db1.end_date then 1

else .

end as flag

from db1 left join db2 on db1.id = db2.id;

quit;

Ksharp · Posted 09-26-2014 09:23 AM

1)Hash Table




data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26     
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20);
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
if ha.find()=0 then flag=ifn( start_date le date le end_date,1,. );
run;











2)Or if dataset have already been sorted , Merge statement 




data w;
 merge db1(in=ina) db2;
 by id;
flag=ifn( start_date le date le end_date,1,. );
if ina;
run;

Xia Keshan

Message was edited by: xia keshan

mark_ph · Posted 09-28-2014 10:40 AM

Dear , your solutions are super fast and efficient. Thank you for your precious help.

May I ask you one more thing? My dataset db2 doesn't have unique id as in the toy example. That is, suppose I have the following (just added the 2nd row of db2, so that db2 doesn't have unique ids now):

data db1;

input id start_date end_date V1;

datalines;

101 20130126 20130814 19

101 20130729 20130829 77

102 20130201 20130401 14

103 20130217 20130225 11

104 20130102 20131119 47

;

data db2;

input id date V2;

datalines;

101 20130719 26

101 20130730 31

103 20130224 55

105 20131107 10

;

run;

This is the result I get right now: the flag is correctly assigned to the 1st record of db1, but not to the 2nd row of db1. That is,

data want;

input id start_date end_date V1 flag;

datalines;

101 20130126 20130814 19 1

101 20130729 20130829 77 .

102 20130201 20130401 14 .

103 20130217 20130225 11 1

104 20130102 20131119 47 .

;

I would like to obtain the following:

data want;

input id start_date end_date V1 flag;

datalines;

101 20130126 20130814 19 1

101 20130729 20130829 77 1

102 20130201 20130401 14 .

103 20130217 20130225 11 1

104 20130102 20131119 47 .

;

Please note the flag 1 in the second row this time.

Thank you for your help.

Haikuo · Posted 09-28-2014 12:25 PM

A little tweak on 's code should do:

data want;

if _n_ = 1 then do;

if 0 then set db2;

declare hash ha(dataset:'db2',hashexp:20, multidata:'y');

ha.definekey('id');

ha.definedata('date');

ha.definedone();

end;

set db1;

rc=ha.find();

if rc ne 0 then flag=.;

else

do while (rc=0);

flag=ifn( start_date le date le end_date,1,flag);

rc=ha.find_next();

end;

drop rc;

run;

Haikuo

mark_ph · Posted 09-28-2014 01:14 PM

Dear , thank you so much for your help!!

Ksharp · Posted 09-29-2014 08:08 AM

Hi Mark,

HaiKuo has already done it for me .

data db1;
     input id    start_date     end_date     V1;
datalines;
101     20130126     20130814     19     
101     20130729     20130829     77
102     20130201     20130401     14
103     20130217     20130225     11
104     20130102     20131119     47
;
data db2;
     input id     date     V2;
datalines;
101     20130719     26
101     20130730     31
103     20130224     55
105     20131107     10
;
run;

data want;
 if _n_ = 1 then do;
 if 0 then set db2;
 declare hash ha(dataset:'db2',hashexp:20,multidata:'y');
 ha.definekey('id');
 ha.definedata('date');
 ha.definedone();
end;
set db1;
rc=ha.find();
do while(rc=0);
  if start_date le date le end_date then flag=1;
  rc=ha.find_next();
end;
run;

Xia Keshan

mark_ph · Posted 09-29-2014 08:12 AM

Thank you guys for all your help.

Haikuo · Posted 09-26-2014 09:33 AM

If 'id' in db2 is unique, and both tables are presorted by 'id', then the following will give you a boost on efficiency:

data db1;

input id start_date end_date V1;

datalines;

101 20130126 20130814 19

101 20130729 20130829 77

102 20130201 20130401 14

103 20130217 20130225 11

104 20130102 20131119 47

;

data db2;

input id date V2;

datalines;

101 20130719 26

103 20130224 55

105 20131107 10

;

data want;

merge db1 db2(drop=v2);

by id;

if START_DATE <= DATE <= END_DATE then flag=1;

run;

Well, if 'id' is not unique in db2, then we may need to resort to Hash object.

Good Luck,

Haikuo

Classroom Training Available!