Solved: Conditional Count

Sandun89 · Posted 01-15-2014 10:09 AM

Hello,

I was wondering if there is a way to count observations based on parameters given in the current observation. To me this sounds like a macro, however I'm not too familiar with those and maybe you guys know how to do it in a data step or proc sql.

My data looks like this:

FL_DATE	CARRIER	FL_NUM	ORIGIN	DEST	crs_dep_time	crs_arr_time	min_connex_time	max_connex_time
2012-01-01	AA	2400	LAX	DFW	0:20	5:05	6:05	10:05
2012-01-01	AA	2043	SFO	DFW	0:25	5:35	6:35	10:35
2012-01-01	AA	2538	LAS	ORD	0:45	6:15	7:15	11:15
2012-01-01	AA	2297	JFK	MIA	5:35	8:45	9:45	13:45
2012-01-01	AA	1845	BOS	MIA	5:40	9:15	10:15	14:15
2012-01-01	AA	1776	DFW	PHL	6:10	10:05	11:05	15:05
2012-01-01	AA	2042	DFW	FLL	6:35	10:05	11:05	15:05
2012-01-01	AA	392	ORD	LGA	7:15	10:20	11:20	15:20
2012-01-01	AA	1137	ORD	MSY	7:20	9:35	10:35	14:35
2012-01-01	AA	1762	MIA	ATL	14:00	16:00	17:00	21:00

The goal is to count for each flight number (fl_num) how many other flight numbers are able to connect.

I think I should not get too much into detail but mathematically I would explain it as this:

n=1 = current obs; N+k = each following

min_connex_time(n) < crs_dep_time (n+k) < max_connex_time (n) AND Dest(n) = Origin (n+k)

In words, the first observation describes a flight from Los Angeles to Denver. This can be connected to any flight in the following observations that fullfills the conditions, i.e. departing from the origin where the current flight arrived at and departing within the given timespan of min and max_arr_time

This should generate a count variable, which displays how many times the condition was met.

The result should look like this:

FL_DATE	CARRIER	FL_NUM	ORIGIN	DEST	crs_dep_time	crs_arr_time	min_connex_time	max_connex_time	Connex
2012-01-01	AA	2400	LAX	DFW	0:20	5:05	6:05	10:05	2
2012-01-01	AA	2043	SFO	DFW	0:25	5:35	6:35	10:35	1
2012-01-01	AA	2538	LAS	ORD	0:45	6:15	7:15	11:15	1
2012-01-01	AA	2297	JFK	MIA	5:35	8:45	9:45	13:45	0
2012-01-01	AA	1845	BOS	MIA	5:40	9:15	10:15	14:15	1
2012-01-01	AA	1776	DFW	PHL	6:10	10:05	11:05	15:05	0
2012-01-01	AA	2042	DFW	FLL	6:35	10:05	11:05	15:05	0
2012-01-01	AA	392	ORD	LGA	7:15	10:20	11:20	15:20	0
2012-01-01	AA	1137	ORD	MSY	7:20	9:35	10:35	14:35	0
2012-01-01	AA	1762	MIA	ATL	14:00	16:00	17:00	21:00	0

Let me know if I need to clarify anything.

Thanks in advance for any hints.

Kind regards,

Sandun

Edit: I edited one flight number. They are usually distinct values, but in some instances of through flight two observations have the same number. But I will deal of that problem later

art297 · Posted 01-15-2014 02:51 PM

This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:

proc sql;

create table all as

select a.*,b.fl_num as con_flt

from have a

left join have b

on a.dest eq b.origin

and

a.min_connex_time<=b.crs_dep_time<=max_connex_time

order by a.crs_dep_time

;

create table want as

select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,

crs_dep_time,crs_arr_time,min_connex_time,

max_connex_time,

case

when missing(con_flt) then 0

else count(*)

end as counter

from all

group by FL_NUM

order by crs_dep_time

;

quit;

View solution in original post

art297 · Posted 01-15-2014 11:08 AM

I would think that proc sql will be the easiest way to code a solution to your problem, but I do have some questions regarding your data.

You have 3 different arrival times and I don't think it is clear how they relate to each other. e.g., flt 2400, as I understand it, might not arrive until 10:05 (max_arr_time). However you indicated that it has two possible connections, even though no flights leave from DFW after 6:35.

Sandun89 · Posted 01-15-2014 11:35 AM

Ok, so min_arr_time and max_arr_time refer to the minimum connection time of +1h and maximum connection time +5h (Maybe I should've just named it that :smileygrin: I will edit the variable names in the first post) Sorry for the confusion.

So in the case of Fl_num 2400 it arrives at 5:05h in DFW. Therefore it can connect to flights that depart within 6:05h and 10:05h, in this case flights 1776, departing at 6:10h, and 2042, departing at 6:35h.

Flight 2043 can only connect to 2042, since 1776 departs 20 mins earlier than minimum connection time.

Hope this helps!

art297 · Posted 01-15-2014 02:51 PM

This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:

proc sql;

create table all as

select a.*,b.fl_num as con_flt

from have a

left join have b

on a.dest eq b.origin

and

a.min_connex_time<=b.crs_dep_time<=max_connex_time

order by a.crs_dep_time

;

create table want as

select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,

crs_dep_time,crs_arr_time,min_connex_time,

max_connex_time,

case

when missing(con_flt) then 0

else count(*)

end as counter

from all

group by FL_NUM

order by crs_dep_time

;

quit;

Sandun89 · Posted 01-15-2014 03:49 PM

You sir are awesome! Why didn't I come up with that solution?

Kudos for using the join in such an interesting way.

Thank you so much. This was my first post, so now you won me over and I'll stick around and see if I can help others as well.

Haikuo · Posted 01-15-2014 10:49 PM

I am aware that Art's solution works (like it always does), here is another one using Hash.

data have;

input FL_DATE :yymmdd10. CARRIER :$2. (FL_NUM ORIGIN DEST) (:$4.) (crs_dep_time crs_arr_time min_connex_time max_connex_time) (:time5.);

cards;

2012-01-01 AA 2400 LAX DFW 0:20 5:05 6:05 10:05

2012-01-01 AA 2043 SFO DFW 0:25 5:35 6:35 10:35

2012-01-01 AA 2538 LAS ORD 0:45 6:15 7:15 11:15

2012-01-01 AA 2297 JFK MIA 5:35 8:45 9:45 13:45

2012-01-01 AA 1845 BOS MIA 5:40 9:15 10:15 14:15

2012-01-01 AA 1776 DFW PHL 6:10 10:05 11:05 15:05

2012-01-01 AA 2042 DFW FLL 6:35 10:05 11:05 15:05

2012-01-01 AA 392 ORD LGA 7:15 10:20 11:20 15:20

2012-01-01 AA 1137 ORD MSY 7:20 9:35 10:35 14:35

2012-01-01 AA 1762 MIA ATL 14:00 16:00 17:00 21:00

;

data want;

if _n_=1 then do;

declare hash h(dataset: 'have (rename=(crs_dep_time=_dep))', multidata:'y');

h.definekey('fl_date', 'carrier', 'origin');

h.definedata('_dep');

h.definedone(); call missing (_dep);

end;

set have;

rc=h.find(key: fl_date, key: carrier, key: dest);

count=0;

do rc=0 by 0 while (rc=0);

if min_connex_time < _dep < max_connex_time then count+1;

rc=h.find_next(key: fl_date, key: carrier, key: dest);

end;

drop rc _dep;

run;

Haikuo

Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

Re: Conditional Count

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away