Hello,
I was wondering if there is a way to count observations based on parameters given in the current observation. To me this sounds like a macro, however I'm not too familiar with those and maybe you guys know how to do it in a data step or proc sql.
My data looks like this:
FL_DATE | CARRIER | FL_NUM | ORIGIN | DEST | crs_dep_time | crs_arr_time | min_connex_time | max_connex_time |
2012-01-01 | AA | 2400 | LAX | DFW | 0:20 | 5:05 | 6:05 | 10:05 |
2012-01-01 | AA | 2043 | SFO | DFW | 0:25 | 5:35 | 6:35 | 10:35 |
2012-01-01 | AA | 2538 | LAS | ORD | 0:45 | 6:15 | 7:15 | 11:15 |
2012-01-01 | AA | 2297 | JFK | MIA | 5:35 | 8:45 | 9:45 | 13:45 |
2012-01-01 | AA | 1845 | BOS | MIA | 5:40 | 9:15 | 10:15 | 14:15 |
2012-01-01 | AA | 1776 | DFW | PHL | 6:10 | 10:05 | 11:05 | 15:05 |
2012-01-01 | AA | 2042 | DFW | FLL | 6:35 | 10:05 | 11:05 | 15:05 |
2012-01-01 | AA | 392 | ORD | LGA | 7:15 | 10:20 | 11:20 | 15:20 |
2012-01-01 | AA | 1137 | ORD | MSY | 7:20 | 9:35 | 10:35 | 14:35 |
2012-01-01 | AA | 1762 | MIA | ATL | 14:00 | 16:00 | 17:00 | 21:00 |
The goal is to count for each flight number (fl_num) how many other flight numbers are able to connect.
I think I should not get too much into detail but mathematically I would explain it as this:
n=1 = current obs; N+k = each following
min_connex_time(n) < crs_dep_time (n+k) < max_connex_time (n) AND Dest(n) = Origin (n+k)
In words, the first observation describes a flight from Los Angeles to Denver. This can be connected to any flight in the following observations that fullfills the conditions, i.e. departing from the origin where the current flight arrived at and departing within the given timespan of min and max_arr_time
This should generate a count variable, which displays how many times the condition was met.
The result should look like this:
FL_DATE | CARRIER | FL_NUM | ORIGIN | DEST | crs_dep_time | crs_arr_time | min_connex_time | max_connex_time | Connex |
2012-01-01 | AA | 2400 | LAX | DFW | 0:20 | 5:05 | 6:05 | 10:05 | 2 |
2012-01-01 | AA | 2043 | SFO | DFW | 0:25 | 5:35 | 6:35 | 10:35 | 1 |
2012-01-01 | AA | 2538 | LAS | ORD | 0:45 | 6:15 | 7:15 | 11:15 | 1 |
2012-01-01 | AA | 2297 | JFK | MIA | 5:35 | 8:45 | 9:45 | 13:45 | 0 |
2012-01-01 | AA | 1845 | BOS | MIA | 5:40 | 9:15 | 10:15 | 14:15 | 1 |
2012-01-01 | AA | 1776 | DFW | PHL | 6:10 | 10:05 | 11:05 | 15:05 | 0 |
2012-01-01 | AA | 2042 | DFW | FLL | 6:35 | 10:05 | 11:05 | 15:05 | 0 |
2012-01-01 | AA | 392 | ORD | LGA | 7:15 | 10:20 | 11:20 | 15:20 | 0 |
2012-01-01 | AA | 1137 | ORD | MSY | 7:20 | 9:35 | 10:35 | 14:35 | 0 |
2012-01-01 | AA | 1762 | MIA | ATL | 14:00 | 16:00 | 17:00 | 21:00 | 0 |
Let me know if I need to clarify anything.
Thanks in advance for any hints.
Kind regards,
Sandun
Edit: I edited one flight number. They are usually distinct values, but in some instances of through flight two observations have the same number. But I will deal of that problem later
This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:
proc sql;
create table all as
select a.*,b.fl_num as con_flt
from have a
left join have b
on a.dest eq b.origin
and
a.min_connex_time<=b.crs_dep_time<=max_connex_time
order by a.crs_dep_time
;
create table want as
select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,
crs_dep_time,crs_arr_time,min_connex_time,
max_connex_time,
case
when missing(con_flt) then 0
else count(*)
end as counter
from all
group by FL_NUM
order by crs_dep_time
;
quit;
I would think that proc sql will be the easiest way to code a solution to your problem, but I do have some questions regarding your data.
You have 3 different arrival times and I don't think it is clear how they relate to each other. e.g., flt 2400, as I understand it, might not arrive until 10:05 (max_arr_time). However you indicated that it has two possible connections, even though no flights leave from DFW after 6:35.
Ok, so min_arr_time and max_arr_time refer to the minimum connection time of +1h and maximum connection time +5h (Maybe I should've just named it that :smileygrin: I will edit the variable names in the first post) Sorry for the confusion.
So in the case of Fl_num 2400 it arrives at 5:05h in DFW. Therefore it can connect to flights that depart within 6:05h and 10:05h, in this case flights 1776, departing at 6:10h, and 2042, departing at 6:35h.
Flight 2043 can only connect to 2042, since 1776 departs 20 mins earlier than minimum connection time.
Hope this helps!
This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:
proc sql;
create table all as
select a.*,b.fl_num as con_flt
from have a
left join have b
on a.dest eq b.origin
and
a.min_connex_time<=b.crs_dep_time<=max_connex_time
order by a.crs_dep_time
;
create table want as
select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,
crs_dep_time,crs_arr_time,min_connex_time,
max_connex_time,
case
when missing(con_flt) then 0
else count(*)
end as counter
from all
group by FL_NUM
order by crs_dep_time
;
quit;
You sir are awesome! Why didn't I come up with that solution?
Kudos for using the join in such an interesting way.
Thank you so much. This was my first post, so now you won me over and I'll stick around and see if I can help others as well.
I am aware that Art's solution works (like it always does), here is another one using Hash.
data have;
input FL_DATE :yymmdd10. CARRIER :$2. (FL_NUM ORIGIN DEST) (:$4.) (crs_dep_time crs_arr_time min_connex_time max_connex_time) (:time5.);
cards;
2012-01-01 AA 2400 LAX DFW 0:20 5:05 6:05 10:05
2012-01-01 AA 2043 SFO DFW 0:25 5:35 6:35 10:35
2012-01-01 AA 2538 LAS ORD 0:45 6:15 7:15 11:15
2012-01-01 AA 2297 JFK MIA 5:35 8:45 9:45 13:45
2012-01-01 AA 1845 BOS MIA 5:40 9:15 10:15 14:15
2012-01-01 AA 1776 DFW PHL 6:10 10:05 11:05 15:05
2012-01-01 AA 2042 DFW FLL 6:35 10:05 11:05 15:05
2012-01-01 AA 392 ORD LGA 7:15 10:20 11:20 15:20
2012-01-01 AA 1137 ORD MSY 7:20 9:35 10:35 14:35
2012-01-01 AA 1762 MIA ATL 14:00 16:00 17:00 21:00
;
data want;
if _n_=1 then do;
declare hash h(dataset: 'have (rename=(crs_dep_time=_dep))', multidata:'y');
h.definekey('fl_date', 'carrier', 'origin');
h.definedata('_dep');
h.definedone(); call missing (_dep);
end;
set have;
rc=h.find(key: fl_date, key: carrier, key: dest);
count=0;
do rc=0 by 0 while (rc=0);
if min_connex_time < _dep < max_connex_time then count+1;
rc=h.find_next(key: fl_date, key: carrier, key: dest);
end;
drop rc _dep;
run;
Haikuo
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.