Conditional Count

Accepted Solution Solved
Reply
New Contributor
Posts: 3
Accepted Solution

Conditional Count

Hello,

I was wondering if there is a way to count observations based on parameters given in the current observation. To me this sounds like a macro, however I'm not too familiar with those and maybe you guys know how to do it in a data step or proc sql.

My data looks like this:

FL_DATECARRIERFL_NUMORIGINDESTcrs_dep_timecrs_arr_timemin_connex_timemax_connex_time
2012-01-01AA2400LAXDFW0:205:056:0510:05
2012-01-01AA2043SFODFW0:255:356:3510:35
2012-01-01AA2538LASORD0:456:157:1511:15
2012-01-01AA2297JFKMIA5:358:459:4513:45
2012-01-01AA1845BOSMIA5:409:1510:1514:15
2012-01-01AA1776DFWPHL6:1010:0511:0515:05
2012-01-01AA2042DFWFLL6:3510:0511:0515:05
2012-01-01AA392ORDLGA7:1510:2011:2015:20
2012-01-01AA1137ORDMSY7:209:3510:3514:35
2012-01-01AA1762MIAATL14:0016:0017:0021:00

The goal is to count for each flight number (fl_num) how many other flight numbers are able to connect.

I think I should not get too much into detail but mathematically I would explain it as this:

n=1 = current obs; N+k = each following

min_connex_time(n) < crs_dep_time (n+k) < max_connex_time (n)      AND      Dest(n) = Origin (n+k)

In words, the first observation describes a flight from Los Angeles to Denver. This can be connected to any flight in the following observations that fullfills the conditions, i.e. departing from the origin where the current flight arrived at and departing within the given timespan of min and max_arr_time

This should generate a count variable, which displays how many times the condition was met.

The result should look like this:

FL_DATECARRIERFL_NUMORIGINDESTcrs_dep_timecrs_arr_timemin_connex_timemax_connex_timeConnex
2012-01-01AA2400LAXDFW0:205:056:0510:052
2012-01-01AA2043SFODFW0:255:356:3510:351
2012-01-01AA2538LASORD0:456:157:1511:151
2012-01-01AA2297JFKMIA5:358:459:4513:450
2012-01-01AA1845BOSMIA5:409:1510:1514:151
2012-01-01AA1776DFWPHL6:1010:0511:0515:050
2012-01-01AA2042DFWFLL6:3510:0511:0515:050
2012-01-01AA392ORDLGA7:1510:2011:2015:200
2012-01-01AA1137ORDMSY7:209:3510:3514:350
2012-01-01AA1762MIAATL14:0016:0017:0021:000

Let me know if I need to clarify anything.

Thanks in advance for any hints.

Kind regards,

Sandun

Edit: I edited one flight number. They are usually distinct values, but in some instances of through flight two observations have the same number. But I will deal of that problem later


Accepted Solutions
Solution
‎01-15-2014 02:52 PM
PROC Star
Posts: 7,468

Re: Conditional Count

This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:

proc sql;

  create table all as

    select a.*,b.fl_num as con_flt

      from have a

        left join have b

         on a.dest eq b.origin

          and

         a.min_connex_time<=b.crs_dep_time<=max_connex_time

           order by a.crs_dep_time

  ;

  create table want as

    select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,

                    crs_dep_time,crs_arr_time,min_connex_time,

                    max_connex_time,

                    case

                      when missing(con_flt) then 0

                      else count(*)

                    end as counter

      from all

        group by FL_NUM

          order by crs_dep_time

  ;

quit;

View solution in original post


All Replies
PROC Star
Posts: 7,468

Re: Conditional Count

I would think that proc sql will be the easiest way to code a solution to your problem, but I do have some questions regarding your data.

You have 3 different arrival times and I don't think it is clear how they relate to each other.  e.g., flt 2400, as I understand it, might not arrive until 10:05 (max_arr_time).  However you indicated that it has two possible connections, even though no flights leave from DFW after 6:35.

New Contributor
Posts: 3

Re: Conditional Count

Ok, so min_arr_time and max_arr_time refer to the minimum connection time of +1h and maximum connection time +5h (Maybe I should've just named it that :smileygrin:  I will edit the variable names in the first post) Sorry for the confusion.

So in the case of Fl_num 2400 it arrives at 5:05h in DFW. Therefore it can connect to flights that depart within 6:05h and 10:05h, in this case flights 1776, departing at 6:10h, and 2042, departing at 6:35h.

Flight 2043 can only connect to 2042, since 1776 departs 20 mins earlier than minimum connection time.

Hope this helps!

Solution
‎01-15-2014 02:52 PM
PROC Star
Posts: 7,468

Re: Conditional Count

This could probably be done in one step, but I did it in two so that I could review the connection flights that were selected and counted:

proc sql;

  create table all as

    select a.*,b.fl_num as con_flt

      from have a

        left join have b

         on a.dest eq b.origin

          and

         a.min_connex_time<=b.crs_dep_time<=max_connex_time

           order by a.crs_dep_time

  ;

  create table want as

    select distinct FL_DATE, CARRIER,FL_NUM,ORIGIN,DEST,

                    crs_dep_time,crs_arr_time,min_connex_time,

                    max_connex_time,

                    case

                      when missing(con_flt) then 0

                      else count(*)

                    end as counter

      from all

        group by FL_NUM

          order by crs_dep_time

  ;

quit;

New Contributor
Posts: 3

Re: Conditional Count

You sir are awesome! Why didn't I come up with that solution?

Kudos for using the join in such an interesting way.

Thank you so much. This was my first post, so now you won me over and I'll stick around and see if I can help others as well.

Respected Advisor
Posts: 3,156

Re: Conditional Count

I am aware that Art's solution works (like it always does), here is another one using Hash.

data have;

input FL_DATE :yymmdd10. CARRIER :$2. (FL_NUM ORIGIN DEST) (:$4.) (crs_dep_time crs_arr_time min_connex_time max_connex_time) (:time5.);

cards;

2012-01-01 AA 2400 LAX DFW 0:20 5:05 6:05 10:05

2012-01-01 AA 2043 SFO DFW 0:25 5:35 6:35 10:35

2012-01-01 AA 2538 LAS ORD 0:45 6:15 7:15 11:15

2012-01-01 AA 2297 JFK MIA 5:35 8:45 9:45 13:45

2012-01-01 AA 1845 BOS MIA 5:40 9:15 10:15 14:15

2012-01-01 AA 1776 DFW PHL 6:10 10:05 11:05 15:05

2012-01-01 AA 2042 DFW FLL 6:35 10:05 11:05 15:05

2012-01-01 AA 392    ORD LGA 7:15 10:20 11:20 15:20

2012-01-01 AA 1137 ORD MSY 7:20 9:35 10:35 14:35

2012-01-01 AA 1762 MIA ATL 14:00 16:00 17:00 21:00

;

data want;

  if _n_=1 then do;

    declare hash h(dataset: 'have (rename=(crs_dep_time=_dep))', multidata:'y');

  h.definekey('fl_date', 'carrier', 'origin');

  h.definedata('_dep');

  h.definedone(); call missing (_dep);

  end;

  set have; 

 

    rc=h.find(key: fl_date, key: carrier, key: dest);

  count=0;

  do rc=0 by 0 while (rc=0);

   if min_connex_time < _dep < max_connex_time then count+1;

    rc=h.find_next(key: fl_date, key: carrier, key: dest);

  end;

  drop rc _dep;

run;

Haikuo

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 3031 views
  • 2 likes
  • 3 in conversation