- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have raw flight data that I need to make selections from. Each flight is assigned a unique flt_biz_uid. Each flight can have many messages (msg_biz_uid) based on a flight plan. A message can contain a waypoint (flt_pint_code). A paired down flightplan for one flight is:
flt_biz_uid | msg_biz_uid | flt_pt_code |
37532375 | 2859985831 | IGN |
37532375 | 2859985832 | BETBE |
37532375 | 2859985833 | KSFO |
37532375 | 2859985834 | KJFK |
37532375 | 2859985835 | ZORUN |
37532375 | 2859985836 | GALLI |
37532375 | 2859985837 | ORRCA |
37532375 | 2859985838 | PRNCE |
37532375 | 2859985839 | LORAH |
37532375 | 2859985840 | RUBKI |
37532375 | 2859985841 | DAYYY |
37532375 | 2859985842 | SIKBO |
37532375 | 2859985843 | TULEG |
37532375 | 2859985844 | RAP |
37532375 | 2859985845 | CZI |
37532375 | 2859985846 | BOY |
37532375 | 2859985847 | MLD |
This goes on for millions of flights. I would like to be able to pick out flt_biz_uids that have specific routes (waypoints). If the above was the complete dataset, and if I wanted to pull out flights that included RUBKI, SIKBO, and TULEG, then ALL of the data above would be pulled. Had I wanted, say, SNOOPY1, SNOOPY2, SNOOPY3 from the above dataset example, then nothing would be returned. I am thinking this maybe SQL? Not sure. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @BCNAV To my mind, it does seem you are in the right track to think of SQL. Am i understanding the req correctly?
data have;
input (flt_biz_uid msg_biz_uid flt_pt_code) (:$20.);
cards;
37532375 2859985831 IGN
37532375 2859985832 BETBE
37532375 2859985833 KSFO
37532375 2859985834 KJFK
37532375 2859985835 ZORUN
37532375 2859985836 GALLI
37532375 2859985837 ORRCA
37532375 2859985838 PRNCE
37532375 2859985839 LORAH
37532375 2859985840 RUBKI
37532375 2859985841 DAYYY
37532375 2859985842 SIKBO
37532375 2859985843 TULEG
37532375 2859985844 RAP
37532375 2859985845 CZI
37532375 2859985846 BOY
37532375 2859985847 MLD
;
%let select_flightpoints='RUBKI', 'SIKBO', 'TULEG';
proc sql;
create table want as
select *
from have
where flt_biz_uid in (select flt_biz_uid from have where flt_pt_code in (&select_flightpoints))
order by flt_biz_uid,msg_biz_uid;
quit;
/*Or*/
Should all select_flightpoints be strictly present then , the below SQL version should suffice
%let select_flightpoints='RUBKI', 'SIKBO', 'TULEG';
proc sql;
create table want as
select *
from have
where flt_biz_uid in
(select flt_biz_uid
from have
where flt_pt_code in (&select_flightpoints)
group by flt_biz_uid
having count(distinct flt_pt_code)=countw("&select_flightpoints",','));
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @BCNAV To my mind, it does seem you are in the right track to think of SQL. Am i understanding the req correctly?
data have;
input (flt_biz_uid msg_biz_uid flt_pt_code) (:$20.);
cards;
37532375 2859985831 IGN
37532375 2859985832 BETBE
37532375 2859985833 KSFO
37532375 2859985834 KJFK
37532375 2859985835 ZORUN
37532375 2859985836 GALLI
37532375 2859985837 ORRCA
37532375 2859985838 PRNCE
37532375 2859985839 LORAH
37532375 2859985840 RUBKI
37532375 2859985841 DAYYY
37532375 2859985842 SIKBO
37532375 2859985843 TULEG
37532375 2859985844 RAP
37532375 2859985845 CZI
37532375 2859985846 BOY
37532375 2859985847 MLD
;
%let select_flightpoints='RUBKI', 'SIKBO', 'TULEG';
proc sql;
create table want as
select *
from have
where flt_biz_uid in (select flt_biz_uid from have where flt_pt_code in (&select_flightpoints))
order by flt_biz_uid,msg_biz_uid;
quit;
/*Or*/
Should all select_flightpoints be strictly present then , the below SQL version should suffice
%let select_flightpoints='RUBKI', 'SIKBO', 'TULEG';
proc sql;
create table want as
select *
from have
where flt_biz_uid in
(select flt_biz_uid
from have
where flt_pt_code in (&select_flightpoints)
group by flt_biz_uid
having count(distinct flt_pt_code)=countw("&select_flightpoints",','));
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the table is not sorted, something like this should work:
proc sql;
select a.*
from TAB a
,TAB b
,TAB c
,TAB d
where b.FLT_PT_CODE = 'SNOOPY1'
and c.FLT_PT_CODE = 'SNOOPY2'
and d.FLT_PT_CODE = 'SNOOPY3'
and a.FLT_BIZ_UID = b.FLT_BIZ_UID
and a.FLT_BIZ_UID = c.FLT_BIZ_UID
and a.FLT_BIZ_UID = d.FLT_BIZ_UID ;
If it is sorted, there may be faster methods.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If the data is sorted by flt_biz_uid then do:
%let select_flightpoints=RUBKI SIKBO TULEG;
data want;
array sf{*} &select_flightpoints;
do until(last.flt_biz_uid);
set have; by flt_biz_uid;
do i = 1 to dim(sf);
if flt_pt_code = vname(sf{i}) then sf{i} = 1;
end;
end;
do until(last.flt_biz_uid);
set have; by flt_biz_uid;
if sum(of sf{*}) = dim(sf) then output;
end;
drop i &select_flightpoints;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If your data are sorted by here's a proof-of-concept for the Data step:
data want (drop=nfound);
set have (where= flt_pt_cod in ('RUBKI','SIKBO','TULEG') in=foundwaypoints)
have (in=allwaypoints)
by FLTBIZ_UID ;
if first.fltbiz_uid=1 then nfound=0;
nfound+foundwaypoints;
if allwaypoints=1 and nfound=3;
run;
Of course, this assumes that no waypoint is recorded more than once per flight.
I assume the SQL solution is more common, but this may well be faster, primarily because it takes advantage of the fact that the data are sorted. So, unlike the SQL solution, it doesn't have to keep somewhere in memory the flight id's of all the flights (both qualifying and not), in case a flight reappears later in the data set.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------