I have data that looks like:
FLT_BIZ_UID | MSG_BIZ_UID | DATA_LINK_IND | ICOMM_IND | EDM_DATALINK | GANDER_DATALINK | GANDER_VOICE |
36184893 | 2758785634 | 0 | 0 | 0 | ||
36184893 | 2758845155 | 0 | 0 | 0 | ||
36184893 | 2758903711 | Y | 0 | 1 | 0 | |
36184893 | 2758903960 | Y | 0 | 0 | 1 | |
36184893 | 2758955914 | Y | 0 | 1 | 0 | |
36184893 | 2759004380 | Y | 0 | 1 | 0 | |
36184894 | 2758785634 | 0 | 0 | 0 | ||
36184894 | 2758845155 | 0 | 0 | 0 | ||
36184894 | 2758903711 | Y | 0 | 0 | 1 | |
36184894 | 2758903960 | Y | 0 | 0 | 1 | |
36184894 | 2758955914 | 0 | 0 | 0 | ||
36184894 | 2759004380 | 0 | 0 | 0 |
Every flt_biz_uid can have 1 or more messages (msg_biz_uid). I need to select flt_biz_uids that are voice only. Those that are voice only ALWAYS have DATA_LINK_IND missing, ICOMM_IND=Y, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=1.
So in the example above, flt_biz_uid 36184893 would not be chosen, but 36184894 would be. Would there be something in sql that would get his done?
Thx!
Assuming i understand the business logic
data have;
input FLT_BIZ_UID MSG_BIZ_UID DATA_LINK_IND :$1. ICOMM_IND :$1. EDM_DATALINK GANDER_DATALINK GANDER_VOICE;
cards;
36184893 2758785634 . . 0 0 0
36184893 2758845155 . . 0 0 0
36184893 2758903711 Y . 0 1 0
36184893 2758903960 . Y 0 0 1
36184893 2758955914 Y . 0 1 0
36184893 2759004380 Y . 0 1 0
36184894 2758785634 . . 0 0 0
36184894 2758845155 . . 0 0 0
36184894 2758903711 . Y 0 0 1
36184894 2758903960 . Y 0 0 1
36184894 2758955914 . . 0 0 0
36184894 2759004380 . . 0 0 0
;
proc sql;
create table want as
select *
from have
where FLT_BIZ_UID in
(select FLT_BIZ_UID from have
where DATA_LINK_IND =' ' AND ICOMM_IND = 'Y' AND EDM_DATALINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1)
group by FLT_BIZ_UID
having max(DATA_LINK_IND ne ' ' or sum(EDM_DATALINK,GANDER_DATALINK)>0)=0;
quit;
/*Or if you want just the qualifying subset i.e GANDER_VOICE=1 within FLT_BIZ_UID*/
proc sql;
create table want as
select *
from have
where FLT_BIZ_UID in
(select FLT_BIZ_UID from have
where DATA_LINK_IND =' ' AND ICOMM_IND = 'Y' AND EDM_DATALINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1)
group by FLT_BIZ_UID
having max(DATA_LINK_IND ne ' ' or sum(EDM_DATALINK,GANDER_DATALINK)>0)=0 and GANDER_VOICE;
quit;
Edit: I apparently missed the most important part, you don't want the first BIZID to appear. I'm updating the code will have it shortly...
PROC SQL;
CREATE TABLE Want AS
SELECT *
FROM HAVE
WHERE DATA_LINK_IND ="" AND ICOMM_IND = 'Y' AND EDM_DATELINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1;
QUIT;
After reviewing further help me understand why one vs the other is to appear? They both have a mix of data; both apply to the VOICE requirements you set.
If you have some requirements you can create a separate dataset outlining the DISTINCT ID's that should be excluded and then use the below to query that other dataset.
PROC SQL;
CREATE TABLE Want AS
SELECT *
FROM HAVE
WHERE DATA_LINK_IND ="" AND ICOMM_IND = 'Y' AND EDM_DATELINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1
AND FLT_BIZ_UID NOT IN (SELECT FLT_BIZ_UID FROM work.FilterList) ;
QUIT;
This would not work as it would return 36184893. It has to be on on every message within a flt_biz_uid. That is, the only observations that can occur WITHIN a flight_biz_uid are
DATA_LINK_IND ="" AND ICOMM_IND = 'Y' AND EDM_DATELINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1;
you will note that 36184893 has other messages that would preclude its selection
To reiterate:
Each flit_biz_uid (a flight) can have multiple messages (msg_biz_uid). Each message is a type of communication. I am only interested in flt_biz_uids (flights) that have ONLY voice communication throughout the ENTIRE flight. This means that the only messages that should exist for a flight are:
DATA_LINK_IND missing, ICOMM_IND=Y, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=1 for all messages in a flight.
So, for the first flight, flt_biz_uid of 36184893, it would not be chosen because there are OTHER messages that include other conditions.
Now, flt_biz_uid 36184894 would be chosen as ALL messages for that flight satisfy:
DATA_LINK_IND missing, ICOMM_IND=Y, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=1.
Thx
How/why do the first 2 AND last 2 observations not come into play here then?
36184894 | 2758785634 | 0 | 0 | 0 | ||
36184894 | 2758845155 | 0 | 0 | 0 |
36184894 | 2758955914 | 0 | 0 | 0 | ||
36184894 | 2759004380 | 0 | 0 | 0 |
you bring up an EXCELLENT point. I had to check with our radar people. So:
DATA_LINK_IND=missing, ICOMM_IND=missing, EDM_DATALINK=0, GANDER_DATALINK=0, GANDER_VOICE=0 really means an error occurred for that message. So this complicates everything. So:
1. For a given FLIGHT, remove all instances of DATA_LINK_IND=missing, ICOMM_IND=missing, EDM_DATALINK=0, GANDER_DATALINK=0, GANDER_VOICE=0
2. Then check to see for the remaining messages if only the following is true and return those flights
DATA_LINK_IND missing, ICOMM_IND=Y, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=1.
Assuming i understand the business logic
data have;
input FLT_BIZ_UID MSG_BIZ_UID DATA_LINK_IND :$1. ICOMM_IND :$1. EDM_DATALINK GANDER_DATALINK GANDER_VOICE;
cards;
36184893 2758785634 . . 0 0 0
36184893 2758845155 . . 0 0 0
36184893 2758903711 Y . 0 1 0
36184893 2758903960 . Y 0 0 1
36184893 2758955914 Y . 0 1 0
36184893 2759004380 Y . 0 1 0
36184894 2758785634 . . 0 0 0
36184894 2758845155 . . 0 0 0
36184894 2758903711 . Y 0 0 1
36184894 2758903960 . Y 0 0 1
36184894 2758955914 . . 0 0 0
36184894 2759004380 . . 0 0 0
;
proc sql;
create table want as
select *
from have
where FLT_BIZ_UID in
(select FLT_BIZ_UID from have
where DATA_LINK_IND =' ' AND ICOMM_IND = 'Y' AND EDM_DATALINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1)
group by FLT_BIZ_UID
having max(DATA_LINK_IND ne ' ' or sum(EDM_DATALINK,GANDER_DATALINK)>0)=0;
quit;
/*Or if you want just the qualifying subset i.e GANDER_VOICE=1 within FLT_BIZ_UID*/
proc sql;
create table want as
select *
from have
where FLT_BIZ_UID in
(select FLT_BIZ_UID from have
where DATA_LINK_IND =' ' AND ICOMM_IND = 'Y' AND EDM_DATALINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1)
group by FLT_BIZ_UID
having max(DATA_LINK_IND ne ' ' or sum(EDM_DATALINK,GANDER_DATALINK)>0)=0 and GANDER_VOICE;
quit;
@BCNAV wrote:
you bring up an EXCELLENT point. I had to check with our radar people. So:
DATA_LINK_IND=missing, ICOMM_IND=missing, EDM_DATALINK=0, GANDER_DATALINK=0, GANDER_VOICE=0 really means an error occurred for that message. So this complicates everything. So:
1. For a given FLIGHT, remove all instances of DATA_LINK_IND=missing, ICOMM_IND=missing, EDM_DATALINK=0, GANDER_DATALINK=0, GANDER_VOICE=0
2. Then check to see for the remaining messages if only the following is true and return those flights
DATA_LINK_IND missing, ICOMM_IND=Y, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=1.
That makes more sense. So you can delete that data first.
DATA HAVE; SET HAVE; IF DATA_LINK_IND = " " AND ICOMM_IND = " " .... THEN DELETE; RUN;
Then it looks like @novinosrin code is the only one you could copy paste and make no other changes in order to get the correct data.
Thank you all for your assistance!!
All this works great....but I am having an issues understanding the having statement. Suppose I want to change this to pull up errors. So for a flight series (whole flight) consists of
DATA_LINK_IND is missing, ICOMM_IND is missing, EDM_DATALINK=0, GANDER_DATALINK=0, and GANDER_VOICE=0.
How would there "where" and "having" statement change?
In addition
DATA_LINK_IND is not missing or ICOMM_IND is not missing , EDM_DATALINK=0, GANDER_DATALINK=1 or GANDER_VOICE=1
How would there "where" and "having" statement change?
Hi @BCNAV Let me explain the logic that the solution utilizes based upon assumptions of sorts
The following is how the algorithm works for the SQL processor to execute. Of course SQL optimizer would do other combinations in the join but that's beyond the scope of this thread.
So,
1. The subquery
(select FLT_BIZ_UID from have
where DATA_LINK_IND =' ' AND ICOMM_IND = 'Y' AND EDM_DATALINK = 0 AND GANDER_DATALINK = 0 AND GANDER_VOICE = 1)
will result in a subset of FLT_BIZ_UID of our interest, in other words the true records.
2. The WHERE clause that is part of the outer query i.e where FLT_BIZ_UID in (results of previous) would mean the outer query would result in matching FLT_BIZ_UID 's of interest however with both records that are true and not true. This in essence becomes a super set with which we are gonna play with.
3. Now, we group the resulting superset from 1 & 2 using a GROUP BY clause to examine whether a particular group would qualify for our final desired result.
4. Since we are aware HAVING clause generally works with filtering groups based on a summary/condition, we begin to utilise this clause at this point. Let's see how
5. The data seems to suggest for any valid ID's i.e all ID's in our superset are anyway valid because of our initial screening and filter with a sub-query that we used in the where clause as detailed in 1 & 2.
6. The False/untrue records within the superset are apparently the ones with DATA_LINK_IND with "Y" taken as pointing to something different or one of the numeric columns i.e EDM_DATALINK,GANDER_DATALINK would have a NON-ZERO value. This assumption pretty much classifies the UNTRUE/FALSE records within valid ID's
7. Since it's a group, only one or some records are likely to fall into the False category and therefore we need to determine how to Flag that for the entire group to be dropped considering the fact it contains both False and true records.
8. Therefore the condition described in point 6 is basically converted into a Boolean expression resulting in 1's for False records, 0's for true records.
9. Point 8 would result in ID's with false and true records to have both 1 and 0's while ID's with no false records to only have 0's in the Boolean expression.
10. Now we need to find a way to drop the entire ID group should it possess at least one FALSE record, therefore computing a MAX of the Boolean(1,0) would help us in determining the same and so if the max is zero for an ID group, we can safely subset that to our final result.-->max(DATA_LINK_IND ne ' ' or sum(EDM_DATALINK,GANDER_DATALINK)>0)=0
I hope the above details are clear enough to understand. If you do, I am sure you can play around considering that you know the algorithm behind the SQL syntax. Of course feel free to let us know any change once you think we are in the same page and then we shall jump on to the next. I am more than happy to work interactively meaning going back and forth. 🙂
Thanks novinosrin...helpful! However, what about the scenario when a flight has BOTH voice and data messages?
FLT_BIZ_UID | MSG_BIZ_UID | DATA_LINK_IND | ICOMM_IND | EDM_DATALINK | GANDER_DATALINK | GANDER_VOICE |
36174234 | 2758262666 | Y | 0 | 0 | 1 | |
36174234 | 2758262766 | Y | 0 | 1 | 0 | |
36174234 | 2758262822 | Y | 0 | 0 | 1 | |
36174234 | 2758269036 | Y | 0 | 1 | 0 | |
36174234 | 2758299683 | 0 | 0 | 0 | ||
36174234 | 2758299684 | Y | 0 | 1 | 0 | |
36174234 | 2758323649 | Y | 0 | 1 | 0 |
In this case I think the where statement would be:
where (ICOMM_IND = 'Y' or DATA_LINK_IND = 'Y') AND EDM_DATALINK = 0 AND (GANDER_DATALINK = 1 OR GANDER_VOICE = 1))
but I am unsure of the having statement (my crack is below):
having max((ICOMM_IND is not missing or DATA_LINK_IND is not missing) or sum(EDM_DATALINK,GANDER_DATALINK,GANDER_VOICE)>1) = 0
Thanks again!
Okay, so if you are trying to change the WHERE clause , that means we really have to make sure in determining what constitutes to being true records. So basically the "where" shouldn't change as that's what the business rule is. Once that is absolute, anything that's not true is false. Therefore, the diligence applies in establishing the business rule first i.e to be applied in where clause.
Once that is done, we would check for ID's with true and false records and drop those or in other words keep only those ID's with true records.
A change in business rule would warrant a change in WHERE clause is basically the principle, which essentially means that becomes a different program
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.