Solved: Re: Complex cross referencing within groups

Yegen · Posted 06-14-2017 03:32 AM

I am trying to construct a new variable called "Control" for each ID (i.e., Row) by comparing the IDs that each MGRNO manages in a given Group. There are only 2 managers (i.e., MGRNOs) within a given group and all the conditions are by Group (comparing the IDs of each MGRNO to the other MGRNO in the given Group). The rules are as follows:

Control=1 whenever at least two identical IDs within the same Industry are managed by both MGRNOs (e.g., ID=23298321 and ID=53228424 are in the Industry and managed by both MGRNOs in the given Group). However, if there is only one ID managed by one MGRNO in a given Industry and the other MGRNO manages other IDs including the identical ID then Control=0 for that particular case (e.g., MGRNO 1 manages only ID=12690310 and MGRNO 2 manages ID=12690310 and ID=12690311 which are in the same industry). If there are at least two identical IDs (e.g., 123 and 321) managed by both MGRNOs, then Control=1. However, if it's only one identical ID (e.g., case of 12690310), then Control=0.
Control=1 whenever two (or more) non-identical IDs within the same Industry are managed by both MGRNOs (e.g., MGRNO 2 manages 99998881 and 01737211, whereas MGRNO 1 manages 12290311 but all are in the same industry).
Else Control = 0. Examples when Control=0 (1) whenever two non-identical IDs within the same Industry are managed only by one MGRNO and there is no ID within the same Idustry that is managed by the other MGRNO that is in the same group (e.g., MGRNO 1 manages IDs 123, 234, and 345 that are in the same Industry but MGRNO 2 does not manage any IDs in that particular industry). Control=0 whenever there is only one ID in a given Industry.

Update: Please see the picture below for clarification.

Here is my sample dataset. Ideally, the output needs to look like column "Control". I have added some clarification about the output below as well:

*The column "want" is the output I need and "pair" is the corresponding 
relationship with the other Manager (MGRNO). The pair tracks which ID pair is its match. All analysis is done by Group and the IDs of each MGRNOs within the Group are compared. There are only two MGRNOs in a given Group;

DATA sample;
	INPUT Mgrno Group ID Industry Control Pair;
	DATALINES; 
	9999 1 00462610 1842 0 .
	9999 1 01737210 1842 0 .
	9999 1 12690310 2832 0 .
	9999 1 23298321 4238 1 1
	9999 1 53228424 4238 1 1
	9999 1 12290311 1233 1 2
	8888 1 01737211 1233 1 2
	8888 1 99998881 1233 1 2
	8888 1 23298321 4238 1 1
	8888 1 53228424 4238 1 1
	8888 1 37158620 2124 0 .
	8888 1 12690310 2832 0 .
	8888 1 12690311 2832 0 .
	7777 2 12690310 2832 1 1
	7777 2 37158620 2124 0 .
	7777 2 38138222 3321 1 2
	7777 2 53228424 4238 1 3
	3333 2 32690311 2832 1 1
	3333 2 44134422 3321 1 2 
	3333 2 77233421 4238 1 3 
	3333 2 12290311 1233 0 .
	3333 2 93234320 1333 0 .
;
RUN;

Following the suggestions made by @Shmuel and @Ksharp in a related questions, I have tried the following but I am stuck moving forward. Any help and suggestions will be appreciated.

proc sort data=sample;
    by Group Industry ID;
quit;

*Deleting the Industries with only one ID;
DATA want_V1;
    set sample;
    by Group Industry ID;
    if not (first.Industry and last.Industry); 
RUN;

data want_V2;
	set want_V1;
	by Group Industry;
	retain lagID;
	if lagID^=ID then Possible=1;
	lagID=ID;
run;

gamotte · Posted 06-15-2017 04:35 AM

Hello,

I haven't read all your explanations because I'm lazy but the following program seems to give the expected output :

proc sql noprint;
	/* We exclude industries associated with only one manager */
	CREATE TABLE Sample_wo_1_mgr_industries
	AS SELECT *
	FROM Sample
	GROUP BY Group, Industry
	HAVING count(DISTINCT MGRNO) ne 1;

	/* We identify IDs having both managers in the group */
	CREATE TABLE CNT_MGR AS
	SELECT DISTINCT MGRNO, GROUP, INDUSTRY, ID, count(*)=2 AS MGR2
	FROM Sample_wo_1_mgr_industries
	GROUP BY Group, Industry, ID
	;

	/* We set Control=1 if there is not exactly one ID prsent for both manager in the given Group/Industry , 0 otherwise */
	CREATE TABLE Sample_with_Control AS
	SELECT a.*, sum(b.Control,0) AS Control
	FROM Sample a
	LEFT JOIN (
		SELECT DISTINCT GROUP, INDUSTRY, sum(MGR2) AS SUM_MGR2, 1 AS Control
		FROM CNT_MGR
		GROUP BY Group, Industry
		HAVING sum(MGR2) ne 2
	) b
	ON b.Group=a.Group AND b.Industry=a.Industry
	ORDER BY a.Group, a.Industry, a.Mgrno, a.ID
	;
quit;

View solution in original post

gamotte · Posted 06-14-2017 06:10 AM

Hello,

The way you describe your problem seems unnecessarily complex.

Can we rephrase 1 and 2 as :

Whenever, for a given group, the same industry appears with both managers,

all rows associated with this group and industry will have Control=1.

Control=0 on the other rows.

If so,

- in your example, Group 1, Industry 2832 should have Control=1 instead of 0.

- the following query gets the rows with Control=1

proc sql noprint;
	CREATE TABLE CTRL AS
	SELECT *, 1 AS Control
	FROM Sample
	GROUP BY Group, Industry
	Having count(DISTINCT MGRNO)=2
	ORDER BY Group, MGRNO;
quit;

Yegen · Posted 06-14-2017 07:38 AM

Thanks @gamotte. Your code does not take into account who the ID belongs to, so it just compares the distinct IDs within a given industry. However, what I am trying to do is as follows. Suppose there the two managers in the Group are A and B. Let's assume that in Industry=8321, Manager A owns firms with IDs 1234 and 2345, whereas Manager B owns 3514 and 7214. Then all of these rows will get a Column=1. Your code covers such cases. What I also need is the following case. Suppose in Inudstry=3152, Manager A owns firms with IDs 1884 and 1288, wheres Manager B owns firms with IDs 1884, 1288, and 8321. In this case, what I need is that Control=1 for all of these rows. This case makes it a little bit trick, otherwise I could have just counted distinct IDs by MGRNO and Industry and generate Control=1 if they are all distinct.

In conclusion, Control=1 for two cases: (1) Whenever IDs are distinct within a given Industry and for the MGRNO pair (not just the Group, but rather Group + MGRNO) and (2) Whenever there are duplicate IDs within a given Industry for the particular Group + MGRNO, then if the duplicates are at least two (e.g., 1884 and 1288), then all of the rows of that Industry get a 1 for that particular Group. In any other case, Control=0.

I am trying to be as clear as possible. Although it may seem simple at first glance, when these two rules (stated above) need to be considered for a given Group and the "distinct" / "distinction" has to be idenfitied by Group + MGRNO, then it becomes tricky. At least I got stuck. Thanks again for trying to help out.

Shmuel · Posted 06-14-2017 06:30 AM

Assuming you are relating to our previous conversation, as in link:

https://communities.sas.com/t5/Base-SAS-Programming/Row-by-row-comparison-within-group/m-p/364832/hi...

You have missed some code accepted then. Try 3rd version:

data want_V3;
	set want_V1;
	by Group Industry ID;          /* ID added */
	retain lagID;
        if first.ID and last.ID;       /* line added */
	if lagID^=ID then Possible=1;
	lagID=ID;
run;

If you still have issues please post desired results of your current sample, and point to lines

that need change.

It is difficult to understand whats wrong from your post.

Yegen · Posted 06-14-2017 07:25 AM

Thanks @Shmuel for your reply. That particular code does not work well in this setting since it just looks for industries that have at least two IDs whereas now I am trying to analyze whether a manager owns an ID in a given industry and the other manager holds a different ID from the same industry. In the current setting, I am trying to determine:

Within a given group, do both managers have different IDs in a given industry? If yes, then Control=1 for those identified rows. However, the tricky part is that if a manager has an identical ID as the other manager but has no other IDs within the same industry, then Control=0 for all of the IDs of that industry (i.e., Control=0 for all of these identified rows that are in the same industry). For instance, Manager 1 may have ID 1234 and Manager 2 may have 1234 and 2341 where all of these IDs are in the same industry, then Control=0 for all of these rows. But if Manager 1 had 1234 and Manager 2 had only 2341, then Control=1. Or, if Manager 1 would have had 1234 and 2341, whereas Manager 2 had 4251 and 4159 (where all of these IDs are in the same industry), then Control=1. The simple case is that Control=1 whenever Manager 1 has completly different IDs compared to Manager 2 where both Managers are in the same Group. There is only one exception though making the case a little bit tricky, which is if there are at least two identical IDs (e.g., she owns 2351, 8451, and 9999) owned by a given Manager that the other Manager also owns in addition to other IDs (e.g., 2351, 8451, and 8521), then all rows get a Control=. This is the next case.
Within a given group, does one of the managers have at least two IDs which are in the same industry that the other manager also owns (i.e., do both managers have at least two identical IDs that are in the same Industry)? If yes, then Control=1 for those identified rows.
Else, Control=0.

Here is what the correct output looks like, I added comments about the cases described above next to the Controls:

DATA want;
 INPUT Mgrno Group ID Industry Control;
 DATALINES; 
 9999 1 00462610 1842 0
 9999 1 01737210 1842 0
 9999 1 12690310 2832 0
 8888 1 12690310 2832 0 
 8888 1 12690311 2832 0 
 8888 1 37158620 2124 0 
 9999 1 23298321 4238 1 
 9999 1 53228424 4238 1 
 8888 1 53228424 4238 1
 8888 1 23298321 4238 1 
 9999 1 12290311 1233 1 
 8888 1 01737211 1233 1 
 8888 1 99998881 1233 1
 7777 2 12690310 2832 1 
 3333 2 32690311 2832 1 
 7777 2 37158620 2124 0 
 7777 2 38138222 3321 1
 3333 2 44134422 3321 1 
 7777 2 53228424 4238 1 
 3333 2 77233421 4238 1 
 3333 2 12290311 1233 0 
 3333 2 93234320 1333 0 
;
RUN;

UPDATED: Please see below a screenshot with the comments. When I past the code here, it becomes messy and not readable.

Here is the sample which is the same one above except that I removed the control and pair columns:

DATA have;
	INPUT Mgrno Group ID Industry ;
	DATALINES; 
	9999 1 00462610 1842 
	9999 1 01737210 1842 
	9999 1 12690310 2832 
	8888 1 12690310 2832 
	8888 1 12690311 2832 
	8888 1 37158620 2124 
	9999 1 23298321 4238 
	9999 1 53228424 4238 
	8888 1 53228424 4238 
	8888 1 23298321 4238 
	9999 1 12290311 1233 
	8888 1 01737211 1233 
	8888 1 99998881 1233 
	7777 2 12690310 2832 
	3333 2 32690311 2832
	7777 2 37158620 2124 
	7777 2 38138222 3321 
	3333 2 44134422 3321 
	7777 2 53228424 4238 
	3333 2 77233421 4238  
	3333 2 12290311 1233 
	3333 2 93234320 1333  
;
RUN;

Shmuel · Posted 06-14-2017 11:00 AM

Let me understand the hierarchy and relations in the whole picture.

Looking in the data it seems to me (please correct):

Industry (e.g. 2832) may be composed of several firm IDs (12690310, 12690311 3269311)

Same industry (2832) may be divided into groups (1 , 2)

Same group (e.g. 1) may be composed of more then one industry (1842, 1233 ,...)

Same firm ID (12690310 ) may have more than one manager (7777, 8888, 9999)

but belongs to one industry only

Same manager (e.g. 7777) may serve in several industries (2832, 3321, 4238)

Same group-industry-id (1-2832-12690310) may be managed by more than one managers (8888, 9999)

Now let's define what are you focusing on - please correct and complete:

- managers (two or more) in same industry with all different firm IDs ( assigned with control=1)

- managers in the same group and same industries but completely different IDs (assigned with control=. ?)

- etc.

Please explain what is a GROUP in your data? what is composed of (industries ? employees ? other ? )

Yegen · Posted 06-14-2017 02:51 PM

Exactly, that is the correct hierarchy. But I would reverse the order to see a little bit:

There are multiple groups and each of these groups have a different pair of Managers (e.g., in Group 1 there are Managers 9999 and 8888, whereas in Group 3 there are Managers 9999 and 3333, so the pairs of managers are always distinct).
Each Group is composed of multiple industries, but a Manager may or may not serve in a given industry that belongs to that particular group (these industries are irrelevant and all rows with that industry should get a Control=0).
Within a given industry there may be several firms (whenever there is only one firm in a given industry, Control=0) and some of these firms are managed either BY one of them or both of the managers. It is possible that for a given industry, one of the managers may control only one ID and the other manager may control that particular iD as well in addition to managing other IDs. In this extreme example, all of the firms in that particular (firm) industry get a Control=0.

So the analysis is done at the Group-Industry level. Within each Group, there are only two distinct managers and some manager may or may not control IDs in a given industry. The key is to compare the IDs within a given industry that one manager controls to the IDs within the same industry (whenever avaiable in the same industry) that the other manager controls (any IDs of industries that are managed by only one manager get a Control=0).

Exactly, @Shmuel the rule to assing Control=1 is as follows:

- managers (two or more) in same industry with all different firm IDs ( assigned with control=1)
-> The comparison of firm IDs is at the group-industry level, so there are at most two managers in a given Group-Industry pair and all manager pairs are distinct (so there are never duplicats at the Group+MGRNO+ID level). So it is correct, Control=1 for group-industries with all different firm IDs.

- managers in the same group and same industry(ies) but completely different IDs (assigned with control=. ?)

-> In this case, we assign Control=1 whenever there are at least two identical firm IDs within the same group and same industry (but there could be also at least two identical firm IDs and other non-identical firm IDs within a given group-industry pair, we still assign Control=1).
However, if one of the managers controls only one firm in a given group and industry pair (but no other firms in that particular group-industry) and at the other manager controls the same ID (regardless of whether she managers other different IDs or not within the same industry), then control=0 for all IDs in that particular group-industry.
So whenever there is a duplicate firm ID in a group-industry pair and at least one of the managers controls that particular firm ID in that group-industry, then for all IDs in that group-industry Control=0.
However, whenever there are at least two or more duplicate IDs in a given group-industry pair or there is only one firm ID duplicate in that particular group-industry but both managers control more than two firm IDs, then Control=1. Every other case, Control=0.

I have edited your answer, marked in red. I'll be able to deal with next week. Shmuel.

Yegen · Posted 06-14-2017 03:14 PM

Here is a graph I just draw with the set of IDs managed by one of the two managers in a given Group.

All firms in Industry=1233 and Industry=1842 will get Contro=1 assigned. The reason why all firm IDs in Industry=2832 will get assigned Control=0 is that one of the managers was only controling a signle duplicate firm in that particular Industry and no other firms. This analysis is done at the Group level (i.e., distinct manager pair). It is possible that Manager 9999 may be in Group 1 and at the same time in Group 7, but the other Managers in Group 1 and 7 will be different (so the pairs are always distinct but the industry-manager pairs or industry-manager-firm_ID may not be distinct).

Yegen · Posted 06-14-2017 05:39 PM

@Shmuel, thanks for your helpful edit/comment in my reply. In the meantime, I was able to write a code that produces the output that is needed. I tried to figure out a way to just keep the rows where Control=1.

Is there an alternative way of doing this? I have tried to construct the steps such that I am only left with the control=1 rows. I think the sample captures fairly well all possible cases (as depicted in the picture I have posted earlier). I am really in rush since I have to prepare a report (for a academic research project) due this Friday that relies primarily on this identification. Any suggestion will be greatly appreciated! 🙂

DATA want;
 INPUT Mgrno Group ID Industry Control;
 DATALINES; 
 9999 1 00462610 1842 0
 9999 1 01737210 1842 0
 9999 1 12690310 2832 0
 8888 1 12690310 2832 0 
 8888 1 12690311 2832 0 
 8888 1 37158620 2124 0 
 9999 1 23298321 4238 1 
 9999 1 53228424 4238 1 
 8888 1 53228424 4238 1
 8888 1 23298321 4238 1 
 9999 1 12290311 1233 1 
 8888 1 01737211 1233 1 
 8888 1 99998881 1233 1
 7777 2 12690310 2832 1 
 3333 2 32690311 2832 1 
 7777 2 37158620 2124 0 
 7777 2 38138222 3321 1
 3333 2 44134422 3321 1 
 7777 2 53228424 4238 1 
 3333 2 77233421 4238 1 
 3333 2 12290311 1233 0 
 3333 2 93234320 1333 0 
;
RUN;

*Deleting the industries with only one MGRNO;
proc sql;
	create table work.output_V1 as 
	select *
 	from want
 	group by group, industry
   	having count(distinct mgrno) ne 1;
quit;

*Counting how many IDs a particular MGRNO controls in a given industry;		
proc summary nway;
	class mgrno group industry;
	var ID;
	output out=number(drop=_:) 
	n=count_identifer;
run;

*Adding back to the sample the num of IDs a manager controls in a given industry;
proc sql;
	create table work.output_V2 as 
	select * 
	from work.output_V1 as a 
	left join work.number as b 
	on a.mgrno=b.mgrno & a.group=b.group & a.industry=b.industry;
quit;

*Getting the duplicates;	
proc sort data=work.output_V2 nouniquekeys uniqueout=singles out=dup;
	by group industry id;
quit;

*Identifying the duplicates;
data work.duplicates;
	set dup;
	duplicates=1;
run;

proc sql;
	create table duplicates_final as
	select distinct group, industry, id, duplicates
	from duplicates
quit;

proc sql;
	create table output_V3 as 
	select * 
	from work.output_V2 as a 
	left join work.duplicates_final as b 
	on a.group=b.group & a.industry=b.industry & a.id=b.id;
quit;

proc stdize data = work.output_V3 out=work.output_V3 reponly missing=0;
	var duplicates;
quit; 

proc sql; 
	delete from work.output_V3 where count_identifer=1 & duplicates=1;
quit;


proc sql;
	create table work.output_Final as 
	select *
 	from work.output_V3
 	group by group,industry
   	having count(distinct mgrno) ne 1;
quit;

gamotte · Posted 06-15-2017 04:35 AM

Hello,

I haven't read all your explanations because I'm lazy but the following program seems to give the expected output :

proc sql noprint;
	/* We exclude industries associated with only one manager */
	CREATE TABLE Sample_wo_1_mgr_industries
	AS SELECT *
	FROM Sample
	GROUP BY Group, Industry
	HAVING count(DISTINCT MGRNO) ne 1;

	/* We identify IDs having both managers in the group */
	CREATE TABLE CNT_MGR AS
	SELECT DISTINCT MGRNO, GROUP, INDUSTRY, ID, count(*)=2 AS MGR2
	FROM Sample_wo_1_mgr_industries
	GROUP BY Group, Industry, ID
	;

	/* We set Control=1 if there is not exactly one ID prsent for both manager in the given Group/Industry , 0 otherwise */
	CREATE TABLE Sample_with_Control AS
	SELECT a.*, sum(b.Control,0) AS Control
	FROM Sample a
	LEFT JOIN (
		SELECT DISTINCT GROUP, INDUSTRY, sum(MGR2) AS SUM_MGR2, 1 AS Control
		FROM CNT_MGR
		GROUP BY Group, Industry
		HAVING sum(MGR2) ne 2
	) b
	ON b.Group=a.Group AND b.Industry=a.Industry
	ORDER BY a.Group, a.Industry, a.Mgrno, a.ID
	;
quit;

Yegen · Posted 06-17-2017 07:02 PM

That was such a wonderful SQL code, @gamotte! I really like it. Thanks so much for proposing an alternative way of doing it.

Cheers

Shmuel · Posted 06-16-2017 02:50 AM

@Yegen, as you find your own solution and got proposed code from @gamotte,

I would like to concentrate on method of represnting the designt of needs.

As a programmer, whose first language is not English, it is difficult to me to follow long text description,

therefore I liked your diagram.

1) relating to the diagram I miss group code.
Did you separate firm PSA from other firms in 1842 industry because of different group ?

2) I have noticed few categories in your data and summerized them in a table:

Category	Group	Industry	FirmID	Count_Mgrs	Count 2+	Control	Industry e.g.
A	any	any	any	1		0	2124
B	any	any	any	2+		see below	1842
C	any	any		max=1		1	1233
D	any	any		max=2+	1	0	2832
E	any	any		max=2+	2+	1	1842

I believe that such table makes work much easier, to check logic and to code the program.

You are invited to check the table and correct any error (my missunderstanding). Regards.

Yegen · Posted 06-17-2017 07:07 PM

@Shmuel, thanks so much for all of your helpful replies. I am learning a lot from you on how to improve the code and ask a question in a better, clearer way! I will definetly make more often use of diagrams in the future rather than posting long texts.

Regarding point (1), apologies I just realized that the industry code for that particular firm was correctly written in the diagram.

Regarding point (2), the table captures well the data except that the rows are by Firm ID + MGRNO. So there can be rows where there is the same Firm ID with different MGRNOs.

Thanks again.

Best regards

Shmuel · Posted 06-18-2017 01:30 AM

To make myself more clear, I have copied the decision table and marked:

- in green the BY variables

- in orange the computed variables

of caurse rows are by FirmID and MgrNO. I'm counting mannagers (count_mgrs) per Group-Industry_FirmID

in order to decide what to assign to the control variable. Then I check the counter per Group-Industry.

Categories C D E are sub categories of B and may be written as: B1 B2 B3 respectively.

Category	Group	Industry	FirmID	Count_Mgrs	Count 2+	Control	Industry e.g.
A	any	any	any	1		0	2124
B	any	any	any	2+		see below	1842
C	any	any		max=1		1	1233
D	any	any		max=2+	1	0	2832
E	any	any		max=2+	2+	1	1842

Yegen · Posted 06-18-2017 01:55 AM

Exactly, that is correct. The Count_MGRS is to count the number of managers in a given Group-Industry. I will shortly post the new code I thought of. I think that works better.

Classroom Training Available!