Solved: assigning group identifiers

sas_user_k · Posted 10-18-2016 09:47 AM

Hi. I would like to assign two different kinds of group identifiers to each obsevation, using the example data shown below, one for 'company by date' and another for 'worker' within 'company by date' subgroups, along with assigning the maximum number for 'worker' for a different column:

data temp;
input company $1-2 date worker $6;
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
;

And the final output should be look like one shown below:

SAS Output

company date worker company_loop worker_loop worker_num

AB	3	K	1	1	4
AB	3	K	1	1	4
AB	3	R	1	2	4
AB	3	R	1	2	4
AB	3	S	1	3	4
AB	3	T	1	4	4
AB	4	T	2	1	3
AB	4	T	2	1	3
AB	4	U	2	2	3
AB	4	U	2	2	3
AB	4	V	2	3	3
CF	4	A	3	1	2
CF	4	A	3	1	2
CF	4	B	3	2	2
CF	4	B	3	2	2
CF	5	T	4	1	2
CF	5	T	4	1	2
CF	5	E	4	2	2

These kinds of operations are relatively straightforward using Stata, but when it comes to SAS, I cannot come up with some possible intuitive solutions for those operations. Is there any simple solution to those operations? Thank you!

RW9 · Posted 10-19-2016 08:07 AM

You need to add a reset of the worker_loop at first.company_date:

data temp; 
  input company $1-2 date worker $6;
  company_date=cats(company,"_",date);
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
; 
run;

proc sort data=temp; 
  by company_date worker;
run;

proc sql;
  create table TMP as 
  select  COMPANY_DATE, 
          count(distinct WORKER) as WORKER_NUM
  from    TEMP
  group by COMPANY_DATE;

  create table TEMP2 as 
  select  A.*, 
          B.WORKER_NUM
  from    TEMP A
  left join TMP B
  on      A.COMPANY_DATE=B.COMPANY_DATE;
quit;

data temp3;
  set temp2;
  retain company_loop worker_loop;
  by company_date worker;
  if first.company_date then do;
    company_loop=sum(company_loop,1);
    worker_loop=0;
  end;
  if first.worker then worker_loop=sum(worker_loop,1);
run;

I would advise to not use Tabs in your code as this renders differently over different viewers - you can change this so that when you press tab it puts in X amount of spaces (in Base SAS it is under Tools->Options->Enhanced Editor->General Tab, insert spaces for tabs.

View solution in original post

RW9 · Posted 10-18-2016 09:57 AM

Yes, quite simple. In a datastep, do a by with all the group levels, and have each of your output variables as retain:

data want;
  set have;
  retain company_loop worker_loop worker_num;
  by company worker;
  if first.company then company_loop=sum(company_loop,1);
  if first.worker then worker_loop=sum(worker_loop,1);
run;

What I do not understand is what worker_num is, and how it is defined?

sas_user_k · Posted 10-18-2016 11:57 AM

Thank you for your response, but there are three emaining issues for that problem:

1) First, I would like to use unique combinations of 'company by date' for assigning the first group identifiers to each observation, not just 'company', so 'company: AB - date: 3' combination should be different from 'company: AB - date: 4' combination.

2) Within each subgroup defined by unique 'company-date' combination, I would like to assign unique identifer for each worker.

3) As for the 'worker_num', it is the number of unique workers within a certain 'company-date' pair subgroup. For example, within the subgroup shown below ('company: AB - date: 4' combination subgroup), there are three unique workers ('T', 'U', 'V') within this subgroup. Thank you!

AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V

RW9 · Posted 10-18-2016 12:25 PM

Well, these two points:

1) First, I would like to use unique combinations of 'company by date' for assigning the first group identifiers to each observation, not just 'company', so 'company: AB - date: 3' combination should be different from 'company: AB - date: 4' combination.

2) Within each subgroup defined by unique 'company-date' combination, I would like to assign unique identifer for each worker.

Are both covered by one answer - create a variable which contains your unique by group, then replace company in my datastep, with this new variable which contains the distinct group.

The second part would be a separate step, you need to find the number of workers per group and join this back to the main dataset, then use that in the datastep, maybe something like:

proc sql;
  create table TMP as
  select COMPANY_DATE,
         count(distinct WORKER) as NUM_WORKERS
  from   HAVE
  group by COMPANY_DATE;

  create table HAVE1 as
  select A.*,
         B.NUM_WORKERS
  from   HAVE A
  left join TMP B
  on     A.COMPANY_YEAR=B.COMPANY_YEAR;
quit;

data want;
  set have;
  retain company_loop worker_loop worker_num;
  by company_date worker;
  if first.company_date then company_loop=sum(company_loop,1);
  if first.worker then do;
    worker_loop=sum(worker_loop,1);
    num_workers=num_workers-1;
  end;
run;

sas_user_k · Posted 10-18-2016 02:18 PM

(I have no idea why my reply keeps being deleted after posting it. I upload it again.)

Thank you for your answer, but I have one final question. I would like to assign unique 'worker' identifiers ('worker_loop') within a certain 'company-date' combination subgroup ('company_loop'). For example,

company date company_loop worker worker_loop

AB 3 1 K 1

AB 3 1 R 2

AB 3 1 S 3

AB 3 1 T 4

AB 4 2 T 1

AB 4 2 U 2

AB 4 2 V 3

CF 4 3 A 1

CF 4 3 B 2

...

As you can see above, the 'worker_loop' should restart to 1 when the subgroup ('company_loop') is changed while taking into account unique 'worker' cases within the subgroup. Is there any way that I can assign identfiers within the subgroup? Thank you!

RW9 · Posted 10-19-2016 04:26 AM

Ok, so what have you tried? The code here:

data want;
  set have;
  retain company_loop worker_loop worker_num;
  by company worker;
  if first.company then company_loop=sum(company_loop,1);
  if first.worker then worker_loop=sum(worker_loop,1);
run;

Shows you everythnig you need to know about grouping the data, retain counts, and you just need to manipulate your data to get this cmpany date variable.

sas_user_k · Posted 10-19-2016 07:38 AM

Thank you but the code you suggested doesn't seem to work properly. Here are the codes that I have tried based on your previous recommendations and the corresponding results. The code blocks #(1) and (2) work properly, but #(3) doesn't. Please pay attention to the code block #(3).

data temp; 
input company $1-2 date worker $6;
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
; 

proc sort data=temp; 
by company date worker; 

/* (1) For assining 'company by date' combination identifiers */
data temp2;
  set temp;
  retain company_loop;
  by company date worker;
  if first.date then company_loop=sum(company_loop,1);
run;

/* (2) For assigning the number of unique workers withint each 'company by date' subgroup' */
proc sql;
	create table tmp
	as select company_loop, count(distinct worker) as worker_num
	from temp2
	group by company_loop;

	create table temp3
	as select a.*, b.worker_num
	from temp2 a left join tmp b
	on a.company_loop=b.company_loop;
quit;

proc sort data=temp3;
	by company_loop worker;
run;

/* (3) For assigning ID for each unique 'worker' within each 'company by date' subgroup - doesn't work */
data temp4;
	set temp3;
	retain worker_loop;
	by company_loop worker;
	if first.worker then worker_loop=sum(worker_loop,1);
run;

proc print data=temp4;
run;

Here is the result. The columns 'company_loop' and 'worker_num' are the ones that I want to derive, but for the column 'worker_loop', it's not.

SAS Output

The SAS System

company date worker company_loop worker_num worker_loop

AB	3	K	1	4	1
AB	3	K	1	4	1
AB	3	R	1	4	2
AB	3	R	1	4	2
AB	3	S	1	4	3
AB	3	T	1	4	4
AB	4	T	2	3	5
AB	4	T	2	3	5
AB	4	U	2	3	6
AB	4	U	2	3	6
AB	4	V	2	3	7
CF	4	A	3	2	8
CF	4	A	3	2	8
CF	4	B	3	2	9
CF	4	B	3	2	9
CF	5	E	4	2	10
CF	5	T	4	2	11
CF	5	T	4	2	11

RW9 · Posted 10-19-2016 08:07 AM

You need to add a reset of the worker_loop at first.company_date:

data temp; 
  input company $1-2 date worker $6;
  company_date=cats(company,"_",date);
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
; 
run;

proc sort data=temp; 
  by company_date worker;
run;

proc sql;
  create table TMP as 
  select  COMPANY_DATE, 
          count(distinct WORKER) as WORKER_NUM
  from    TEMP
  group by COMPANY_DATE;

  create table TEMP2 as 
  select  A.*, 
          B.WORKER_NUM
  from    TEMP A
  left join TMP B
  on      A.COMPANY_DATE=B.COMPANY_DATE;
quit;

data temp3;
  set temp2;
  retain company_loop worker_loop;
  by company_date worker;
  if first.company_date then do;
    company_loop=sum(company_loop,1);
    worker_loop=0;
  end;
  if first.worker then worker_loop=sum(worker_loop,1);
run;

I would advise to not use Tabs in your code as this renders differently over different viewers - you can change this so that when you press tab it puts in X amount of spaces (in Base SAS it is under Tools->Options->Enhanced Editor->General Tab, insert spaces for tabs.

sas_user_k · Posted 10-19-2016 08:11 AM

Thank you so much for all the responses and tips! The results are now what I want to derive.

assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

Re: assigning group identifiers

AB	3	K	1	1	4
AB	3	K	1	1	4
AB	3	R	1	2	4
AB	3	R	1	2	4
AB	3	S	1	3	4
AB	3	T	1	4	4
AB	4	T	2	1	3
AB	4	T	2	1	3
AB	4	U	2	2	3
AB	4	U	2	2	3
AB	4	V	2	3	3
CF	4	A	3	1	2
CF	4	A	3	1	2
CF	4	B	3	2	2
CF	4	B	3	2	2
CF	5	T	4	1	2
CF	5	T	4	1	2
CF	5	E	4	2	2

AB	3	K	1	4	1
AB	3	K	1	4	1
AB	3	R	1	4	2
AB	3	R	1	4	2
AB	3	S	1	4	3
AB	3	T	1	4	4
AB	4	T	2	3	5
AB	4	T	2	3	5
AB	4	U	2	3	6
AB	4	U	2	3	6
AB	4	V	2	3	7
CF	4	A	3	2	8
CF	4	A	3	2	8
CF	4	B	3	2	9
CF	4	B	3	2	9
CF	5	E	4	2	10
CF	5	T	4	2	11
CF	5	T	4	2	11

AB	3	K	1	1	4
AB	3	K	1	1	4
AB	3	R	1	2	4
AB	3	R	1	2	4
AB	3	S	1	3	4
AB	3	T	1	4	4
AB	4	T	2	1	3
AB	4	T	2	1	3
AB	4	U	2	2	3
AB	4	U	2	2	3
AB	4	V	2	3	3
CF	4	A	3	1	2
CF	4	A	3	1	2
CF	4	B	3	2	2
CF	4	B	3	2	2
CF	5	T	4	1	2
CF	5	T	4	1	2
CF	5	E	4	2	2

AB	3	K	1	4	1
AB	3	K	1	4	1
AB	3	R	1	4	2
AB	3	R	1	4	2
AB	3	S	1	4	3
AB	3	T	1	4	4
AB	4	T	2	3	5
AB	4	T	2	3	5
AB	4	U	2	3	6
AB	4	U	2	3	6
AB	4	V	2	3	7
CF	4	A	3	2	8
CF	4	A	3	2	8
CF	4	B	3	2	9
CF	4	B	3	2	9
CF	5	E	4	2	10
CF	5	T	4	2	11
CF	5	T	4	2	11

Registration is open

AB	3	K	1	1	4
AB	3	K	1	1	4
AB	3	R	1	2	4
AB	3	R	1	2	4
AB	3	S	1	3	4
AB	3	T	1	4	4
AB	4	T	2	1	3
AB	4	T	2	1	3
AB	4	U	2	2	3
AB	4	U	2	2	3
AB	4	V	2	3	3
CF	4	A	3	1	2
CF	4	A	3	1	2
CF	4	B	3	2	2
CF	4	B	3	2	2
CF	5	T	4	1	2
CF	5	T	4	1	2
CF	5	E	4	2	2

AB	3	K	1	4	1
AB	3	K	1	4	1
AB	3	R	1	4	2
AB	3	R	1	4	2
AB	3	S	1	4	3
AB	3	T	1	4	4
AB	4	T	2	3	5
AB	4	T	2	3	5
AB	4	U	2	3	6
AB	4	U	2	3	6
AB	4	V	2	3	7
CF	4	A	3	2	8
CF	4	A	3	2	8
CF	4	B	3	2	9
CF	4	B	3	2	9
CF	5	E	4	2	10
CF	5	T	4	2	11
CF	5	T	4	2	11