Hi. I would like to assign two different kinds of group identifiers to each obsevation, using the example data shown below, one for 'company by date' and another for 'worker' within 'company by date' subgroups, along with assigning the maximum number for 'worker' for a different column:
data temp;
input company $1-2 date worker $6;
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
;
And the final output should be look like one shown below:
SAS Output
AB | 3 | K | 1 | 1 | 4 |
AB | 3 | K | 1 | 1 | 4 |
AB | 3 | R | 1 | 2 | 4 |
AB | 3 | R | 1 | 2 | 4 |
AB | 3 | S | 1 | 3 | 4 |
AB | 3 | T | 1 | 4 | 4 |
AB | 4 | T | 2 | 1 | 3 |
AB | 4 | T | 2 | 1 | 3 |
AB | 4 | U | 2 | 2 | 3 |
AB | 4 | U | 2 | 2 | 3 |
AB | 4 | V | 2 | 3 | 3 |
CF | 4 | A | 3 | 1 | 2 |
CF | 4 | A | 3 | 1 | 2 |
CF | 4 | B | 3 | 2 | 2 |
CF | 4 | B | 3 | 2 | 2 |
CF | 5 | T | 4 | 1 | 2 |
CF | 5 | T | 4 | 1 | 2 |
CF | 5 | E | 4 | 2 | 2 |
These kinds of operations are relatively straightforward using Stata, but when it comes to SAS, I cannot come up with some possible intuitive solutions for those operations. Is there any simple solution to those operations? Thank you!
You need to add a reset of the worker_loop at first.company_date:
data temp; input company $1-2 date worker $6; company_date=cats(company,"_",date); datalines; AB 3 K AB 3 K AB 3 R AB 3 R AB 3 S AB 3 T AB 4 T AB 4 T AB 4 U AB 4 U AB 4 V CF 4 A CF 4 A CF 4 B CF 4 B CF 5 T CF 5 T CF 5 E ; run; proc sort data=temp; by company_date worker; run; proc sql; create table TMP as select COMPANY_DATE, count(distinct WORKER) as WORKER_NUM
from TEMP group by COMPANY_DATE; create table TEMP2 as select A.*, B.WORKER_NUM from TEMP A left join TMP B on A.COMPANY_DATE=B.COMPANY_DATE; quit; data temp3;
set temp2; retain company_loop worker_loop; by company_date worker; if first.company_date then do; company_loop=sum(company_loop,1); worker_loop=0; end; if first.worker then worker_loop=sum(worker_loop,1); run;
I would advise to not use Tabs in your code as this renders differently over different viewers - you can change this so that when you press tab it puts in X amount of spaces (in Base SAS it is under Tools->Options->Enhanced Editor->General Tab, insert spaces for tabs.
Yes, quite simple. In a datastep, do a by with all the group levels, and have each of your output variables as retain:
data want; set have; retain company_loop worker_loop worker_num; by company worker; if first.company then company_loop=sum(company_loop,1); if first.worker then worker_loop=sum(worker_loop,1); run;
What I do not understand is what worker_num is, and how it is defined?
Thank you for your response, but there are three emaining issues for that problem:
1) First, I would like to use unique combinations of 'company by date' for assigning the first group identifiers to each observation, not just 'company', so 'company: AB - date: 3' combination should be different from 'company: AB - date: 4' combination.
2) Within each subgroup defined by unique 'company-date' combination, I would like to assign unique identifer for each worker.
3) As for the 'worker_num', it is the number of unique workers within a certain 'company-date' pair subgroup. For example, within the subgroup shown below ('company: AB - date: 4' combination subgroup), there are three unique workers ('T', 'U', 'V') within this subgroup. Thank you!
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
Well, these two points:
1) First, I would like to use unique combinations of 'company by date' for assigning the first group identifiers to each observation, not just 'company', so 'company: AB - date: 3' combination should be different from 'company: AB - date: 4' combination.
2) Within each subgroup defined by unique 'company-date' combination, I would like to assign unique identifer for each worker.
Are both covered by one answer - create a variable which contains your unique by group, then replace company in my datastep, with this new variable which contains the distinct group.
The second part would be a separate step, you need to find the number of workers per group and join this back to the main dataset, then use that in the datastep, maybe something like:
proc sql; create table TMP as select COMPANY_DATE, count(distinct WORKER) as NUM_WORKERS from HAVE group by COMPANY_DATE; create table HAVE1 as select A.*, B.NUM_WORKERS from HAVE A left join TMP B on A.COMPANY_YEAR=B.COMPANY_YEAR; quit; data want; set have; retain company_loop worker_loop worker_num; by company_date worker; if first.company_date then company_loop=sum(company_loop,1); if first.worker then do; worker_loop=sum(worker_loop,1); num_workers=num_workers-1; end; run;
(I have no idea why my reply keeps being deleted after posting it. I upload it again.)
Thank you for your answer, but I have one final question. I would like to assign unique 'worker' identifiers ('worker_loop') within a certain 'company-date' combination subgroup ('company_loop'). For example,
company date company_loop worker worker_loop
AB 3 1 K 1
AB 3 1 K 1
AB 3 1 R 2
AB 3 1 R 2
AB 3 1 S 3
AB 3 1 T 4
AB 4 2 T 1
AB 4 2 T 1
AB 4 2 U 2
AB 4 2 U 2
AB 4 2 V 3
CF 4 3 A 1
CF 4 3 A 1
CF 4 3 B 2
...
As you can see above, the 'worker_loop' should restart to 1 when the subgroup ('company_loop') is changed while taking into account unique 'worker' cases within the subgroup. Is there any way that I can assign identfiers within the subgroup? Thank you!
Ok, so what have you tried? The code here:
data want; set have; retain company_loop worker_loop worker_num; by company worker; if first.company then company_loop=sum(company_loop,1); if first.worker then worker_loop=sum(worker_loop,1); run;
Shows you everythnig you need to know about grouping the data, retain counts, and you just need to manipulate your data to get this cmpany date variable.
Thank you but the code you suggested doesn't seem to work properly. Here are the codes that I have tried based on your previous recommendations and the corresponding results. The code blocks #(1) and (2) work properly, but #(3) doesn't. Please pay attention to the code block #(3).
data temp;
input company $1-2 date worker $6;
datalines;
AB 3 K
AB 3 K
AB 3 R
AB 3 R
AB 3 S
AB 3 T
AB 4 T
AB 4 T
AB 4 U
AB 4 U
AB 4 V
CF 4 A
CF 4 A
CF 4 B
CF 4 B
CF 5 T
CF 5 T
CF 5 E
;
proc sort data=temp;
by company date worker;
/* (1) For assining 'company by date' combination identifiers */
data temp2;
set temp;
retain company_loop;
by company date worker;
if first.date then company_loop=sum(company_loop,1);
run;
/* (2) For assigning the number of unique workers withint each 'company by date' subgroup' */
proc sql;
create table tmp
as select company_loop, count(distinct worker) as worker_num
from temp2
group by company_loop;
create table temp3
as select a.*, b.worker_num
from temp2 a left join tmp b
on a.company_loop=b.company_loop;
quit;
proc sort data=temp3;
by company_loop worker;
run;
/* (3) For assigning ID for each unique 'worker' within each 'company by date' subgroup - doesn't work */
data temp4;
set temp3;
retain worker_loop;
by company_loop worker;
if first.worker then worker_loop=sum(worker_loop,1);
run;
proc print data=temp4;
run;
Here is the result. The columns 'company_loop' and 'worker_num' are the ones that I want to derive, but for the column 'worker_loop', it's not.
SAS Output
The SAS System |
AB | 3 | K | 1 | 4 | 1 |
AB | 3 | K | 1 | 4 | 1 |
AB | 3 | R | 1 | 4 | 2 |
AB | 3 | R | 1 | 4 | 2 |
AB | 3 | S | 1 | 4 | 3 |
AB | 3 | T | 1 | 4 | 4 |
AB | 4 | T | 2 | 3 | 5 |
AB | 4 | T | 2 | 3 | 5 |
AB | 4 | U | 2 | 3 | 6 |
AB | 4 | U | 2 | 3 | 6 |
AB | 4 | V | 2 | 3 | 7 |
CF | 4 | A | 3 | 2 | 8 |
CF | 4 | A | 3 | 2 | 8 |
CF | 4 | B | 3 | 2 | 9 |
CF | 4 | B | 3 | 2 | 9 |
CF | 5 | E | 4 | 2 | 10 |
CF | 5 | T | 4 | 2 | 11 |
CF | 5 | T | 4 | 2 | 11 |
You need to add a reset of the worker_loop at first.company_date:
data temp; input company $1-2 date worker $6; company_date=cats(company,"_",date); datalines; AB 3 K AB 3 K AB 3 R AB 3 R AB 3 S AB 3 T AB 4 T AB 4 T AB 4 U AB 4 U AB 4 V CF 4 A CF 4 A CF 4 B CF 4 B CF 5 T CF 5 T CF 5 E ; run; proc sort data=temp; by company_date worker; run; proc sql; create table TMP as select COMPANY_DATE, count(distinct WORKER) as WORKER_NUM
from TEMP group by COMPANY_DATE; create table TEMP2 as select A.*, B.WORKER_NUM from TEMP A left join TMP B on A.COMPANY_DATE=B.COMPANY_DATE; quit; data temp3;
set temp2; retain company_loop worker_loop; by company_date worker; if first.company_date then do; company_loop=sum(company_loop,1); worker_loop=0; end; if first.worker then worker_loop=sum(worker_loop,1); run;
I would advise to not use Tabs in your code as this renders differently over different viewers - you can change this so that when you press tab it puts in X amount of spaces (in Base SAS it is under Tools->Options->Enhanced Editor->General Tab, insert spaces for tabs.
Thank you so much for all the responses and tips! The results are now what I want to derive.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.