Re: PROC SQL counts by group

sasecn · Posted 05-07-2019 12:50 PM

I am trying to count of tasks done by workers' id (id variable in the data). The tasks can have sub-tasks. If one works on main and sub tasks, it should only count as 1 task done. If one only works on sub task (without working on main task), it also should count as 1 task done. In the example, worker 1 works on tasks: a, a_s, b. So, 2 tasks done; worker 2 works on: a_s, b. So, also 2 tasks done (even not working on main task: a). My sql code only can capture worker 1's count, but not worker 2 since i cannot find a way to identify who works on sub tasks only.

See code below, please help, thanks!

data test;
  input id $ task_id $ sub_task;
datalines;
1 a 0
1 a_s 1
1 b 0
2 a_s 1
2 b 0
;
run;

proc sql;
  create table out as
    select *, 
	       count(distinct case when not sub_task then task_id end) as done
	from test
	group by id
;
quit;

novinosrin · Posted 05-07-2019 01:07 PM

Hello @sasecn Should your pattern of subtask follow a pattern such that it takes the value of main_task with a suffix _s , extracting the first char would make it easy. In any event, you can apply the same logic accordingly

So



data test;
  input id $ task_id $ sub_task;
datalines;
1 a 0
1 a_s 1
1 b 0
2 a_s 1
2 b 0
;
run;

proc sql;
create table want as
select *,count( distinct first(task_id)) as done
from test
group by id;
quit;

sasecn · Posted 05-07-2019 01:46 PM

Hello:

Sorry for the mistake. I tried to make the code simple. But the real task id has more chars. e.g. the main task id is 2019-ABC-MAIN-1234; the sub task is 2019-ABC-MAIN-1234-S001 (or can be 2019-ABC-MAIN-1234-S001-S002). It is true that the sub task always like "main task id"-S001 ( or ....-S001-S002-... ).

I don't think the first() works. Sorry for making it more complex.

novinosrin · Posted 05-07-2019 01:49 PM

I expected something like that. Can you post a comprehensive sample so that the very logic can be adjusted?

The logic is still the same i.e to get the grouped pattern and count the distinct. Now you need to let us know the values that's a good representative of your sample.

sasecn · Posted 05-07-2019 02:12 PM

Okay, i tried to manage to get the last 4 digits from my task_id, and now the ids will be 4 digits followed by sub numbers, see example code. There is no special pattern in the 4 digits. Hope this make sense. So, worker 1 should have 7 tasks done.

data test;
   informat id task_id $25. sub_task;
   input id task_id sub_task;
   datalines;
1 1234 0
1 1234S001 1
1 1234S001S002 1
1 3456 0
1 9685 0
1 3648 0
1 8888 0
1 2222 0
1 3366 0
2 1234S001 1
2 5678 0
;
RUN;

novinosrin · Posted 05-07-2019 02:21 PM

HI @sasecnThank you for the response

"now the ids will be 4 digits followed by sub numbers"

Does this mean a Task or a subtask will conform to a given 4 digit number? Example 1234 series? If this is true, this in itself is a pattern.

"There is no special pattern in the 4 digits"

I don't think we are looking at any special pattern. The logic is basically to group Taskid and sub task into one group. That's all.

sasecn · Posted 05-07-2019 02:32 PM

You are right that the task id is always 4 digits number + sub number. The sub number could be S001 or S001S002, .... but always ended as SXXX. The reason i used distinct key work is because there are some duplicates records e.g.

1 1234 0

My problem is that i cannot find a way to handle workers who only works on sub tasks but not the main one. e.g. worker 2 works on task 1234S001, it should count as 1 independent task. But my original code miss-count this.

Does this make sense?

novinosrin · Posted 05-07-2019 02:40 PM

"You are right that the task id is always 4 digits number "

OK, so this goes back to my original idea, where in the same first 4 digits of all tasks will form one group. Sub numbers are not relevant to this logic

data test;
   informat id task_id $25. sub_task;
   input id task_id sub_task;
   datalines;
1 1234 0
1 1234S001 1
1 1234S001S002 1
1 3456 0
1 9685 0
1 3648 0
1 8888 0
1 2222 0
1 3366 0
2 1234S001 1
2 5678 0
;
RUN;

proc sql;
create table want as
select *, count(distinct substr(task_id,1,4)) as done
from test
group by id;
quit;

sasecn · Posted 05-07-2019 03:13 PM

Alright, i got your idea. Thx, that works in my situation. Now, i understand what you meant the pattern. I will run into other format of the task ids. Hope this does not bother you too much.

I will have to deal with another situation that have to handle different format of ids. For example, the main id will be either 4 digits or 8 digits, and this will be followed by another 4 digits that identify the status of required to redo the task. e.g.

for task 1234, required redo it then the full task id will be 1234R001, for task 1234S001 (8 digits main id) required redo, full id is 1234S001R001. Again, i want to count the tasks done, if worker 1 works on 1234 and 1234R001, this counts as 1 task done. if worker 1 works 1234S001 and 1234S001R001, this also counts as 1. if worker 2 only works on 1234R001, this counts as 1.

I will still have problems to count worker 2. What will be the idea to handle this? Sorry to bother you a lot.

data test;
   informat id task_id $25. redo_task;
   input id task_id redo_task;
   datalines;
1 1234 0
1 1234R001 1
1 1234S001 0
1 1234S001R002 1
1 3456 0
1 9685 0
1 3648 0
2 1234S001R001 1
2 1234R001 1
2 5678 0
;
RUN;

proc sql;
  create table out as
    select *, 
	       count(distinct case when not redo_task then task_id end) as done
	from test
	group by id
;
quit;

novinosrin · Posted 05-07-2019 03:37 PM

Hello @sasecn No bother at all.

Basically to give you a clear pic of how the logic works is

Taking your latest sample,

Is the following your expected result ?

Obs	id	task_id	t	grp
1	1	1234	1234	1
2	1	1234R001	1234	1
3	1	1234S001	1234	1
4	1	1234S001R002	1234	1
5	1	3456	3456	2
6	1	9685	9685	3
7	1	3648	3648	4
8	2	1234S001R001	1234	1
9	2	1234R001	1234	1
10	2	5678	5678	2

sasecn · Posted 05-07-2019 05:04 PM

Sorry, rush into something else. Will get back to you later.

sasecn · Posted 05-08-2019 10:22 AM

Sorry for the late reply. I want to count the numbers of tasks done. If worker works on main and re-do task, this only count as 1. e.g. worker 1 works on task: 1234 and 1234R001, this counts as 1 task done. e.g. worker 1 works on task: 1234S001 (note that main task id can be 4 or 8 or more digits) and 1234S001R001, this counts as 1 task done. e.g. worker 2 works only on task: 1234R001, this counts as 1 task done (consider as independent task even worker 2 didn't work on main task 1234).

So, the expected result will be worker 1 has 5 tasks done, worker 2 has 3 tasks done.

novinosrin · Posted 05-08-2019 11:18 AM

Hello @sasecn Looks like the problem here is for us to first identify what constitutes to Independent parent task, and it's sub task.

So taking this example

Obs	id	task_id	t	grp
1	1	1234	1234	1
2	1	1234R001	1234	1
3	1	1234S001	1234	1
4	1	1234S001R002	1234	1

When you say (note that main task id can be 4 or 8 or more digits) , do you have a classification like the letter S denotes subtask or otherwise? I need your help in determining the classification. That's the reason I presented Grouping GRP in the above. Also, when you say 4 or 8 digits. I see there digits(should mean numbers from 0-9) and characters. Once we have a clear understanding, we can proceed with the logic.

sasecn · Posted 05-08-2019 11:48 AM

To make it simple, we can ignore the letter S, say main task id can be 4 or 8 digits, e.g. 1234 or 12345678. We only need to identify if the main id followed by R001 or R002 always 4 digits to indicate that the task has been re-done. e.g. 12345678R001 (same as previous 1234S001) meaning this is re-do task of 12345678. If a worker works on both main and re-do task, this only counts as 1 task done; if a worker only works on main or only works on re-do task (consider as independent task), this counts as 1 task done.

e.g.

if a worker works on 12345678 and 12345678R001, this counts as 1 task done; if a worker works only on 12345678, this counts as 1 task done; if a worker works only on 12345678R001 (without working on 12345678), this also counts as 1 task done.

Hope this is clear to you.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!