Solved: Creating groups based on cumulative sum

ResegoM · Posted 04-14-2022 08:51 AM

Hi there,

I was wondering if someone could help me with some coding.
I have a dataset that I need help grouping. The highlighted column (BR_grp) is the required column. This is created by looking at the cumulative sum for "size_total", where a new group is created once the cumulative sum has reached 2.5%.

I have also named the groups in "BR_grp" according to the corresponding customer age (it is preferable to have it like this however I am open to other solutions). I have tried creating a counter with multiple arrays etc and nothing seems to work.

I did create the variable "cum_size1" which keeps "size_total" variable if the cumulative size is less than 2.5% and 0 for anything greater... maybe that might help?
I've attached an excel sheet below.
Thank you 🙂

Ksharp · Posted 04-14-2022 09:14 AM

data have;
infile cards expandtabs;
input CUST_AGE	Good	Bad	size_total :percent8.;
format size_total percent8.2;
cards;
.	859	151	4.00%
18	4	3	0.03%
19	14	7	0.08%
20	52	16	0.27%
21	107	13	0.47%
22	152	28	0.71%
23	261	51	1.23%
24	390	54	1.76%
25	379	85	1.84%
26	511	80	2.34%
27	748	108	3.39%
28	791	142	3.69%
29	828	146	3.85%
30	969	180	4.55%
31	1112	158	5.02%
32	1104	177	5.07%
33	1110	169	5.06%
34	1064	189	4.96%
35	1163	215	5.45%
36	1144	164	5.17%
37	970	157	4.46%
38	826	122	3.75%
39	716	146	3.41%
40	694	127	3.25%
41	620	90	2.81%
42	599	64	2.62%
43	542	81	2.46%
44	426	70	1.96%
45	415	75	1.94%
46	343	73	1.65%
47	353	51	1.60%
48	371	42	1.63%
49	351	32	1.52%
50	244	27	1.07%
51	208	34	0.96%
52	200	31	0.91%
53	184	20	0.81%
54	176	10	0.74%
55	141	14	0.61%
56	137	17	0.61%
57	121	13	0.53%
58	78	11	0.35%
59	60	6	0.26%
60	49	20	0.27%
61	24	3	0.11%
62	15	8	0.09%
63	28	3	0.12%
64	13	0	0.05%
65	14	0	0.06%
66	28	1	0.11%
67	12	0	0.05%
68	11	0	0.04%
69	16	5	0.08%
70	5	1	0.02%
71	10	0	0.04%
72	12	3	0.06%
73	1	0	0.00%
74	2	0	0.01%
75	2	0	0.01%
76	.	1	.
77	3	0	0.01%
79	3	0	0.01%
80	1	1	0.01%
;

data temp;
 set have;
if missing(CUST_AGE) then delete;
if cum>0.025 then do;group+1;cum=0;end;
cum+size_total;
run;

proc sql;
select *,min(CUST_AGE) as min,max(CUST_AGE) as max
 from temp
  group by group
order by 1;
quit;

View solution in original post

PaigeMiller · Posted 04-14-2022 09:07 AM

Many of us will not (or cannot) download Excel files, as they are a security threat. The proper way to provide your example data is via these instructions (or by creating the equivalent SAS data step code by typing it in yourself). We also can't use screen captures of data to write programs.

First, I would just create consecutive group numbers rather than trying to create 18_23 in one step. It seems you should be able to do that from your cum_size1 variable. Then ... use PROC SUMMARY to find the minimum and max age in each group, and from there you can create group names like 18_23.

Please explain why 43 is in a group by itself and not 43_44.

--
Paige Miller

Kurt_Bremser · Posted 04-14-2022 09:15 AM

From what I see, 43 satisfies the > 2.5% rule on its own, but 44 does not. So from 44 on, it should be

44-45

46-47

48-49

50-52

53-??

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PaigeMiller · Posted 04-14-2022 09:16 AM

I'm not following your logic

43 is 2.46% so it is not >2.5%

--
Paige Miller

Kurt_Bremser · Posted 04-14-2022 09:20 AM

Oh, you're right. Damn those small digits in the picture, I read 2.64. (maybe I also get old-age dyslexia)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ResegoM · Posted 04-14-2022 10:05 AM

Hi ,
I am a little new here so I will definitely type out code for my data step the next time I may have a query.
For now I did see a solution below that worked.
Thank you for taking the time to help 🙂

Kurt_Bremser · Posted 04-14-2022 09:11 AM

Since here is the SAS Communities and not the Excel support forum, you should supply data in SAS style, which means a data step with datalines, posted into a code box. All this is described in your welcome email (if the communities moderators told us the truth).

Since your group name is dependent on the first and last member of a group, you will probably need to a double DO loop, in the first you determine the start and end of a group, in the second you reread all these values and set the new variable, and OUTPUT.

Or you read through your dataset and create a CNTLIN dataset (only one pass is needed) for PROC FORMAT that formats the existing age groups into the cumulative age groups.

You might even be able to trick one of the STAT procedures into creating your groups.

What do you intend to do later on with the new group? SAS statistical procedures will honor formatted values in CLASS or similar statements, so the format is probably the way to go.

Once we have usable source data and know where you want to go with this, we can come up with code suggestions.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 04-14-2022 09:14 AM

data have;
infile cards expandtabs;
input CUST_AGE	Good	Bad	size_total :percent8.;
format size_total percent8.2;
cards;
.	859	151	4.00%
18	4	3	0.03%
19	14	7	0.08%
20	52	16	0.27%
21	107	13	0.47%
22	152	28	0.71%
23	261	51	1.23%
24	390	54	1.76%
25	379	85	1.84%
26	511	80	2.34%
27	748	108	3.39%
28	791	142	3.69%
29	828	146	3.85%
30	969	180	4.55%
31	1112	158	5.02%
32	1104	177	5.07%
33	1110	169	5.06%
34	1064	189	4.96%
35	1163	215	5.45%
36	1144	164	5.17%
37	970	157	4.46%
38	826	122	3.75%
39	716	146	3.41%
40	694	127	3.25%
41	620	90	2.81%
42	599	64	2.62%
43	542	81	2.46%
44	426	70	1.96%
45	415	75	1.94%
46	343	73	1.65%
47	353	51	1.60%
48	371	42	1.63%
49	351	32	1.52%
50	244	27	1.07%
51	208	34	0.96%
52	200	31	0.91%
53	184	20	0.81%
54	176	10	0.74%
55	141	14	0.61%
56	137	17	0.61%
57	121	13	0.53%
58	78	11	0.35%
59	60	6	0.26%
60	49	20	0.27%
61	24	3	0.11%
62	15	8	0.09%
63	28	3	0.12%
64	13	0	0.05%
65	14	0	0.06%
66	28	1	0.11%
67	12	0	0.05%
68	11	0	0.04%
69	16	5	0.08%
70	5	1	0.02%
71	10	0	0.04%
72	12	3	0.06%
73	1	0	0.00%
74	2	0	0.01%
75	2	0	0.01%
76	.	1	.
77	3	0	0.01%
79	3	0	0.01%
80	1	1	0.01%
;

data temp;
 set have;
if missing(CUST_AGE) then delete;
if cum>0.025 then do;group+1;cum=0;end;
cum+size_total;
run;

proc sql;
select *,min(CUST_AGE) as min,max(CUST_AGE) as max
 from temp
  group by group
order by 1;
quit;

ResegoM · Posted 04-14-2022 10:05 AM

Thank you,
I typed this out and it worked perfectly 🙂

Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Re: Creating groups based on cumulative sum

Registration is open

Registration is open

SAS Training: Just a Click Away