Solved: Split data set into groups while clustering by a character variable

brianfpegg · Posted 09-08-2018 10:49 AM

I have a very large time series data set that won't process all at once. I'm trying to split it into 4 groups and process each group separately. However, I need each CUSTOMER to be in only one of the groups. CUSTOMER is a character variable. I tried the PROC SURVEYSELECT code below, but it gives an error; it seems I can't use GROUPS= and CLUSTER together. I also tried PROC RANK, but I can't use it to group the character field CUSTOMER.

PROC SURVEYSELECT DATA=TIME_SERIES_INPUT OUT=TIME_SERIES_OUTPUT GROUPS=4 SEED=20180908 outall;
CLUSTER CUSTOMER;
RUN;

This code gives the following error:

ERROR: A SAMPLINGUNIT statement may not be specified with the GROUPS= option.

Thanks!

PaigeMiller · Posted 09-08-2018 11:04 AM

I can't say that I understand your entire problem, but it ought to be easy to split data sets into groups such that each customer is in only one group. (And it's not clear to me where the clustering comes in). Here's one way to do this:

UNTESTED CODE

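/* One row per customer; ORDER BY guarantees the sort order the merge needs */
proc sql;
    create table customer_data_set as
        select distinct customer from have
        order by customer;
quit;

/* Assign each customer to one of four groups (0-3) in round-robin order */
data customer_data_set;
    set customer_data_set;
    group=mod(_n_,4);
run;

proc sort data=have;
    by customer;
run;

/* Merge the group back onto every record, so all rows for a customer share one group */
data want1 want2 want3 want4;
    merge have customer_data_set;
    by customer;
    if group=0 then output want1;
    else if group=1 then output want2;
    else if group=2 then output want3;
    else if group=3 then output want4;
run;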

--
Paige Miller

brianfpegg · Posted 09-08-2018 11:31 AM

The problem is there are several records for each customer. The primary key for the table is CUSTOMER and TIME. When you select distinct, you aren't merging back with the total data set. I could add a merge statement to do this, but I was looking for a more efficient solution given the size of this data.

Thanks for your help!

PaigeMiller · Posted 09-08-2018 11:32 AM

All records for each customer are kept together in this method.

--
Paige Miller
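
Editor's note: one way to confirm that claim on your own data is to stack the four outputs and look for any customer that appears with more than one group value. This is a minimal, untested sketch. It assumes the WANT1-WANT4 data sets from the code above still carry the GROUP variable, and ALL_GROUPS is a hypothetical view name chosen for illustration.

```sas
/* Stack the four outputs as a DATA step view, so no extra copy is written */
data all_groups / view=all_groups;
    set want1 want2 want3 want4;
    keep customer group;
run;

/* List any customer that landed in more than one group; no rows are expected */
proc sql;
    select customer, count(distinct group) as n_groups
    from all_groups
    group by customer
    having count(distinct group) > 1;
quit;
```

If the query prints nothing, every customer's records sit in exactly one of the four data sets.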

brianfpegg · Posted 09-08-2018 11:33 AM

Ah, I didn't see the merge in the last step. Thanks!
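
Editor's note: since the original concern was efficiency on a very large table, here is an alternative that skips the PROC SQL step and the merge by numbering customers during a single pass over the sorted data. It is an untested sketch and not the accepted solution. It assumes the input is named HAVE as in the reply above, and CUST_NO is a hypothetical counter variable introduced for illustration. The sort is still required and is likely the most expensive step.

```sas
/* BY-group processing needs the data sorted by CUSTOMER */
proc sort data=have;
    by customer;
run;

data want1 want2 want3 want4;
    set have;
    by customer;
    /* The sum statement retains cust_no across rows; it increases once per customer */
    if first.customer then cust_no + 1;
    /* Same round-robin rule as above: customers cycle through groups 1, 2, 3, 0 */
    group = mod(cust_no, 4);
    select (group);
        when (0) output want1;
        when (1) output want2;
        when (2) output want3;
        otherwise output want4;
    end;
    drop cust_no;
run;
```

Because GROUP changes only when CUSTOMER changes, all records for a customer go to the same output data set, as in the merge-based version.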
