I have a very large time series data set that is too big to process all at once. I'm trying to split it into 4 groups and process each one separately. However, each CUSTOMER must appear in only one of the groups. CUSTOMER is a character variable. I tried the PROC SURVEYSELECT code below, but it gives an error; it seems I can't use GROUPS= and CLUSTER together. I also tried PROC RANK, but I can't use it to group on the character variable CUSTOMER.
PROC SURVEYSELECT DATA=TIME_SERIES_INPUT OUT=TIME_SERIES_OUTPUT GROUPS=4 SEED=20180908 outall;
CLUSTER CUSTOMER;
RUN;
This code gives the following error:
ERROR: A SAMPLINGUNIT statement may not be specified with the GROUPS= option.
Thanks!
I can't say that I understand your entire problem, but it ought to be easy to split the data set into groups such that each customer is in only one group. (It's also not clear to me where the clustering comes in.) Here's one way to do this:
UNTESTED CODE
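proc sql;
   /* One row per customer; ORDER BY guarantees the sort order the MERGE below relies on */
   create table customer_data_set as
      select distinct customer from have
      order by customer;
quit;
data customer_data_set;
   set customer_data_set;
   group=mod(_n_,4);   /* cycles through 1, 2, 3, 0 */
run;
proc sort data=have;
   by customer;
run;
/* Every record of a customer picks up that customer's GROUP value */
data want1 want2 want3 want4;
   merge have customer_data_set;
   by customer;
   if group=0 then output want1;
   else if group=1 then output want2;
   else if group=2 then output want3;
   else if group=3 then output want4;
run;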
The problem is there are several records for each customer. The primary key for the table is CUSTOMER and TIME. When you select distinct, you aren't merging back with the total data set. I could add a merge statement to do this, but I was looking for a more efficient solution given the size of this data.
Thanks for your help!
All records for each customer are kept together in this method.
Ah, I didn't see the merge in the last step. Thanks!
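Since the concern was efficiency on a large table, here is a hedged, untested sketch of a variant that skips the separate customer list and the merge. It assumes HAVE is already sorted (or indexed) by CUSTOMER; if it is not, the PROC SORT step above is still needed first. The variable name GROUP_ID is made up for illustration. Like the code above, it assigns customers to groups round-robin in sort order, not at random.

```sas
/* Assumes HAVE is sorted by CUSTOMER; GROUP_ID is a hypothetical helper variable */
data want1 want2 want3 want4;
   set have;
   by customer;
   /* Sum statement: GROUP_ID is retained across rows and
      increases by 1 only on the first record of each customer */
   if first.customer then group_id + 1;
   /* All records of a customer share one GROUP_ID, so they land in the same output */
   select (mod(group_id, 4));
      when (0) output want1;
      when (1) output want2;
      when (2) output want3;
      otherwise output want4;
   end;
   drop group_id;
run;
```

This reads HAVE once and writes the four outputs in the same pass. Whether it is actually faster depends on whether the data already arrives sorted by CUSTOMER.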