Proc cluster outtree creates more observations than original input

JBHUI — Sat, 15 Feb 2020 00:26:22 GMT

Hello - I am running PROC CLUSTER. My input data has one row per AppID, with no missing values, and I am using the OUTTREE= option together with an ID statement (id AppID). However, the output data set is almost double the size of my original input data set. I noticed that the output data set contains new observations for the clusters, and AppID is missing on those rows. I could not find any documentation that explains what is happening. Could you please help explain this? I would like to use PROC TREE to prune the clusters, and this is causing errors. Thanks.

Re: Proc cluster outtree creates more observations than original input

Reeza — Sat, 15 Feb 2020 02:24:47 GMT

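This is expected, given how the OUTTREE= data set is structured: it has one row per node of the tree, so every original observation gets a row and every cluster formed by a merge gets a row as well. With n input observations that comes to n + (n - 1) = 2n - 1 rows. The cluster rows do not correspond to any single input observation, which is why AppID is missing on them.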
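
A quick way to check this on your own output, as a rough sketch only: it assumes the OUTTREE= data set is named tree and that the ID variable is AppID, and it relies on AppID being missing on the cluster rows, as described in the question. The column names is_cluster_row and n_rows are just illustrative.

```sas
/* Count rows for original observations versus rows for clusters.      */
/* Assumes the OUTTREE= data set is called TREE and the ID variable is */
/* AppID, which is missing on the rows that represent clusters.        */
proc sql;
   select missing(AppID) as is_cluster_row,   /* 0 = observation, 1 = cluster */
          count(*)       as n_rows
   from tree
   group by calculated is_cluster_row;
quit;
```

If the structure is as described, the group with is_cluster_row = 0 should match the number of rows in the input data, and the other group should be one fewer.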

Are you running into issues with the TREE procedure or something else?

Re: Proc cluster outtree creates more observations than original input

JBHUI — Tue, 21 Apr 2020 04:48:49 GMT

Thanks Reeza, and sorry for not thanking you sooner. I guess I am a little confused by the output of the OUTTREE= option. Suppose I have 75 observations in my original data set "dset" below. OUTTREE= produces a data set "tree" with 149 observations: the original 75, plus 74 additional observations for the clusters, whose names begin with "CL". Based on the output of the cluster procedure, I would like to cut the tree at 5 clusters and assign each of the original 75 observations a value from 1 to 5 identifying its cluster. How would I go about doing this? Thank you so much.

/* Ward's method; OUTTREE= saves the full tree (observations plus clusters) */
proc cluster data=dset method=ward ccc outtree=tree;
   id AppID;
   var x1 x2 x3 x4;
run;
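
For the question above, the usual route is to pass the OUTTREE= data set to PROC TREE with NCLUSTERS=5 and an OUT= data set, which should come back with one row per original observation and a numeric CLUSTER variable. The following is a sketch under assumptions rather than a tested answer: the data set names clus5 and dset_clustered are made up for illustration, and it assumes AppID is carried in the tree data set (it was the ID variable in PROC CLUSTER) so that it can be listed on the COPY statement.

```sas
/* Cut the tree at 5 clusters. OUT= should contain only the leaves      */
/* (the original observations), not the CLn rows.                       */
/* CLUS5 is a hypothetical output name.                                 */
proc tree data=tree nclusters=5 out=clus5 noprint;
   copy AppID;              /* carry the original ID into the OUT= data set */
run;

/* Optional: attach the cluster number back onto the original data.     */
proc sort data=clus5; by AppID; run;
proc sort data=dset;  by AppID; run;

data dset_clustered;        /* hypothetical name for the merged result  */
   merge dset clus5(keep=AppID cluster);
   by AppID;
run;
```

After this, a PROC FREQ on the cluster variable in dset_clustered is a reasonable sanity check that there are 5 distinct values and that the counts add up to the 75 input rows.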