topic Re: Reproducibility of the results - Hp Clus Procedure in SAS Enterprise Miner in SAS Data Science

Reproducibility of the results - Hp Clus Procedure in SAS Enterprise Miner

Hss_45 — Fri, 31 Dec 2021 18:51:48 GMT

I would like to ask about a reproducibility problem with the HP Clus procedure in SAS Enterprise Miner. I cannot reproduce my results when I run the algorithm again with the same seed and the same hyper-parameters. I have also tried saving the diagram as .xml and the whole path as SAS code and running them again, but I still did not get the same results.

I would like to ask the following:

- Is there any way to reproduce the same clustering solution from the HP Clus procedure in SAS Enterprise Miner?

- To what extent is it tolerable not to reproduce exactly the same results from HP procedures? That is, is there an accepted level of difference between clustering solutions from HP Clus runs started with the same seed and the same hyper-parameters?

- Does SAS provide any solution for the reproducibility problems of HP procedures in SAS Enterprise Miner?

- Is the reproducibility problem in HP procedures due to the parallel-processing infrastructure?

Re: Reproducibility of the results - Hp Clus Procedure in SAS Enterprise Miner

sbxkoenk — Sat, 01 Jan 2022 15:17:04 GMT

Hello,

I think the reproducibility issue is indeed linked to multithreaded and/or distributed computing.

You could try running the HPCLUS procedure single-threaded on a single machine, like this:

options cpucount=1 NOTHREADS;

PROC HPCLUS data=;
...;
performance nodes=0 NTHREADS=1;
run;

To compare the different clustering results and check whether they overlap enough, you can use the techniques described in this paper:

SAS Global Forum 2019 -- Paper 3409-2019
How to Evaluate Different Clustering Results?
Ralph Abbey, SAS Institute Inc.

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3409-2019.pdf
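As a rough illustration of what checking for "overlap" between two runs can look like, here is a minimal Python sketch of the adjusted Rand index, a standard agreement measure for two partitions. It is not necessarily one of the techniques in the paper above, and it is not SAS code. The function name and the label lists are hypothetical; in practice the labels would be the cluster IDs that two HPCLUS runs assign to the same observations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same observations.

    1.0 means identical partitions (cluster IDs may be renamed);
    values near 0 mean agreement no better than chance.
    """
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: trivially identical

    # Count pairs of observations that share a cluster in both runs,
    # in run A only, and in run B only.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = in_a * in_b / total_pairs
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every point in one cluster
    return (both - expected) / (maximum - expected)


# Hypothetical cluster IDs from three runs on the same six observations.
run_1 = [1, 1, 2, 2, 3, 3]
run_2 = [2, 2, 3, 3, 1, 1]   # same partition, clusters merely renumbered
run_3 = [1, 1, 2, 2, 3, 1]   # one observation moved to another cluster

same_partition = adjusted_rand_index(run_1, run_2)
one_point_moved = adjusted_rand_index(run_1, run_3)
```

Because the index ignores how clusters are numbered, a run that only relabels its clusters still scores 1.0, which is usually the behaviour you want when comparing repeated runs.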

Cheers,

Koen

Re: Reproducibility of the results - Hp Clus Procedure in SAS Enterprise Miner

marathon2 — Fri, 07 Jan 2022 20:46:10 GMT

Your reply doesn't really address OP's main concern.

Are you confirming that one should expect HPCLUS PROC to produce different clustering results in different runs, even if the same seed is set? If so, how much should one expect the results to change from run to run (does the HPCLUS implementation have some nice statistical convergence properties despite differing results)?

Is there ANY way to ensure that PROC HPCLUS results are reproducible in different sessions given a fixed seed (in Enterprise Miner)?

I understand that PROC HPCLUS can use distributed computing/parallel processing and all that. However, the fact that a seed (or other parameters) doesn't guarantee the same outputs in different runs might sound concerning to management. If you were an analyst, how would you convince your manager to use SAS PROC HPCLUS (not reproducible) over R/Python packages (reproducible when a seed is set)?

Re: Reproducibility of the results - Hp Clus Procedure in SAS Enterprise Miner

sbxkoenk — Sat, 08 Jan 2022 00:40:55 GMT

Hello,

Yes, I would say that observing (very) small differences is not unexpected with the SAS® Enterprise Miner™ High-Performance Procedures (like HPFOREST, HPCLUS, ...), even if the same seed is used.

The reason for the difference is the random variation that is associated with multi-threading.
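One commonly cited mechanism for this kind of variation is that floating-point addition is not associative, so partial results combined in a different order can differ in the last bits. Whether this is exactly what happens inside HPCLUS is an assumption here, not something confirmed in this thread. The Python sketch below (not SAS code, with a hypothetical `combine` helper) only shows the arithmetic effect:

```python
def combine(partials):
    """Add up partial sums in the order they arrive."""
    total = 0.0
    for value in partials:
        total += value
    return total


# Hypothetical partial sums, as if produced by three worker threads.
partials = [0.1, 0.2, 0.3]

arrival_order_a = combine(partials)        # (0.1 + 0.2) + 0.3
arrival_order_b = combine(partials[::-1])  # (0.3 + 0.2) + 0.1

# The two totals differ in the last bit even though the inputs are identical.
difference = abs(arrival_order_a - arrival_order_b)
```

A difference this small is normally harmless, but in an iterative algorithm such as k-means it can occasionally tip a borderline observation into a different cluster, after which the runs diverge visibly.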

You can get 100% reproducible results by disabling multi-threading with

performance nthreads=1;

[ NOTE: The SAS system options THREADS | NOTHREADS apply to the client machine on which the SAS high-performance analytical procedures execute. They do not apply to the compute nodes in a distributed environment. ]

If you prefer to have repeatability | reproducibility over performance, then try NTHREADS=1 until you encounter a situation in which doing so is not a practical solution. At that time, you can remove the NTHREADS=1 specification and take advantage of multi-threading.

I no longer have access to Enterprise Miner (I use SAS Viya Model Studio now), so I do not know the equivalent of

performance nthreads=1;

in the Enterprise Miner properties banner.

Anyway, k-means (the algorithm behind HPCLUS) is a very special case. If you shuffle the observations (i.e. change their order), you will also get different results. That is inherent to the k-means algorithm and to how the initial seeds are chosen.
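The order dependence can be seen in a toy example. The Python sketch below (not SAS code) runs a plain one-dimensional k-means that, as a simplifying assumption, takes the first k observations as initial centroids. HPCLUS may choose its seeds differently, and the function name and data are hypothetical; the point is only that the same data in a different order can converge to a different partition.

```python
def kmeans_1d(points, k, max_iter=100):
    """Plain Lloyd-style k-means on 1-D data.

    Assumption for illustration: the first k observations are used as
    the initial centroids, so the result depends on the row order.
    Returns the final partition as a set of frozensets.
    """
    centroids = [float(x) for x in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: abs(x - centroids[j])) for x in points
        ]
        if new_assignment == assignment:
            break  # converged: no observation changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return {
        frozenset(x for x, a in zip(points, assignment) if a == j)
        for j in range(k)
    }


original_order = [0, 1, 10, 11, 20, 21]
shuffled_order = [20, 21, 0, 1, 10, 11]  # same observations, different order

partition_original = kmeans_1d(original_order, k=2)
partition_shuffled = kmeans_1d(shuffled_order, k=2)
```

With the original order the middle points 10 and 11 end up grouped with 20 and 21; with the shuffled order they end up grouped with 0 and 1. Both are valid local optima of the same objective.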

Koen