topic Re: Unsupervised clustering in SAS Data Science

Unsupervised clustering

chuie — Tue, 27 Nov 2018 22:30:34 GMT

Hi There,

I found this diagram and article in which the authors first built a statistical model to identify the high-risk group and then ran unsupervised clustering on that group.

I am not sure what the point of the unsupervised clustering is, since the statistical modeling already tells us which features (variable importance, nodes, etc.) characterize the high-risk group.

It is a great article, but I couldn't understand the logic behind this step.

Please help.

Thanks

https://yougottabelieve.info/case-control-study-vs-cohort-study-retrospective


Re: Unsupervised clustering

Reeza — Tue, 27 Nov 2018 23:41:31 GMT

The link doesn't work. From the diagram, it looks like the High Risk group was used for the unsupervised clustering, and this is usually done to tell us what we don't know. Yes, we know some variable importance, but exactly how that falls out for this subgroup may be different. The clustering results may run counter to what we expect, so it's a good step to go through to either confirm or reject assumptions.

Re: Unsupervised clustering

chuie — Wed, 28 Nov 2018 18:52:53 GMT

For some reason the link to the article doesn't work, so I have pasted the unsupervised clustering section below. I still do not get its purpose, since the PCA and the relevant variables for this high-risk group were already obtained from the decision tree / variable importance chart.

Unsupervised Clustering: Subgroup Analysis of High-Risk Patients
We used principal component analysis [21] to reduce high
dimensional EMR features and identify clinically relevant
groups of patients of high risk for 6-month ED visit with similar
patterns of demographics, primary diagnosis and procedure,
and chronic disease conditions. The features for high-risk
patients were projected to a lower dimensional subspace with
largest variances. The K-means algorithm was applied to find
potential patient patterns for future 6-month ED visit [22]. We
used K=6 to generate the final six clusters. The technical details
are described in Multimedia Appendix 9. Clustering patterns
between retrospective and prospective cohorts were compared
to further validate our high-risk case finding algorithm. As part
of the health care management platform, our predictive model
was integrated onto a Web-based dashboard to provide a
real-time visualization of the population profile with ED
6-month visits.
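
As a rough illustration of the two steps the excerpt describes (project the features onto the directions of largest variance, then run K-means with K=6 on the projected scores), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the data are synthetic, the helper names pca_scores and kmeans are made up, and the choice of 5 input features and 2 components is arbitrary because the excerpt does not say how many components were kept. It is not the authors' code.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract each column's mean so variance is measured around the mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def top_component(rows, rng, iters=200):
    """Power iteration on X^T X: approximates the unit direction of largest variance."""
    w = [rng.gauss(0, 1) for _ in rows[0]]
    norm = math.sqrt(dot(w, w))
    w = [x / norm for x in w]
    for _ in range(iters):
        proj = [dot(row, w) for row in rows]           # X w
        v = [dot(proj, col) for col in zip(*rows)]     # X^T (X w)
        norm = math.sqrt(dot(v, v))
        if norm == 0:
            break
        w = [x / norm for x in v]
    return w


def pca_scores(rows, n_components, seed=0):
    """Return (scores, components) using power iteration with deflation."""
    rng = random.Random(seed)
    centered = center(rows)
    resid, comps = centered, []
    for _ in range(n_components):
        w = top_component(resid, rng)
        comps.append(w)
        # Deflation: remove what w already explains before finding the next direction.
        deflated = []
        for row in resid:
            s = dot(row, w)
            deflated.append([x - s * wj for x, wj in zip(row, w)])
        resid = deflated
    scores = [[dot(row, w) for w in comps] for row in centered]
    return scores, comps


def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm with Euclidean centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Synthetic stand-in for "high-risk patients": 120 rows, 5 made-up features.
rng = random.Random(42)
group_means = [[rng.uniform(-10, 10) for _ in range(5)] for _ in range(6)]
patients = [[m + rng.gauss(0, 1) for m in group_means[i % 6]] for i in range(120)]

scores, components = pca_scores(patients, n_components=2)
labels, centroids = kmeans(scores, k=6)
print(sorted(labels.count(c) for c in range(6)))  # cluster sizes
```

The point of the sketch is that the clustering sees only the high-risk rows, so the groups it finds describe how those patients differ from one another, which a model trained to separate high risk from low risk does not report directly.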

Re: Unsupervised clustering

Reeza — Wed, 28 Nov 2018 18:58:24 GMT

>The technical details are described in Multimedia Appendix 9
Do you have access to that?

Re: Unsupervised clustering

chuie — Wed, 28 Nov 2018 20:18:57 GMT

It just explains how to do it, not why 🙂

********************************************

Multimedia Appendix 9. Unsupervised clustering of high risk population using PCA.

To reduce high dimensional EMR features for detecting cohort pattern, we used principal component analysis (PCA) to divide the high risk patients of future 6-month ED visit identified by our algorithm in the prospective cohort into distinctive groups, based on demographics, primary diagnosis and procedure, and chronic disease conditions. The features for high-risk patients are projected to a lower dimensional subspace with largest variances.

Where Xi is EMR feature matrix for each high-risk patient, and wk is the set of vectors of weights that map each patient feature vector Xi to a new vector of principal component scores Ti,k. And we computed w1 by solving following objective functions (1) and (2) and wk by iterating objective function (3) based on the first k-1 principal components,

And then K-means algorithm was applied on the top of principal components Ti,k subspace of PCA to find potential patient patterns for future 6-month ED visit. We used K=6 to implement initial k means set for the algorithm and calculate the Euclidean centroid m to generate final clusters,

Where Ci is the ith cluster in total 6 clusters, and x represents the previous principal components Tk.

Unique patterns revealed by the clustering results were analyzed to characterize the high-risk subjects identified by our ED algorithm.
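
The equations the appendix refers to as (1), (2) and (3) did not come through in the pasted text. For orientation only, the textbook forms of PCA and K-means that match the wording above are sketched below. This is an assumption: the appendix may use different notation or numbering, and the symbols $X$, $x_i$, $w_k$, $t_{i,k}$, $C_i$ and $m_i$ are chosen here to mirror the names in the text.

```latex
% Score of patient i on component k (x_i is row i of the centered feature matrix X)
t_{i,k} = x_i \cdot w_k

% First component: the unit vector that maximizes the variance of the scores
w_1 = \arg\max_{\lVert w \rVert = 1} \sum_i (x_i \cdot w)^2
    = \arg\max_{w} \frac{w^{\top} X^{\top} X w}{w^{\top} w}

% k-th component: remove the first k-1 components, then repeat the maximization
\hat{X}_k = X - \sum_{s=1}^{k-1} X w_s w_s^{\top}, \qquad
w_k = \arg\max_{\lVert w \rVert = 1} \lVert \hat{X}_k w \rVert^2

% K-means with K = 6: choose clusters C_1..C_6 that minimize the squared
% Euclidean distance of each point x to its cluster centroid m_i
\arg\min_{C_1,\dots,C_6} \sum_{i=1}^{6} \sum_{x \in C_i} \lVert x - m_i \rVert^2,
\qquad m_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```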