## Methodology
Clustering US census tracts by similar Social Vulnerability Index (SVI) characteristics can reveal areas with undetected diseases (and other health outcomes), areas that may be more prone to certain diseases (and other health behaviours/habits) due to living conditions, and areas that have better access to healthcare support systems. This information can guide the efficient allocation of preventive resources and the prioritization of health/therapeutic access.

## Input Data
Tract-level percentile rankings of the 16 social vulnerability factors.

## Model
The model uses the KMeans algorithm, which partitions samples into clusters, each characterized by its center ("centroid"). The algorithm places samples as close as possible to their assigned centroid by minimizing a criterion called "inertia" (the within-cluster sum of squared distances). The number of clusters is set to 3 to capture low, medium, and high levels of aggregate social vulnerability that are ML-based, as opposed to the rule-based (equal-weighted) overall SVI or thematic SVI metrics.
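
To make the role of inertia concrete, here is a minimal standard-library sketch of the assign-and-update loop behind KMeans. It is not the implementation used to train this model: the `tracts` values are hypothetical, only two features are shown, and the deterministic farthest-first initialization in `farthest_first_init` is a simplification chosen so the example is reproducible. In practice a library implementation such as scikit-learn's `KMeans` would typically be used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_init(rows, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first row, then repeatedly add the row that is
    farthest from the centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        next_row = max(
            rows, key=lambda r: min(squared_distance(r, c) for c in centroids)
        )
        centroids.append(list(next_row))
    return centroids


def kmeans(rows, k=3, max_iter=100):
    """Alternate assignment and centroid updates until labels stop changing."""
    centroids = farthest_first_init(rows, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Inertia: sum of squared distances from each row to its centroid.
    inertia = sum(
        squared_distance(row, centroids[lab]) for row, lab in zip(rows, labels)
    )
    return labels, centroids, inertia


# Hypothetical SVI percentiles for nine tracts (two features, illustration only).
tracts = [
    (0.05, 0.10), (0.10, 0.08), (0.12, 0.15),
    (0.45, 0.50), (0.50, 0.55), (0.55, 0.48),
    (0.90, 0.85), (0.88, 0.95), (0.95, 0.92),
]

labels, centroids, inertia = kmeans(tracts, k=3)

# Name clusters low/medium/high by the mean percentile of their centroid.
order = sorted(range(3), key=lambda c: sum(centroids[c]) / len(centroids[c]))
level_names = dict(zip(order, ["low", "medium", "high"]))
levels = [level_names[lab] for lab in labels]

print(levels)
print("inertia:", round(inertia, 4))
```

Cluster indices returned by KMeans carry no inherent order, so the last step ranks the centroids by their mean percentile before attaching the low, medium, and high names.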

## Explainability Metrics
The mean and standard deviation of every feature are calculated independently for each cluster.
1. The clustering profiles show the most important features in terms of how much their distributions differ between clusters.
2. The clustering heatmap shows the difference in feature values between clusters. The relative importance of each feature to a cluster (which drives the sorting) is based on the mean and standard deviation of the values in the cluster, along with the number of records, compared with the global values across the input data ([t-score of Welch's t-test](https://en.wikipedia.org/wiki/Welch%27s_t-test)).
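
The per-cluster statistics and the Welch t-score can be sketched as follows. This is an illustrative calculation, not the exact computation performed by DSS: the feature values and cluster labels are hypothetical, the helper `welch_t_score` is introduced here for illustration, and the use of sample (n - 1) variance is an assumption.

```python
from math import sqrt
from statistics import mean, stdev, variance


def welch_t_score(cluster_values, global_values):
    """Welch's t-score comparing one cluster's values with all input values.

    t = (mean_cluster - mean_global) / sqrt(var_cluster/n_cluster + var_global/n_global)
    Sample variance is used here, which is an assumption of this sketch.
    """
    n_cluster, n_global = len(cluster_values), len(global_values)
    standard_error = sqrt(
        variance(cluster_values) / n_cluster + variance(global_values) / n_global
    )
    return (mean(cluster_values) - mean(global_values)) / standard_error


# Hypothetical percentiles of a single feature, with hypothetical cluster labels.
feature_values = [0.05, 0.10, 0.12, 0.45, 0.50, 0.55, 0.90, 0.88, 0.95]
cluster_labels = ["low"] * 3 + ["mid"] * 3 + ["high"] * 3

summary = {}
for name in ("low", "mid", "high"):
    members = [v for v, lab in zip(feature_values, cluster_labels) if lab == name]
    summary[name] = {
        "mean": mean(members),
        "std": stdev(members),
        "t_score": welch_t_score(members, feature_values),
    }

for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

A large positive or negative t-score means the cluster's values for that feature sit well above or below the global mean relative to their spread, so ranking features by the absolute t-score gives one plausible ordering of importance per cluster.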


## Analysis
Machine Learning Lab visualizations and insights let you explore the results of clustering models. DSS automatically generates the most prominent observations about each cluster, with explanations of the relative distribution and feature values.

Three segments of census tracts were created from 16 tract-level social factor rankings (percentiles). These rankings are used to identify potentially vulnerable communities and align to the themes of socioeconomic status, household characteristics, racial and ethnic minority status, and housing type/transportation. The segments classify tracts as having low, mid, or high potential social vulnerability. Unsupervised clustering uses ML-driven signals from differences in these rankings across census tracts to drive segment assignment, as opposed to the rule-based (equal-weighted) aggregate rankings that make up the CDC/ATSDR overall SVI index.