# Methodology
In the CT.gov database, clinical site information is registered as free text, with no unique identifier assigned. As a result, a single named entity may appear under multiple name variations. Moreover, the geo point that the database provides for each entity is resolved to the city level at best. (Our pipeline assigns more precise geo points to sites registered with a valid zip code.) To harmonize the name variations of clinical sites, we apply a clustering algorithm that groups names by their syntactic similarity and geolocation. We assume that name variations sharing the same geo point and similar syntax represent the same entity. For example, if two sites are registered as University College London Hospitals and University College London Hospitals (UCLH) and share the same geo point, we assume that the two name variations represent the same entity.

# Input Data
Features: geo point, site name (vectorized)

# Model
Algorithm: DBSCAN
Hyperparameters: min_samples=2, eps=0.025, metric="cosine"
[Code recipe](recipe:compute_facilities_id_lookup_1) 

This task uses the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm from scikit-learn. DBSCAN identifies clusters of data points based on their density in the feature space, which allows it to discover clusters of arbitrary shape and to handle noise effectively. The key idea is to define clusters as dense regions of data points separated by regions of lower point density. Because DBSCAN does not require the number of clusters to be specified in advance, within each geo point group it forms as many clusters as the data supports: name variations that are highly similar are grouped together, while those with little similarity to any other name are left unassigned. We set the hyperparameters of the DBSCAN model by empirically validating the clustering results.
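
The per-group clustering described above can be sketched roughly as follows. This is a minimal illustration, not the code of the linked recipe: the function name `cluster_site_names`, the input layout, and the character n-gram TF-IDF vectorizer are assumptions, since the page does not state how site names are vectorized. Only the DBSCAN hyperparameters (`min_samples=2`, `eps=0.025`, `metric="cosine"`) come from this page.

```python
from collections import defaultdict

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_site_names(sites, eps=0.025, min_samples=2):
    """Cluster site names within each geo point group (hypothetical helper).

    sites: iterable of (geo_point, site_name) pairs.
    Returns {geo_point: [(site_name, label), ...]}; label -1 marks noise.
    """
    # Group the names by geo point so that clustering runs per group.
    groups = defaultdict(list)
    for geo_point, name in sites:
        groups[geo_point].append(name)

    clustered = {}
    for geo_point, names in groups.items():
        # Assumption: character n-gram TF-IDF stands in for the
        # unspecified vectorization step of the actual pipeline.
        vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
        vectors = vectorizer.fit_transform(names)

        # Hyperparameters as listed on this page.
        model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
        labels = model.fit_predict(vectors)

        clustered[geo_point] = [(n, int(l)) for n, l in zip(names, labels)]
    return clustered


# Illustrative input; the geo point strings are made up for the example.
sites = [
    ("51.52,-0.13", "University College London Hospitals"),
    ("51.52,-0.13", "university college london hospitals"),
    ("51.52,-0.13", "Great Ormond Street Hospital"),
    ("44.02,-92.47", "Mayo Clinic"),
]
result = cluster_site_names(sites)
```

With `eps=0.025`, two names must be almost identical in cosine distance to be linked, so how much variation is tolerated depends heavily on the vectorizer. Under the TF-IDF assumption used here, the two case variants fall into one cluster, while the unrelated name and the lone name in the second group receive the noise label `-1`.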

![Screenshot 2023-10-02 at 13.23.59.png](uQL9zARJwgMM)
Figure 1: Site name variations clustering for a geo point group. 
The figure shows the vectorized site name variations in a 2D projection. The DBSCAN model assigns labels to site name variations based on their similarity. The algorithm treats isolated vectors as outliers and leaves them unlabeled.
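
A 2D view like the one in Figure 1 could be produced along these lines. The page does not say which projection method was used for the figure, so the truncated SVD below, the TF-IDF vectorizer, and the helper name `project_names_2d` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def project_names_2d(names, random_state=0):
    """Project vectorized site names to 2D coordinates (hypothetical helper)."""
    # Assumption: same stand-in vectorizer as a character n-gram TF-IDF.
    vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(names)
    # Assumption: truncated SVD as the projection; the figure may use another method.
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    return svd.fit_transform(vectors)


names = [
    "University College London Hospitals",
    "university college london hospitals",
    "Great Ormond Street Hospital",
    "Royal Free Hospital",
]
coords = project_names_2d(names)  # one (x, y) row per name
```

The resulting coordinates can be passed to any plotting library and colored by DBSCAN label. Names that vectorize identically land on the same point, and outliers appear as isolated points.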



