# Introduction
The [Clinical Sites Harmonization](flow_zone:NHLQLZ3) flow zone implements a harmonization pipeline that addresses data quality issues in the facility dataframe from the CT.gov dataset. For more information about the problem statement and the methodology applied in this pipeline, see [Clustering Analysis](article:9).

# Recipes & Dataframes
![Screenshot 2024-04-12 at 10.29.29.png](t6SAfDp5PnDF)

**Outputs**
|   Recipe(s) | Description |   Output Dataset  |
|------------|------------|------------|
|[prepare_recipe1](recipe:compute_facilities_zip_strip), [join_recipe2](recipe:compute_facilities_goecode), [code_recipe3](recipe:compute_facilities_geocode_filtered), [prepare_recipe4](recipe:compute_facilities_geocode_mapped), [window_recipe5](recipe:compute_facilities_geocode_mapped_windows), [code_recipe6](recipe:compute_facilities_geocode_coalesce)|1) Extracts the geopoint from the zip code and harmonizes location information; 2) Maps to the country-zip crosswalk; 3) Preprocesses text for city and site names; 4) Creates geopoints from the reference; 5) Removes one-to-many mappings; 6) Coalesces the mapped geopoint and the geopoint from the source, then fills cities and regions based on the mapped geopoint|[facilities_geocode_coalesce](dataset:facilities_geocode_coalesce)|
|[distinct_recipe1](recipe:compute_facilities_distinct_prepared), [code_recipe2](recipe:compute_facilities_id_lookup_1), [join_recipe3](recipe:compute_facilities_prepared), [group_recipe4](recipe:compute_facilities_prepared_by_Facility_ID), [window_recipe5](recipe:compute_facilities_preferred_name), [join_recipe6](recipe:compute_facilities_w_preferred_name)|1) Selects distinct clinical site entities; 2) Clusters entities based on the similarity of site names within each geopoint; 3) Joins the site ID with the facility dataframe; 4-5) Selects the most commonly used name variation per site as the preferred name; 6) Joins the preferred name with the facility dataframe|[facilities_w_preferred_name](dataset:facilities_w_preferred_name)|
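
To make the second row of the table more concrete, here is a minimal Python sketch of steps 2 and 4-5: grouping similar site names that share a geopoint, then picking the most frequent name variation per site. It is an illustration only and not the code of the recipes above. The sample records, the `SIMILARITY_THRESHOLD` value, the function names, and the use of `difflib.SequenceMatcher` with a greedy grouping are all assumptions; the actual method is described in [Clustering Analysis](article:9).

```python
from collections import Counter, defaultdict
from difflib import SequenceMatcher

# Hypothetical sample rows (site name, geopoint); not taken from the CT.gov data.
RECORDS = [
    ("Mayo Clinic", "POINT(-92.46 44.02)"),
    ("Mayo Clinic", "POINT(-92.46 44.02)"),
    ("Mayo Clinic Rochester", "POINT(-92.46 44.02)"),
    ("St. Marys Hospital", "POINT(-92.46 44.02)"),
    ("Johns Hopkins Hospital", "POINT(-76.59 39.29)"),
    ("Johns Hopkins Hosp.", "POINT(-76.59 39.29)"),
    ("Johns Hopkins Hospital", "POINT(-76.59 39.29)"),
]

# Assumed cutoff for treating two names as the same site.
SIMILARITY_THRESHOLD = 0.6


def similar(a, b):
    """Case-insensitive string similarity check (stand-in for the real metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY_THRESHOLD


def assign_site_ids(records):
    """Greedy grouping: within each geopoint, a name joins the first
    cluster whose representative name it resembles, else starts a new one."""
    names_by_point = defaultdict(list)
    for name, point in records:
        if name not in names_by_point[point]:
            names_by_point[point].append(name)

    site_ids = {}
    next_id = 0
    for point, names in names_by_point.items():
        representatives = []  # (representative name, site id) pairs
        for name in names:
            for rep, site_id in representatives:
                if similar(name, rep):
                    site_ids[(name, point)] = site_id
                    break
            else:
                representatives.append((name, next_id))
                site_ids[(name, point)] = next_id
                next_id += 1
    return site_ids


def preferred_names(records, site_ids):
    """Pick the most frequently used name variation for each site id."""
    counts = defaultdict(Counter)
    for name, point in records:
        counts[site_ids[(name, point)]][name] += 1
    return {site_id: c.most_common(1)[0][0] for site_id, c in counts.items()}


site_ids = assign_site_ids(RECORDS)
preferred = preferred_names(RECORDS, site_ids)
```

In this sketch, names are only ever compared when they share a geopoint, so identical names at different locations stay separate sites. Joining `preferred` back to the facility rows on the site ID corresponds to step 6.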
