# Methodology
We use eligibility criteria and therapeutic area to build the study similarity index. The underlying assumption is that study protocols in the same therapeutic area with similar eligibility criteria target a similar patient population and therefore compete for the same pool of clinical research sites. The study similarity index can thus be used to identify potential sites for a given study protocol.

# Feature Preprocessing
 1. Vectorization
 
  To extract information from the unstructured text of the eligibility criteria and therapeutic area, we rely on transfer learning: a pre-trained large language model embeds the unstructured text into vectorized features (see [Sentence embedding](article:8)). The feature engineering pipeline also converts the structured features into values between 0 and 1.
  
 2. Normalization
 
  As a result of feature vectorization, features can differ in dimensionality. We apply a normalization step to prevent the similarity index from overweighting features with more dimensions. Each feature vector is first divided by its own norm; all normalized features are then divided by their total norm, so that the full vector of each study has a norm of 1.

 <div class="alert">
 We empirically assign a weight of 2 to the therapeutic area feature to encourage the similarity index to prioritize study protocols in the same disease domain.
</div>

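The normalization and weighting steps described above can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name `combine_features`, the feature names, the embedding size of 384, and the choice to apply the weight by multiplying the unit-norm feature block are all assumptions made for the example.

```python
import numpy as np


def combine_features(features, weights=None):
    """Build one unit-norm study vector from per-feature vectors.

    features: dict mapping feature name -> 1-D array (any length).
    weights:  optional dict mapping feature name -> scalar weight (default 1.0).
    """
    weights = weights or {}
    blocks = []
    for name, vec in features.items():
        vec = np.asarray(vec, dtype=np.float32)
        norm = np.linalg.norm(vec)
        # Step 1: divide each feature by its own norm so that its
        # dimensionality does not determine its influence.
        # All-zero vectors are left unchanged to avoid division by zero.
        unit = vec / norm if norm > 0 else vec
        # Assumed weighting scheme: scale the unit-norm block by its weight.
        blocks.append(weights.get(name, 1.0) * unit)
    study = np.concatenate(blocks)
    # Step 2: divide by the total norm so the full study vector has norm 1.
    total = np.linalg.norm(study)
    return study / total if total > 0 else study


# Hypothetical example: two text embeddings and one encoded categorical feature.
rng = np.random.default_rng(0)
study_vec = combine_features(
    {
        "inclusion_criteria": rng.normal(size=384),
        "therapeutic_area": rng.normal(size=384),
        "cohort_sex": np.array([1.0, 0.0, 0.0]),
    },
    weights={"therapeutic_area": 2.0},
)
```

Under this assumed scheme, the weights enter the dot product between two study vectors squared, so a weight of 2 gives the therapeutic area four times the contribution of an unweighted feature.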
 
# Input Data
**Features**:

| Feature | Type | Values | Description |
|---------|------|--------|-------------|
| Cohort sex | Categorical | ALL, MALE, FEMALE | The sex of the study cohort |
| Cohort age group | Categorical | CHILD, ADULT, OLDER_ADULT | The age group of the study cohort |
| Brief summary | Free text | | A brief description of the study |
| Inclusion criteria | Free text | | A detailed description of the criteria patients must meet to be enrolled |
| Exclusion criteria | Free text | | A detailed description of the criteria that exclude patients from enrollment |
| MeSH conditions | Free text | | MeSH terms representing the therapeutic area of the study |

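One possible way to map the categorical features in the table to values between 0 and 1 is one-hot encoding. The sketch below is an assumption for illustration only: the encoding scheme and the helper name `one_hot` are hypothetical, and only the category values come from the table.

```python
import numpy as np

# Category values taken from the feature table; the encoding itself is assumed.
COHORT_SEX = ["ALL", "MALE", "FEMALE"]
COHORT_AGE_GROUP = ["CHILD", "ADULT", "OLDER_ADULT"]


def one_hot(value, categories):
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    vec = np.zeros(len(categories), dtype=np.float32)
    vec[categories.index(value)] = 1.0
    return vec


sex_vec = one_hot("FEMALE", COHORT_SEX)
age_vec = one_hot("ADULT", COHORT_AGE_GROUP)
```

Each encoded feature can then be treated like any other feature vector in the preprocessing steps.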

# Model
We use [Faiss](https://faiss.ai/index.html) to build the study similarity index, which scores pairs of studies by the cosine similarity between their vectorized protocol features. Faiss is a library for efficient similarity search and clustering of dense vectors in large datasets, written in C++ with Python bindings. Developed by Facebook AI Research, it provides algorithms for similarity indexing that are well suited to high-dimensional data such as image features or text embeddings.

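A cosine-similarity search of this kind can be sketched with an exact inner-product index, since the inner product of unit-norm vectors equals their cosine similarity. The example is illustrative rather than the deployed configuration: the function name `top_k_similar`, the choice of `IndexFlatIP`, and the random data are assumptions, and a NumPy fallback is included so the sketch also runs where Faiss is not installed.

```python
import numpy as np

try:
    import faiss  # optional; the NumPy fallback below is used if it is missing
except ImportError:
    faiss = None


def top_k_similar(study_vectors, query_vectors, k=5):
    """Return (scores, indices) of the k most cosine-similar studies per query.

    Rows are assumed to be non-zero vectors.
    """
    xb = np.ascontiguousarray(study_vectors, dtype=np.float32)
    xq = np.ascontiguousarray(query_vectors, dtype=np.float32)
    # Scale every row to unit norm so inner product equals cosine similarity.
    xb = xb / np.linalg.norm(xb, axis=1, keepdims=True)
    xq = xq / np.linalg.norm(xq, axis=1, keepdims=True)
    if faiss is not None:
        index = faiss.IndexFlatIP(xb.shape[1])  # exact inner-product search
        index.add(xb)
        return index.search(xq, k)
    # Fallback: brute-force inner products, sorted from most to least similar.
    sims = xq @ xb.T
    idx = np.argsort(-sims, axis=1)[:, :k]
    return np.take_along_axis(sims, idx, axis=1), idx


# Hypothetical data: 100 study vectors of dimension 32; query with the first 3.
rng = np.random.default_rng(0)
studies = rng.normal(size=(100, 32))
scores, ids = top_k_similar(studies, studies[:3], k=5)
```

Because each query here is itself one of the indexed studies, its nearest neighbor is the study itself; in practice the query study would typically be dropped from its own result list.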