In this zone, documents are analyzed with respect to the keywords that were set in the project setup.

![Window Extraction : Sentiment Analysis.png](j2vrmUraeA8l)

The window extraction and sentiment analysis flow zone searches each document for terms from a keyword list or a category list, extracts a text window around every match, and applies [FinBERT](https://huggingface.co/ProsusAI/finbert) to each window to analyze its sentiment.

- **Input**: this flow zone takes a [folder](managed_folder:cPB8dLy4) as input and applies a prepare recipe to simplify the text and company names.
- **Text Extraction**: the [text extraction recipe](recipe:compute_WSwcnOo2) takes the prepared dataset and extracts a window of text around each match of the keywords / categories entered. The window threshold (i.e. the number of characters surrounding the keyword) can be customized; the default is 100 characters.
    - _Keyword Category Extraction_: the search file of terms, stored in [keywords](dataset:keywords), is used to search for multiple keywords pertaining to a specific category (e.g. Environmental, Social, and Governance). Any keyword "hit" creates a window surrounding that keyword. Instead of storing the keyword as the key of the dictionary, the category is stored (e.g. "Environmental" instead of the keyword "global"). The concatenated text string of all extracted windows in one document is used for topic modeling in the subsequent flow zone.
- **FinBERT Sentiment Analysis**: once a window has been extracted from the document, it is sent through a pre-trained FinBERT model to analyze the sentiment. For each window, FinBERT predicts a sentiment score between -1 and 1 along with a negative, neutral, or positive label.
- **Count Occurrences**: an [additional python recipe](recipe:compute_count_occurence) is applied to the [category](managed_folder:WSwcnOo2) to aggregate the total number of references per category. This is used downstream in the [time series frequency analysis dashboard](dashboard:ti5aDZY).
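
The extraction and counting steps above can be sketched roughly as follows. This is a minimal illustration, not the code of the project's recipes: the function names `extract_windows`, `concatenate_windows`, and `count_occurrences`, the case-insensitive whole-word matching, and the in-memory dictionary of keywords per category are all assumptions made for the example.

```python
import re


def extract_windows(text, keywords_by_category, window=100):
    """Return {category: [window, ...]} for every keyword hit in `text`.

    `window` is the number of characters kept on each side of the keyword
    (100 mirrors the default mentioned above). Matching is case-insensitive
    on whole words, which is an assumption of this sketch.
    """
    windows = {}
    for category, keywords in keywords_by_category.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                start = max(0, match.start() - window)
                end = min(len(text), match.end() + window)
                # The category, not the keyword, is used as the dictionary key.
                windows.setdefault(category, []).append(text[start:end])
    return windows


def concatenate_windows(windows):
    """Join all extracted windows of one document into a single string."""
    return " ".join(w for hits in windows.values() for w in hits)


def count_occurrences(windows):
    """Aggregate the number of keyword hits per category."""
    return {category: len(hits) for category, hits in windows.items()}


if __name__ == "__main__":
    document = "We cut emissions globally. The board approved a new policy."
    keywords = {"Environmental": ["emissions"], "Governance": ["board"]}
    found = extract_windows(document, keywords, window=10)
    print(found)
    print(count_occurrences(found))
```

Categories with no hit are simply absent from the result, so a downstream count should treat a missing category as zero.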

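One common way to obtain a score between -1 and 1 from a three-class sentiment model is to subtract the negative-class probability from the positive-class probability. The sketch below assumes that convention; the source does not state how the project computes its score, and `window_sentiment` is a hypothetical helper. It takes the class probabilities as a plain dictionary so that it runs without downloading the model.

```python
def window_sentiment(probabilities):
    """Turn class probabilities for one window into (score, label).

    `probabilities` maps "positive", "negative", and "neutral" to values
    that sum to 1, e.g. the softmax output of a FinBERT-style classifier.
    The score P(positive) - P(negative) is an assumed convention, chosen
    because it falls in [-1, 1]; the label is the most probable class.
    """
    score = probabilities["positive"] - probabilities["negative"]
    label = max(probabilities, key=probabilities.get)
    return score, label


if __name__ == "__main__":
    example = {"positive": 0.7, "negative": 0.1, "neutral": 0.2}
    print(window_sentiment(example))
```

With this convention, a window that is almost entirely neutral scores close to 0 while still carrying the neutral label, which keeps the score and the label complementary rather than redundant.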