<span id="version" style="color: grey; float: right">Version 1.1.1</span>

Many firms hold a large document corpus (both digitized and raw image) whose insights and trends remain largely untapped. In particular, firms are turning to unstructured data sources to capture additional attributes, adjust or confirm their analysis, and discover new trends. In practice, many organizations still rely on individuals to read sections of these documents or to search for relevant material ad hoc, with no systematic way to categorize and understand the information they contain.

This project gives business users a way to harness insights and trends within unstructured data. It accepts any document set (digitized and raw image) as input and sends each document through a modular, reusable pipeline that automatically digitizes documents, extracts text, and consolidates the data into a unified, searchable database. After consolidation, multiple NLP techniques prepare, categorize, and analyze the text based on a theme of interest (in this project: ESG), with additional theme modules available. The output of this flow is an interactive, purpose-built dashboard where business users can analyze high-level trends and drill down into aggregated insights.

## Project Structure 

 - A modular and reusable pipeline that rapidly and automatically digitizes documents, extracts text, and consolidates data into a unified and searchable database
 - NLP techniques that prepare, categorize, and analyze textual data based on a theme of interest (in this project: ESG), with additional theme modules available
 - A simple, interactive, purpose-built dashboard that lets business users analyze high-level trends and drill down into aggregated insights
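
The three stages above can be pictured with a minimal Python sketch. It is an illustration only, not the project's actual flow: the `Document` class, the `ESG_KEYWORDS` lists, and every function name are hypothetical, the OCR step is left as a stub, and simple keyword matching stands in for the NLP techniques the project applies. An in-memory SQLite table plays the role of the unified, searchable database.

```python
import re
import sqlite3
from dataclasses import dataclass

# Hypothetical keyword lists for the ESG theme (assumption, not from the project).
ESG_KEYWORDS = {
    "environmental": {"emissions", "climate", "energy"},
    "social": {"diversity", "community", "safety"},
    "governance": {"board", "audit", "compliance"},
}


@dataclass
class Document:
    name: str
    content: str
    is_image: bool = False


def extract_text(doc):
    """Stage 1: return plain text. Raw images would need an OCR step here."""
    if doc.is_image:
        # Stub: transcription (e.g. with Tesseract) is outside this sketch.
        raise NotImplementedError("OCR stage is not part of this sketch")
    return doc.content


def categorize(text, keywords=ESG_KEYWORDS):
    """Stage 2: tag text with every category whose keywords appear in it."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, words in keywords.items() if tokens & words)


def run_pipeline(docs, conn):
    """Stage 3: consolidate text and tags into one searchable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (name TEXT, body TEXT, categories TEXT)"
    )
    for doc in docs:
        text = extract_text(doc)
        conn.execute(
            "INSERT INTO documents VALUES (?, ?, ?)",
            (doc.name, text, ",".join(categorize(text))),
        )
    conn.commit()


def search(conn, term):
    """Return the names of documents whose body contains the term."""
    rows = conn.execute(
        "SELECT name FROM documents WHERE body LIKE ? ORDER BY name",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

Because each stage is a separate function, swapping the keyword matcher for a different theme module, or the stub for a real OCR step, would not require changes to the other stages.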
 
## Features 
 
- Parses all text documents that exist in digital form
- Makes quantitative trends straightforward to identify and track by category and time
- Formalizes the corpus of knowledge, keeping it permanently available and updated in the system
- Supports structured reporting that can be built, redesigned, and refreshed at will
- Leverages a suite of NLP plugins, including Tesseract for document transcription (OCR) and topic modeling for textual analysis
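
As a rough illustration of tracking trends by category and time, the sketch below counts tagged passages per year and category. The `trend_counts` function and the `(year, categories)` record shape are assumptions made for this example, not part of the project.

```python
from collections import Counter


def trend_counts(records):
    """Count tagged passages per (year, category) pair."""
    counts = Counter()
    for year, categories in records:
        for category in categories:
            counts[(year, category)] += 1
    return counts


# Hypothetical sample: each record is (year, categories found in one passage).
sample = [
    (2019, ["environmental"]),
    (2020, ["environmental", "governance"]),
    (2020, ["environmental"]),
]

for (year, category), n in sorted(trend_counts(sample).items()):
    print(year, category, n)
```

A table of this shape is the kind of aggregate a dashboard could chart over time or filter by category.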
 
## Value 

 - Lower costs associated with staff churn
 - Faster time-to-insight
 - Creation of a new resource that generates a previously unavailable level of rigorous insight into trends
 - Rapid deployment of an end-to-end solution through a combination of visual recipes, DSS plugins, and custom code recipes
 - Easy integration of a custom web app on top of the DSS flow, making NLP module results more accessible and interactive

## Potential User Teams

 - Wealth Management teams, who need to process unstructured fund information for clients.
 - Marketing teams, who need to understand trends emerging within unstructured publications from competitors or peers.
 - CSR teams, who need to understand what themes and levels of engagement are present in their industry.

Note that while this specific project tackles data in the financial services space, its approach and modular design allow it to be readily applied to other industries.
