The topic modeling flow zone surfaces abstract “topics” that occur across the document corpus. It runs [LDA topic modeling](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html) over the concatenated text produced by the keyword search in the previous flow zone. Additional NLP analysis is performed in a document database.

![Topic Modeling.png](5RRYGXzJm0J6)

- **Input**: the [category](managed_folder:WSwcnOo2) managed folder contains the aggregated text windows (200-character chunks) extracted from each document by running a keyword search over the input text file. The text column "key_word_text" is the input to the LDA model. For example, when a document is sent through the pipeline, each keyword match yields a 200-character chunk. These chunks are then concatenated into one long string that serves as the input to the LDA topic modeling algorithm. This reduces noise by passing in only the relevant passages instead of the entire 200+ page 10-K report.
- **Topic Modeling**: Latent Dirichlet Allocation (LDA) is a topic model that assigns the text in a document to particular topics. We run this model over the extracted text to cluster the document corpus into three abstract topics.
- **Word Clouds**: A word cloud is generated for each topic found by the LDA model to illustrate the most common words per topic. Each word cloud is saved as an insight and published on the [interactive document intelligence dashboard](dashboard:ti5aDZY).
- **Output**: The trained LDA topic model is saved to the [topic modeling insights folder](managed_folder:VnceDBRs). The [pyLDAvis insight](insight:pyldavis_full) provides an interactive visualization of the top 30 words per topic, based on the results of the LDA topic model. This view and the word clouds for each topic are published on the [interactive document intelligence dashboard](dashboard:ti5aDZY). Please note that the LDA topic model displayed in the dashboard was pre-trained on a larger document corpus so that it shows illustrative cluster results.
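
The input step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual recipe: the function names `extract_keyword_windows` and `build_key_word_text` are hypothetical, and the choice to centre each 200-character window on the start of the keyword match is an assumption, since the exact window placement is not stated here.

```python
import re


def extract_keyword_windows(text, keywords, window=200):
    """Return one chunk of up to `window` characters for each keyword match."""
    if not keywords:
        return []
    # Match any of the keywords, case-insensitively.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    chunks = []
    for match in pattern.finditer(text):
        # Assumption: the window is centred on the start of the match.
        start = max(0, match.start() - window // 2)
        chunks.append(text[start:start + window])
    return chunks


def build_key_word_text(text, keywords, window=200):
    """Concatenate all keyword windows into the single string fed to LDA."""
    return " ".join(extract_keyword_windows(text, keywords, window))
```

Under these assumptions, a document with no keyword matches yields an empty string, so it contributes no text to the topic model.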
