# Project Requirements 

## Version

This solution was designed on, and is compatible with, instances running **Dataiku DSS 11** and above. Previous versions of DSS may require slight modifications to the project flow.

Please note that flow runtime will vary depending on your server specifications.

## Code Environment

It is recommended to create a Python 3.6 code environment called **solution_document-intelligence** and set that code environment as the default code environment for the project.

Required packages for this code environment are:

- PyMuPDF==1.18.19
- regex==2022.10.31
- pyldavis==3.2.2
- dash==2.7.0
- nltk==3.6.7
- torch==1.10.2
- transformers==4.16.2
- weasyprint==54.3
- seaborn==0.11.2
- scikit-learn==0.24.2
- wordcloud==1.8.2.2
- tokenizers==0.10.3
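
To confirm that the code environment actually contains the pinned versions, a small check can be run from a notebook that uses the environment. The following is a minimal sketch and not part of the solution itself: the helper names `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical, and the fallback to `pkg_resources` assumes setuptools is present on Python versions older than 3.8.

```python
# Pinned requirements, copied from the list above.
PINNED = """
PyMuPDF==1.18.19
regex==2022.10.31
pyldavis==3.2.2
dash==2.7.0
nltk==3.6.7
torch==1.10.2
transformers==4.16.2
weasyprint==54.3
seaborn==0.11.2
scikit-learn==0.24.2
wordcloud==1.8.2.2
tokenizers==0.10.3
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Assumption: on Python 3.6/3.7, setuptools' pkg_resources is available.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (expected, found)} for every package that differs."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(parse_pins(PINNED)).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the expected version; a `found` value of `None` means the package is missing from the environment.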

### Resources

Some resources are packaged in the code environment using the Resources menu in the code environment tab. More information about code environment resources can be found [here](https://doc.dataiku.com/dss/latest/code-envs/operations-python.html#code-env-resources-directory). The initialization script to create them is below:

```python
from dataiku.code_env_resources import clear_all_env_vars
from dataiku.code_env_resources import set_env_path
from dataiku.code_env_resources import set_env_var
import os

# Clears all environment variables defined by previously run script
clear_all_env_vars()

## Hugging Face
# Set the Hugging Face cache directory. Do this before importing
# transformers so that the library picks up the cache location.
set_env_path("HF_HOME", "huggingface")

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Download the FinBERT tokenizer and model into the cache
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

## NLTK
# Set the NLTK data directory
set_env_path("NLTK_DATA", "nltk_data")

import nltk

# Download the punkt tokenizer data. NLTK does not download
# anything if the data is already present in NLTK_DATA.
nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])
```
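
After the script has run, code executed in this code environment should see the environment variables it defines. The following is a minimal sketch for checking this from a notebook or recipe. The helper name `missing_resources` is hypothetical, and the check only assumes that `HF_HOME` and `NLTK_DATA` should each point to an existing directory, as set up by the script above.

```python
import os

# Environment variables defined by the initialization script.
RESOURCE_VARS = ("HF_HOME", "NLTK_DATA")


def missing_resources(env=None, names=RESOURCE_VARS):
    """Return a list of human-readable problems; an empty list means all is well."""
    env = os.environ if env is None else env
    problems = []
    for name in names:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


if __name__ == "__main__":
    for problem in missing_resources():
        print(problem)
```

If any problem is reported, re-running the resources initialization script for the code environment is a reasonable first step.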


## Plugins 

Generic documentation on how to install plugins on DSS can be found [here](https://doc.dataiku.com/dss/latest/plugins/installing.html). The following plugin is mandatory for this project: [Tesseract OCR Plugin](https://www.dataiku.com/product/plugins/tesseract-ocr/). Please note that Tesseract itself must also be installed on the server that is running DSS. For macOS users, it is recommended to install Tesseract through Homebrew (`brew install tesseract`).
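
To check that the Tesseract binary is reachable, the sketch below looks it up on the `PATH` and asks it for its version. This is an illustrative check only: the helper name `tesseract_version` is hypothetical, and the result is only meaningful when the code runs on the DSS server under the same user and `PATH` that DSS uses.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    # Some Tesseract releases print the version to stderr, so merge both streams.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tesseract_version()
    if version is None:
        print("tesseract was not found on PATH")
    else:
        print(version)
```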






