# Overview

This project showcases the **graph analytics plugin**. It shows how the plugin can be used for feature engineering to **improve the performance of machine learning models**.
As an example, transactional **data from a peer-to-peer (P2P) payment platform is used**. The goal is to predict whether a user of the platform is a fraudster. To do so, we created four different models and compared their performance:
 - **A naive approach**: a model based on the P2P transactions, where the fraudster rate (i.e. the percentage of known fraudsters around a user) is computed and used as the single explanatory variable;
 - **A “traditional approach”**: a model based on features available in the dataset, such as the number of cards per user;
 - **A graph-based approach**: a model based on graph features (see the Walkthrough section);
 - **A combined approach**: all of the above combined in one model.
 
# Data
The data used was collected by Neo4j. It contains anonymized **P2P financial information** ([link to download](https://drive.google.com/drive/folders/1LaNFObKnZb1Ty8T7kPLCYlXDUlHU7FGa)).
The five input datasets are located in the [1. Input](flow_zone:default) Flow zone:
- The [user](dataset:user), [device](dataset:device), [card](dataset:card) and [IP](dataset:IP) datasets respectively contain information about users, devices, credit cards and IP addresses.
- The [relationships](dataset:relationships) dataset describes the edges between the nodes of the graph. **Five relationship types are present in this dataset**:
  1. P2P: transaction from one user to another;
  2. HAS_CC: edge between a user and his/her credit card;
  3. HAS_IP: edge between a user and his/her IP address;
  4. USED: edge between a user and his/her device;
  5. REFERRED: edge for referrals amongst users (ignored for this project).
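
As a rough illustration of how these relationship types are used downstream, the sketch below separates the P2P edges (used for the graph features) from the HAS_CC edges (used for a "traditional" feature such as the number of cards per user). The row layout `(source, target, type)`, the identifiers and the helper name `edges_of_type` are assumptions made for this example; they are not the actual schema of the [relationships](dataset:relationships) dataset.

```python
from collections import Counter

# Hypothetical rows of the relationships dataset: (source, target, relationship type).
# The column order and the identifiers are assumptions made for this sketch.
relationships = [
    ("u1", "u2", "P2P"),
    ("u2", "u3", "P2P"),
    ("u1", "c1", "HAS_CC"),
    ("u1", "c2", "HAS_CC"),
    ("u2", "c3", "HAS_CC"),
    ("u3", "ip1", "HAS_IP"),
    ("u3", "d1", "USED"),
    ("u1", "u3", "REFERRED"),
]


def edges_of_type(rows, rel_type):
    """Return the (source, target) pairs for one relationship type."""
    return [(src, dst) for src, dst, rel in rows if rel == rel_type]


# P2P edges feed the graph features; REFERRED edges are simply never selected.
p2p_edges = edges_of_type(relationships, "P2P")

# A "traditional" feature: number of distinct cards per user.
# set() drops repeated (user, card) pairs before counting.
cards_per_user = Counter(src for src, _ in set(edges_of_type(relationships, "HAS_CC")))
```

In the project itself, the equivalent work is done with visual recipes rather than code.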

# Walkthrough
![Overview.png](Wh2ULf07JO8v)


## Data preparation
The first step of this project is data preparation, which is carried out in two Flow zones:
 1. [2. Data preparation (other features)](flow_zone:XwAAkZc) for consolidating the graph database using the [join recipe](recipe:compute_relationships_joined) and the [groupby recipe](recipe:compute_relationships_joined_by_user_guid) to only keep one row per user;
 2. [3. Data preparation (Graph features)](flow_zone:K2gWumq) for feature engineering using the graph analytics plugin. The dataset is first **filtered to only keep the P2P relationships** using a [prepare recipe](recipe:compute_relationships_filtered). The graph analytics plugin is then used to generate two features:
    - **A clustering based on the relationships** is performed using the [graph clustering recipe](recipe:compute_graph_clustering). It accepts source and target columns as input and outputs the ID of the cluster each user belongs to. It is used to identify users who share many common edges (i.e. who make many transactions within the same cluster);
    <br>
    ![Graph features.png](y2sTRw7xN1kl)

    - **A [PageRank](https://en.wikipedia.org/wiki/PageRank) score** is computed using the [graph features recipe](recipe:compute_graph_features).
    <br>
    ![Graph clustering.png](VBO0yxlmR2ky)
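
To give an intuition of what a PageRank score measures, here is a minimal power-iteration sketch in plain Python. It is not the plugin's implementation: the function name `pagerank`, the damping factor of 0.85, the treatment of P2P transactions as directed edges and the toy edge list are all assumptions made for this example.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Approximate PageRank scores of a directed graph by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node shares its rank equally among the nodes it points to.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy P2P graph (hypothetical users): a and b both pay c, and c pays a.
scores = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
```

In this toy graph, user `c` receives payments from two users and ends up with the highest score, while `b`, who receives nothing, gets the lowest.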


The second step is splitting the data into train and test datasets in the [4. Modeling](flow_zone:cZMeMnr) Flow zone. As the naive approach requires computing the average number of fraudsters around a user of interest, it is important to split the train and test data beforehand to avoid data leakage.
The data is first randomly split between train and test using [a split recipe](recipe:split_relationships_joined_by_user_guid).
Both join recipes ([1](recipe:compute_train_joined), [2](recipe:compute_test_joined)) enrich the train and test datasets with information about the neighbors of each user of interest. The groupby recipes ([1](recipe:compute_train_joined_by_start_user_guid), [2](recipe:compute_test_joined_by_start_user_guid)) remove duplicates and compute the average number of fraudsters around a user of interest.
The data is then stacked using a [stack recipe](recipe:compute_data_stacked), with the origin feature indicating whether a user is part of the train or test set, and a [join recipe](recipe:compute_data_stacked_joined) enriches the dataset with the graph features previously generated.
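
The idea behind the naive feature and the leakage concern can be sketched as follows. This is only an illustration under stated assumptions: the function name `neighbor_fraud_rate`, the undirected treatment of transactions, the choice to ignore neighbors without a training label and the toy data are hypothetical and do not reproduce the exact logic of the join and groupby recipes.

```python
def neighbor_fraud_rate(edges, train_labels):
    """Share of known fraudsters among each user's P2P counterparties.

    Only labels found in `train_labels` are read, so test-set labels never
    leak into the feature. Neighbors without a training label are skipped.
    """
    neighbors = {}
    for src, dst in edges:
        # Assumption: a transaction links both users, whatever its direction.
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    rates = {}
    for user, others in neighbors.items():
        known = [train_labels[other] for other in others if other in train_labels]
        # Users with no labeled neighbor get a rate of 0.0 (arbitrary default).
        rates[user] = sum(known) / len(known) if known else 0.0
    return rates


# Toy data (hypothetical): 1 = fraudster, 0 = legitimate user.
p2p_edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u4"), ("u3", "u4")]
train_labels = {"u1": 0, "u2": 1, "u3": 0}  # u4 belongs to the test set
rates = neighbor_fraud_rate(p2p_edges, train_labels)
```

Here `u4` is a test user, yet its feature is computed from the training labels of `u2` and `u3` only, which is the property the split-before-join order is meant to guarantee.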

## Exploration
The graph can be explored in the chart section of [the train dataset](dataset:train_joined) and is displayed in a [dashboard](dashboard:dle1OBe).
<br>
![Graph vizualisation.png](0hCqgiEziFyw)


## Modeling
An [AutoML analysis](analysis:VTz371TV) is created to train a machine learning model to **predict whether a user is a fraudster**.
In the Design tab, the training parameters are adjusted. In particular:
 - In the Train/Test Set tab, explicit extracts from the dataset are used, which makes it possible to filter the input dataset into the train and test sets.
 - In the Metrics section, the AUC is chosen.
 - The Feature handling tab is used to add or remove features when training the four models.
 - The default algorithms are kept as is.
 
 **The four models are compared in a [model comparison dashboard](model_comparison:dsG43IMU).** 
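
For readers unfamiliar with the AUC metric used to compare the models, the sketch below computes it from its pairwise definition: the probability that a randomly chosen fraudster receives a higher score than a randomly chosen legitimate user. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; this quadratic-time version is meant for clarity, not for large datasets.

```python
def roc_auc(labels, scores):
    """AUC as the share of (fraudster, non-fraudster) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0  # pair ranked correctly
            elif pos_score == neg_score:
                wins += 0.5  # ties count for half
    return wins / (len(positives) * len(negatives))


# Toy predictions (hypothetical): 1 = fraudster, 0 = legitimate user.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
```

Of the four possible pairs in this example, three are ranked correctly, which gives an AUC of 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.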

## Conclusion

This project aimed to showcase the impact of enriching a dataset with graph analytics features, such as PageRank scores, on the performance of a machine learning model. The results show that this enrichment leads to a modest yet **noticeable improvement in the model's performance**. By exploiting the connections and structure present in the data through graph analytics, the predictive capacity of the models improves measurably.

# Next: Explore your own graphs

## Technical requirements
This project:

 - Leverages the [Graph Analytics plugin](https://www.dataiku.com/product/plugins/graph-analytics/) available starting from Dataiku 8.02.
 - Leverages features available starting from Dataiku 10 ([model comparisons](https://doc.dataiku.com/dss/latest/mlops/model-comparisons/index.html)).
 
## How to reuse this project

The project can be downloaded [here](https://downloads.dataiku.com/public/dss-samples/EX_GRAPHANALYTICS/).

Once you have imported the project, simply build the whole Flow (Flow actions > build all > build required dependencies) to explore the project in detail.

 
All the datasets are stored on the filesystem, so no remapping is needed. However, you can [change the connection type](https://knowledge.dataiku.com/latest/data-sourcing/connections/concept-connection-changes.html#connection-changes) if you want to rely on a specific data storage type.
If you want to leverage certain parts of the Flow directly, keep in mind that Dataiku allows you to reuse elements at different levels. In particular, it is possible to:

 - [duplicate a whole project](https://knowledge.dataiku.com/latest/getting-started/dataiku-ui/index.html)
 - [copy and paste entire subflows](https://knowledge.dataiku.com/latest/data-sourcing/connections/concept-connection-changes.html#copy-subflow)
 - [copy and paste recipes and remap input/output datasets](https://knowledge.dataiku.com/latest/data-preparation/visual-recipes/index.html)
 - [copy and paste preparation steps within a Prepare recipe](https://doc.dataiku.com/dss/latest/preparation/copy-steps.html)

It is also possible to directly integrate your datasets within the Flow by uploading them to the project and then changing the input/output of existing recipes. However, in this case, you will need to make sure that you propagate the schema (your own column names and storage types) properly (you will find an example here) and that you respect some constraints (in particular your dates should be in a parsed format). 

# Related Resources
 - [Graph analytics plugin documentation](https://www.dataiku.com/product/plugins/graph-analytics/)
 - [Neo4j blog post](https://neo4j.com/developer-blog/exploring-fraud-detection-neo4j-graph-data-science-summary/)
 - Learn more about [graph visualization](https://blog.dataiku.com/building-a-graph-visualization-tool)
 - [Dataiku solution on "Drug Repurposing Knowledge Graph"](https://www.dataiku.com/solutions/catalog/drug-repurposing/)
 
