# Data Source Requirements

The solution requires the following input datasets. Please **read carefully**, as several features must be prepared with the exact schema and column names specified below.

[Transactions_input](dataset:Transactions_input): This dataset should contain **weekly** product quantity sales over time (preferably covering a year) for individual HCP accounts, in the following format:
 - `account_id` (_string_) : ID of the account provider/holder (HCP)
 - `product_id` (_string_) : ID of the product (brand)
 - `date` (_date_) : timestamp when the transaction was placed
 - `product_quantity` (_int_) : number of product units sold

 **!!**  Users should prepare the `date` column in the  **date format ```MM/DD/YY```**.
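As a minimal sketch of the date preparation, assuming the raw dates arrive as ISO strings (`YYYY-MM-DD`), the conversion could look like the following. The helper `to_mmddyy` and the sample rows are hypothetical and only illustrate the target format; they are not part of the solution.

```python
from datetime import datetime

def to_mmddyy(value: str, source_format: str = "%Y-%m-%d") -> str:
    """Reformat a date string into MM/DD/YY (assumes ISO input by default)."""
    return datetime.strptime(value, source_format).strftime("%m/%d/%y")

# Hypothetical raw rows; the column names follow the Transactions_input schema.
raw_rows = [
    {"account_id": "A001", "product_id": "P01", "date": "2023-01-02", "product_quantity": 12},
    {"account_id": "A002", "product_id": "P01", "date": "2023-01-09", "product_quantity": 5},
]

# Copy each row and overwrite only the date column with the reformatted value.
prepared_rows = [{**row, "date": to_mmddyy(row["date"])} for row in raw_rows]
```

If the source data uses another layout, pass the matching `source_format` string instead of the default.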

![transactions.png](YRbqi5Jk0r6o)

[Product_input](dataset:Product_input): A lookup between `product_id`, the market `brand_name` of a drug and its `unit_price`. The dataset should contain:
 - `product_id` (_string_) : ID of the product (brand)
 - `brand_name` (_string_) : market drug name
 - `unit_price` (_double_) : market price for an individual unit

![product_image copy.png](dCYsEm8c8nU1)

[Providers_input](dataset:Providers_input): The providers table is unique at the level of the individual health care provider (variable `account_id`) within a given hospital or clinic (variable `parent_account_id`). Each record holds information on the HCP such as main specialty, tenure, email preferences and training date. These records provide insight into the specific practitioners to whom outreach is directed, along with some basic information about the hospital in which they work. The dataset should contain the following columns:

 - `parent_account_id` (_string_ ) : ID of the parent account
 - `parent_account_type` (_string_ ) : Type of the parent account  (hospital, clinic, private practice etc)
 - `account_id` (_string_ ) : ID of the account provider/holder (HCP)
 - `account_specialty` (_string_ ) : Main HCP specialty
 - `email_preferences` (_string_) : Categorical feature (opt-in or opt-out). It can also be replaced with a flag indicating whether or not you have the email or contact information of the account.
 - `account_tenure` (_double_) : Duration for which the account has been active. The user decides how to derive this value (from first communication or first purchase) and which unit (days, months, years) is used when interpreting it in the insights.
 
 ** :bangbang: **  In this dataset,  **users can add as many additional characteristic columns as are available in their own data**  (categorical or numerical). These columns will be used to analyze the different personas and their relation to brand adoption (preferences).  **Examples** : gender, age, other contact preferences, segmentation category.
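As one possible way to derive `account_tenure`, the sketch below measures tenure in days from a first purchase date. The first-purchase dates, the reference date `as_of` and the helper `tenure_in_days` are all assumptions made for illustration; the solution leaves both the starting event and the unit up to the user.

```python
from datetime import date

# Hypothetical first-purchase dates per account (not part of the required schema).
first_purchase = {"A001": date(2021, 3, 15), "A002": date(2022, 11, 1)}

# Hypothetical reference date at which tenure is measured.
as_of = date(2023, 1, 1)

def tenure_in_days(first_event: date, reference: date) -> float:
    """Number of days between the first event and the reference date, as a double."""
    return float((reference - first_event).days)

# One tenure value per account_id, ready to be written to the account_tenure column.
account_tenure = {
    account_id: tenure_in_days(first_date, as_of)
    for account_id, first_date in first_purchase.items()
}
```

Whatever unit is chosen, it should be applied consistently to every account so that the values stay comparable.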

![providers_image.png](eB67Sllp38Vm)


[Omnichannel_input](dataset:Omnichannel_input): This dataset contains all the marketing outreach to an HCP on a given date, over a period of time matching the transactions period. These data usually include web log analytics, email click-through rates and other in-person or digital interactions. Required variables are:
 - `account_id` (_string_) : ID of the account provider/holder (HCP)
 - `product_id` (_string_) : ID of the product (brand)
 - `campaign_id` (_string_) : if this is not available, fill it with a single value. It is only used in some visualisation metrics and, if the user wishes to modify the project, it can provide more granularity when filtering.
 - `date` (_date_) : timestamp when the outreach took place

 ** :bangbang: **  In this dataset,  **users can add as many marketing communication channels as are available in their data**, in binary format.  **Examples** : promotions, in-office and training events, social media ads.
 
 ### Instructions:
 1. Users should prepare the  **date format ```MM/DD/YY```**  so that it aligns with the transaction data.
 2. You should include at least a few characteristics; otherwise, none of the analyses in this project can be executed.
 3. ALL characteristics should be in **binary 0/1 format**, indicating whether the communication occurred in a given week. If you have categorical features, you should preprocess them using dummy encoding or other binarization methods. This is necessary because the data are later aggregated both by week and by HCP account.
 4. Every column name should end with one  **suffix**  ``` _attempt , _success, avg_time_min``` in this exact format (lowercase letters and underscore in front).  **Attempt**  shows the marketing effort (e.g. emails or call invites sent over a week),  **success**  indicates the user response (e.g. emails opened, webcalls attended or website traffic activities) and  **avg_time_min**  is an estimate of the interaction time in minutes, when relevant. These metrics enable consistent visualizations and can be used in an extension project (a further Dataiku solution) for channel affinity and quantification of each channel's value.
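The naming rule in point 4 can be checked before loading the data. The sketch below is a hypothetical pre-check, not part of the solution: the function `invalid_channel_columns` and the constant names are assumptions, and the suffix list is copied verbatim from the instruction above.

```python
# Required key columns of Omnichannel_input, which are exempt from the suffix rule.
KEY_COLUMNS = {"account_id", "product_id", "campaign_id", "date"}

# Suffixes copied verbatim from the instructions; matching is case-sensitive.
ALLOWED_SUFFIXES = ("_attempt", "_success", "avg_time_min")

def invalid_channel_columns(columns):
    """Return the channel columns that do not end with an allowed suffix."""
    return [
        name for name in columns
        if name not in KEY_COLUMNS and not name.endswith(ALLOWED_SUFFIXES)
    ]

# Hypothetical header of a prepared dataset.
header = ["account_id", "product_id", "campaign_id", "date",
          "email_attempt", "Email_Success", "webcall_avg_time_min", "promo"]
problems = invalid_channel_columns(header)
```

Here `problems` lists the columns that would need renaming, such as a column with uppercase letters in its suffix or one with no suffix at all.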
 
 Examples can be seen in the image below:

![channel_image.png](llDsIm1snDcm)
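To make the binarization in point 3 concrete, here is a minimal sketch that turns a raw interaction log into one row per account, product, campaign and week, with 0/1 flags per channel. The log layout (`channel`, `outcome`) and the choice of Monday as the start of the week are assumptions made for this example only.

```python
from datetime import date, timedelta

# Hypothetical raw outreach log: one row per interaction.
events = [
    {"account_id": "A001", "product_id": "P01", "campaign_id": "C1",
     "date": date(2023, 1, 3), "channel": "email", "outcome": "attempt"},
    {"account_id": "A001", "product_id": "P01", "campaign_id": "C1",
     "date": date(2023, 1, 5), "channel": "email", "outcome": "success"},
    {"account_id": "A001", "product_id": "P01", "campaign_id": "C1",
     "date": date(2023, 1, 10), "channel": "webcall", "outcome": "attempt"},
]

def week_start(day: date) -> date:
    """Monday of the week containing `day` (assumed week definition)."""
    return day - timedelta(days=day.weekday())

# Dummy-encode channel/outcome pairs into column names such as email_attempt.
flag_columns = sorted({f"{e['channel']}_{e['outcome']}" for e in events})

weekly = {}
for e in events:
    key = (e["account_id"], e["product_id"], e["campaign_id"], week_start(e["date"]))
    # Start every week with all flags at 0, then switch on the observed ones.
    flags = weekly.setdefault(key, dict.fromkeys(flag_columns, 0))
    flags[f"{e['channel']}_{e['outcome']}"] = 1

# Flatten to rows, writing the week date in MM/DD/YY.
omnichannel_rows = [
    {"account_id": a, "product_id": p, "campaign_id": c,
     "date": w.strftime("%m/%d/%y"), **flags}
    for (a, p, c, w), flags in sorted(weekly.items())
]
```

Setting a flag to 1 rather than counting events keeps the columns binary regardless of how many interactions fall in the same week.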

[to_score_input_c](dataset:to_score_input_c): This dataset is the test set for the brand adoption modeling session. It should contain ALL the features you select to activate in the Dataiku application for the brand adoption training dataset and model. If any features are missing, a check scenario running in the background will raise an error.

See an example below: 
![to_score_image.png](3PkrUkDnf1DB)
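Before running the application, a quick local check can reveal which selected features are absent from the scoring dataset. This is a hypothetical sketch and does not reproduce the solution's own check scenario; the function `missing_features` and the sample column names are assumptions.

```python
def missing_features(selected_features, to_score_columns):
    """Return the selected features that are absent from the scoring dataset, in order."""
    available = set(to_score_columns)
    return [name for name in selected_features if name not in available]

# Hypothetical feature selection and scoring-dataset header.
selected = ["account_specialty", "account_tenure", "email_attempt", "email_success"]
to_score_header = ["account_id", "account_specialty", "account_tenure", "email_attempt"]

absent = missing_features(selected, to_score_header)
if absent:
    print("Missing from to_score_input_c:", ", ".join(absent))
```

An empty result means every selected feature has a matching column in the scoring dataset.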

