# Setup with Hortonworks Data Platform

This reference architecture guides you through deploying the following on a DSS instance connected to your HDP cluster:

* The fundamental local code isolation layer

* Impersonation for accessing HDFS datasets

* Impersonation for running Spark code over Yarn

* Impersonation for accessing Hive and Impala

In the rest of this document:

* `dssuser` means the UNIX user which runs the DSS software

* `DATADIR` means the directory in which DSS is running

Contents:

* The two modes
* Prerequisites and required information
* Common setup
* Ranger-mode
    + Assumptions
    + Configure your cluster
        - With Ambari
        - Setup Ranger
        - Additional setup for encrypted HDFS filesystems
    + Setup HDFS connections in DSS
    + Configure identity mapping
    + Setup Hive access
    + Authorization models
        - One DSS connection per database
        - One database per DSS project, multiple databases per DSS connection
        - More complex setups
* DSS-ACL-synchronization-mode
    + Configure your cluster
        - With Ambari
        - Setup Ranger
        - Additional setup for encrypted HDFS filesystems
    + Configure identity mapping
    + Setup Hive access
    + Initialize ACLs on HDFS connections
* Validate behavior
* Operations (Ranger mode)
    + Overview
    + Adding a project
    + Adding/Removing a user in a group
    + Adding / Removing access to a group
    + Interaction with externally-managed data
        - Existing Hive table
        - Synchronized Hive table
* Operations (ACL synchronization mode)
    + Overview
    + Adding a project
    + Adding a user to a group
    + Removing a user from a group
    + Adding access to a group
    + Removing access from a group
    + Interaction with externally-managed data
        - Existing Hive table
        - Synchronized Hive table

## The two modes

There are two major ways to deploy UIF on HDP. The difference lies in how authorization is propagated to HDFS datasets:

* Using Ranger. In this mode, Ranger manages all authorization on HDFS data, both at the raw HDFS level and at the Hive level.

* Using “DSS-managed ACL synchronization”. In this mode, DSS places HDFS ACLs on the managed datasets that it builds. Note that you still need Ranger rules for Hive-level authorization.

**We recommend that you use Ranger in preference to DSS-managed ACLs**. Ranger lives in the NameNode and has more pervasive and flexible access, implying fewer limitations than DSS-managed ACLs. The three main advantages of using Ranger mode are:

* Authorization is centralized in Ranger, instead of requiring you to manage Ranger rules in addition to HDFS ACLs.

* For some deployments, it works around the limit on the number of HDFS ACLs (DSS-managed ACLs require a larger number of ACL entries per path, which can exceed the HDFS limit of 32 ACL entries per path).

* Appending to HDFS datasets as multiple users becomes possible.

## Prerequisites and required information

Please read the Prerequisites and limitations documentation carefully and check that you have all required information.

The most important parts here are:

* Having a keytab for the `dssuser`

* Having administrator access to the Ambari and Ranger interfaces

* Having root access to the local machine

* Having an initial list of end-user groups allowed to use the impersonation mechanisms.

## Common setup

Initialize UIF (including local code isolation) as described in Initial Setup.

## Ranger-mode

### Assumptions

In this model (as in the default DSS-managed-ACLs one), the security boundary is both the Hive database and an associated HDFS prefix.

There should be at least one Hive database per security tenant (i.e. per set of distinct authorization rules). Within a given Hive database, all tables (and thus all DSS datasets) have by default the same authorization rules as the database itself.

In this model, each Hive database maps to a base directory of the HDFS filesystem. All datasets within this database are stored into a subdirectory of this base directory.

Authorization rules are defined in Ranger (Hive) at the database level and in Ranger (HDFS) at the folder level.

DSS HDFS connections can be set up to map DSS projects to these security tenants in several ways, depending on the application constraints, in particular:

* one DSS connection per tenant

* several tenants per connection, multiple projects per security tenant

### Configure your cluster

Note

This part must be performed by the Hadoop administrator. A restart of your cluster may be required.

You now need to allow the `dssuser` user to impersonate all end-user groups that you have previously identified.

This is done by adding `hadoop.proxyuser.dssuser.groups` and `hadoop.proxyuser.dssuser.hosts` configuration keys to your Hadoop configuration (core-site.xml). These respectively specify the list of groups of users which DSS is allowed to impersonate, and the list of hosts from which DSS is allowed to impersonate these users.

The `hadoop.proxyuser.dssuser.groups` parameter should be set to a comma-separated list containing:

* A list of end-user groups which collectively contain all DSS users

* The group with which the `hive` user creates its files (generally: `hadoop` on HDP)

Alternatively, this parameter can be set to `*` to allow DSS to impersonate all cluster users (effectively disabling this extra security check).

The `hadoop.proxyuser.dssuser.hosts` parameter should be set to the fully-qualified host name of the server on which DSS is running. Alternatively, this parameter can be set to `*` to allow all hosts (effectively disabling this extra security check).

Make sure Hadoop configuration is properly propagated to all cluster hosts and to the host running DSS. Make sure that all relevant Hadoop services are properly restarted.
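
As a minimal sketch of the two properties described above, the following Python snippet renders them as `core-site.xml` property elements. The group names and the host name are hypothetical placeholders. On an Ambari-managed cluster you would normally enter these values in the Ambari UI rather than edit the XML by hand.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(dss_user, end_user_groups, hive_group, dss_host):
    """Return the two hadoop.proxyuser.* settings as a dict of key -> value."""
    # The group used by the hive user is appended to the end-user groups.
    groups = list(end_user_groups) + [hive_group]
    return {
        f"hadoop.proxyuser.{dss_user}.groups": ",".join(groups),
        f"hadoop.proxyuser.{dss_user}.hosts": dss_host,
    }


def to_core_site_xml(properties):
    """Render the properties as <property> elements under <configuration>."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
props = proxyuser_properties(
    dss_user="dssuser",
    end_user_groups=["analysts", "data_scientists"],
    hive_group="hadoop",
    dss_host="dss.example.com",
)
print(to_core_site_xml(props))
```

Listing the groups and the host explicitly keeps the extra security check active, whereas passing `*` for either value would disable it.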

#### With Ambari

(NB: This is given for informational purposes only. Please refer to the official Hortonworks documentation for your HDP version.)

* In Ambari, navigate to HDFS > Configs > Advanced, and search for “proxyuser”

* In “Custom core-site”, add two new properties:

    + Key: `hadoop.proxyuser.dssuser.groups`

    + Value: comma-separated list of Hadoop groups of your end users, plus `hadoop`

    + Key: `hadoop.proxyuser.dssuser.hosts`

    + Value: fully-qualified DSS host name, or `*`

* Save changes, enter a description

* On the “Restart required” warning that appears, click “Restart” and “Restart all affected”

#### Setup Ranger

* Create one or several root directories for DSS output directories.

* For each security tenant which you want DSS to use:

    + Create the database in HiveServer2:

        ```
        beeline> CREATE DATABASE <db_name> LOCATION 'hdfs://<namenode>/<path_to_dir>';
        ```

    + Grant access to the database in Ranger (Hive)

    + Grant access to the folder in Ranger (HDFS)
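
When there are several security tenants, the `CREATE DATABASE` statements can be generated rather than typed one by one. The following is a minimal sketch: the database names, NameNode address and directories are hypothetical, and the resulting statements would still be run by a Hive administrator in beeline.

```python
import re


def create_database_statement(db_name, namenode, path_to_dir):
    """Build the CREATE DATABASE statement for one security tenant."""
    # Only accept simple identifiers so that the name can be used unquoted.
    if not re.fullmatch(r"[A-Za-z0-9_]+", db_name):
        raise ValueError(f"unexpected database name: {db_name!r}")
    location = f"hdfs://{namenode}/{path_to_dir.lstrip('/')}"
    return f"CREATE DATABASE {db_name} LOCATION '{location}';"


# Hypothetical tenants: database name -> base directory on HDFS.
tenants = {
    "dss_finance": "/data/dss/finance",
    "dss_marketing": "/data/dss/marketing",
}

statements = [
    create_database_statement(db, "namenode.example.com:8020", path)
    for db, path in tenants.items()
]
for statement in statements:
    print(statement)
```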

#### Additional setup for encrypted HDFS filesystems

If DSS should access encrypted HDFS filesystems on behalf of users, you need to add specific Hadoop configuration keys to authorize impersonated access to the associated key management system (Hadoop KMS or Ranger KMS):

* `hadoop.kms.proxyuser.dssuser.groups` : comma-separated list of Hadoop groups of your end users

* `hadoop.kms.proxyuser.dssuser.hosts` : fully-qualified DSS host name, or `*`

### Setup HDFS connections in DSS

Configure DSS managed HDFS connection(s) so that:

* The Hive database for datasets maps to one of the databases defined above

* The HDFS paths for datasets map to the matching location for this database

* Management of HDFS ACLs by DSS is turned off (ACL synchronization mode: None)

### Configure identity mapping

If needed, go to Administration > Settings > Security and update identity mapping.

Note

Due to various issues notably related to Spark, we strongly recommend that your DSS users and Hadoop users have the same name.

### Setup Hive access

* Go to Administration > Settings > Hive

* Fill in the HiveServer2 host and principal if needed, as described in Connecting to secure clusters

* Fill in the “Hive user” setting with the name of the user running HiveServer2 (generally: `hive`)

* Switch “Default execution engine” to “HiveServer2”

### Authorization models

There are several possible deployments of the above model, depending on the desired authorization and management model:

#### One DSS connection per database

* Directly configure the database name in “Hive database”

* Add the DSS project key to the table names, as in: “Hive table name prefix” = `${projectKey}_`

* Root path URI: path to the database directory

* Path prefix: `${projectKey}/`

* Optionally, you can restrict access to this DSS connection to its authorized DSS users, so that other users do not see it at all

#### One database per DSS project, multiple databases per DSS connection

* Embed the DSS project key in the Hive database name, as in: “Hive database” = `dataiku_${projectKey}`

* Hive table name prefix can then be empty

* Root path URI must be a common parent to all database directories

* Embed the DSS project key in the HDFS path prefix, as in: “Path prefix” = `${projectKey}/`

* You need to create each database using the above command sequence from an admin account when creating a DSS project
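
To compare the two layouts above, the following sketch expands `${projectKey}` in the connection settings for a sample project. The tenant name, root paths, project key and dataset name are hypothetical, and the exact table and directory names are decided by DSS; the sketch only illustrates how the variable keeps projects apart.

```python
from string import Template


def expand(pattern, project_key):
    """Expand ${projectKey} in a connection setting."""
    return Template(pattern).substitute(projectKey=project_key)


def resolve(layout, project_key, dataset):
    """Return (database, table, HDFS path) for a dataset under a layout."""
    database = expand(layout["hive_database"], project_key)
    table = expand(layout["table_prefix"], project_key) + dataset
    path = (
        layout["root_path"].rstrip("/")
        + "/"
        + expand(layout["path_prefix"], project_key)
        + dataset
    )
    return database, table, path


# One DSS connection per database: the database name is fixed and the
# project key appears in the table name and in the path.
one_connection_per_database = {
    "hive_database": "tenant_a",
    "table_prefix": "${projectKey}_",
    "root_path": "/data/tenant_a",
    "path_prefix": "${projectKey}/",
}

# One database per DSS project: the project key appears in the database
# name, so the table prefix can stay empty.
one_database_per_project = {
    "hive_database": "dataiku_${projectKey}",
    "table_prefix": "",
    "root_path": "/data/dss",
    "path_prefix": "${projectKey}/",
}

print(resolve(one_connection_per_database, "SALES", "orders"))
print(resolve(one_database_per_project, "SALES", "orders"))
```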

#### More complex setups

More complex setups are possible using per-project variables, typically representing the security tenant to use for a given project, and expanding these variables in the database name or path prefix.

## DSS-ACL-synchronization-mode

Note

In most cases, we recommend that you use Ranger mode, as detailed above.

Warning

HDFS ACLs are not supported for Per-project single user permissions.

### Configure your cluster

Note

This part must be performed by the Hadoop administrator. A restart of your cluster may be required.

You now need to allow the `dssuser` user to impersonate all end-user groups that you have previously identified.

This is done by adding `hadoop.proxyuser.dssuser.groups` and `hadoop.proxyuser.dssuser.hosts` configuration keys to your Hadoop configuration (core-site.xml). These respectively specify the list of groups of users which DSS is allowed to impersonate, and the list of hosts from which DSS is allowed to impersonate these users.

The `hadoop.proxyuser.dssuser.groups` parameter should be set to a comma-separated list containing:

* A list of end-user groups which collectively contain all DSS users

* The group with which the `hive` user creates its files (generally: `hadoop` on HDP)

Alternatively, this parameter can be set to `*` to allow DSS to impersonate all cluster users (effectively disabling this extra security check).

The `hadoop.proxyuser.dssuser.hosts` parameter should be set to the fully-qualified host name of the server on which DSS is running. Alternatively, this parameter can be set to `*` to allow all hosts (effectively disabling this extra security check).

Make sure Hadoop configuration is properly propagated to all cluster hosts and to the host running DSS. Make sure that all relevant Hadoop services are properly restarted.

#### With Ambari

(NB: This is given for informational purposes only. Please refer to the official Hortonworks documentation for your HDP version.)

* In Ambari, navigate to HDFS > Configs > Advanced, and search for “proxyuser”

* In “Custom core-site”, add two new properties:

    + Key: `hadoop.proxyuser.dssuser.groups`

    + Value: comma-separated list of Hadoop groups of your end users, plus `hadoop`

    + Key: `hadoop.proxyuser.dssuser.hosts`

    + Value: fully-qualified DSS host name, or `*`

* Save changes, enter a description

* On the “Restart required” warning that appears, click “Restart” and “Restart all affected”

#### Setup Ranger

* Create one or several root directories for DSS output directories.

* For each security tenant which you want DSS to use:

    + Create the database in HiveServer2:

        ```
        beeline> CREATE DATABASE <db_name> LOCATION 'hdfs://<namenode>/<path_to_dir>';
        ```

    + Grant access to the database in Ranger (Hive)

#### Additional setup for encrypted HDFS filesystems

If DSS should access encrypted HDFS filesystems on behalf of users, you need to add specific Hadoop configuration keys to authorize impersonated access to the associated key management system (Hadoop KMS or Ranger KMS):

* `hadoop.kms.proxyuser.dssuser.groups` : comma-separated list of Hadoop groups of your end users

* `hadoop.kms.proxyuser.dssuser.hosts` : fully-qualified DSS host name, or `*`

### Configure identity mapping

If needed, go to Administration > Settings > Security and update identity mapping.

Note

Due to various issues notably related to Spark, we strongly recommend that your DSS users and Hadoop users have the same name.

### Setup Hive access

* Go to Administration > Settings > Hive

* Fill in the HiveServer2 host and principal if needed, as described in Connecting to secure clusters

* Fill in the “Hive user” setting with the name of the user running HiveServer2 (generally: `hive`)

* Switch “Default execution engine” to “HiveServer2”

### Initialize ACLs on HDFS connections

Go to the settings of the `hdfs_managed` connection and click on `Resync Root permissions`.

If you have other HDFS connections, do the same thing for them.

## Validate behavior

* Grant to at least one of your user groups the right to create projects

* Log in as an end user

* Create a project with key `PROJECTKEY`

* Perform the appropriate grants in Ranger

* As the end user in DSS, check that you can:

+ Create external HDFS datasets

+ Create prepare recipes writing to HDFS datasets

+ Synchronize datasets to the Hive metastore

+ Create Hive recipes to write new HDFS datasets

+ Use Hive notebooks

+ Create Python recipes

+ Use Python notebooks

+ Create Spark recipes

+ If you have Impala, create Impala recipes

+ If you have Impala, use Impala notebooks

+ Create visual recipes and use all available execution engines

## Operations (Ranger mode)

When you follow these setup instructions and use Ranger mode, DSS starts with a configuration that enables a per-project security policy with minimal administrator intervention.

### Overview

* The HDFS connections are declared as usable by all users.

* Each project writes to a different HDFS folder.

* Each project writes to a different Hive database.

* Ranger rules grant permissions on the folder and database.

The separation of folders and Hive databases for each project is ensured by the naming rules defined in the HDFS connection.

Note

This default configuration should be usable by all; we recommend that you keep it.

### Adding a project

In this setup, adding a project requires adding a Hive database and granting permissions on the database to the project’s groups.

* Create the project in DSS

* Add the groups who must have access to the project

By default, the new database is called `dataiku_PROJECTKEY` where `PROJECTKEY` is the key of the newly created project. You can configure this in the settings of each HDFS connection.

Then, on the cluster side:

* As the Hive administrator, using beeline or another Hive client, create the database

* As the Ranger administrator, perform the grants at both Hive and HDFS level
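
The two Ranger grants can also be prepared as data. The sketch below builds one Hive policy and one HDFS policy as Python dictionaries. It assumes the JSON policy model used by Ranger's public REST API (fields such as `service`, `resources` and `policyItems`). The service names, group name, path and access types are hypothetical, so check everything against the Ranger documentation for your HDP version before relying on it.

```python
import json


def policy_item(groups, access_types):
    """One policy item granting the given access types to the given groups."""
    return {
        "groups": list(groups),
        "accesses": [{"type": t, "isAllowed": True} for t in access_types],
    }


def hive_database_policy(service, database, groups, access_types):
    """Policy covering all tables and columns of one Hive database."""
    return {
        "service": service,
        "name": f"dss_{database}_hive",
        "resources": {
            "database": {"values": [database]},
            "table": {"values": ["*"]},
            "column": {"values": ["*"]},
        },
        "policyItems": [policy_item(groups, access_types)],
    }


def hdfs_path_policy(service, path, groups, access_types):
    """Policy covering one HDFS directory and everything below it."""
    return {
        "service": service,
        "name": f"dss_{path.strip('/').replace('/', '_')}_hdfs",
        "resources": {"path": {"values": [path], "isRecursive": True}},
        "policyItems": [policy_item(groups, access_types)],
    }


# Hypothetical project, group and Ranger service names.
project_key = "SALES"
groups = ["sales_analysts"]
hive_policy = hive_database_policy(
    "cl1_hive", f"dataiku_{project_key}", groups, ["select", "update", "create"]
)
hdfs_policy = hdfs_path_policy(
    "cl1_hadoop", f"/data/dss/{project_key}", groups, ["read", "write", "execute"]
)
print(json.dumps(hive_policy, indent=2))
print(json.dumps(hdfs_policy, indent=2))
```

Which access types to grant depends on your own security policy; the ones shown here are only placeholders.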

### Adding/Removing a user in a group

Grants are group-level, so no intervention is required when a user is added to or removed from a group.

### Adding / Removing access to a group

When you add or remove project access for a group, you need to:

* Do the permission change on the DSS project

* Do the permission changes in Ranger

### Interaction with externally-managed data

In the Ranger setup, DSS does not manage any ACLs. It is the administrator’s responsibility to ensure that read ACLs on these datasets are properly set.

#### Existing Hive table

If externally-managed data has an existing Hive table, and no synchronization to the Hive metastore, you need to ensure that Hive-level permissions (Ranger) allow access to all relevant groups.

#### Synchronized Hive table

Even on read-only external data, you can ask DSS to synchronize the definition to the Hive metastore. In that case, you need to ensure that the HDFS-level permissions allow the Hive (and, if applicable, Impala) users to access the folder.

## Operations (ACL synchronization mode)

First, remember that we recommend that you favor Ranger mode.

DSS starts with a configuration that enables a per-project security policy with minimal administrator intervention.

### Overview

* The HDFS connections are declared as usable by all users.

* Each project writes to a different HDFS folder.

* Each project writes to a different Hive database.

The separation of folders and Hive databases for each project is ensured by the naming rules defined in the HDFS connection.

Security is thus ensured in two ways:

* DSS automatically adds ACLs on the actual directories corresponding to datasets, which prevents users who are not in the project’s authorized groups from accessing the folder, even in user-controlled code.

* Access through Hive can be controlled using Ranger rules.

Note

This default configuration should be usable by all; we recommend that you keep it.

### Adding a project

In this setup, adding a project requires adding a Hive database and granting permissions on the database to the project’s groups.

* Create the project in DSS

* Add the groups who must have access to the project

By default, the new database is called `dataiku_PROJECTKEY` where `PROJECTKEY` is the key of the newly created project. You can configure this in the settings of each HDFS connection.

Then, on the cluster side:

* As the Hive administrator, using beeline or another Hive client, create the database

* As the Ranger administrator, perform the grants at Hive level

### Adding a user to a group

Read ACLs are group-level, so no intervention is required when a user is added to a group.

### Removing a user from a group

The removed user might still have a write ACL if they were the last to modify some datasets. You need to resynchronize the ACLs on all affected datasets in all projects where the user had access.

* Use the Authorization matrix to check where the user had access

* Remove the user

* For each affected project, go to Project > Settings > Config > Security and click “Resync ACLs”.
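
To see whether a removed user still holds such an entry, an administrator can inspect the output of `hdfs dfs -getfacl` on a dataset directory. The sketch below parses that textual output and lists the named users that still have write permission. The sample text, path and user name are made up. The sketch only reports: the resynchronization itself is still done with “Resync ACLs” in DSS.

```python
def named_user_entries(getfacl_output):
    """Return {user: permissions} for named-user access entries.

    Only entries of the form user:<name>:<perms> are kept. The owner
    entry (user::<perms>) and default ACL entries are ignored.
    """
    entries = {}
    for raw_line in getfacl_output.splitlines():
        # Drop comments such as "# file: ..." or a trailing "#effective:...".
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(":")
        if len(parts) == 3 and parts[0] == "user" and parts[1]:
            entries[parts[1]] = parts[2]
    return entries


def users_with_write_access(getfacl_output):
    """Names of users whose named ACL entry includes the write bit."""
    return sorted(
        user
        for user, perms in named_user_entries(getfacl_output).items()
        if "w" in perms
    )


# Made-up sample of what getfacl might print for a dataset directory.
sample = """\
# file: /data/dss/SALES/orders
# owner: hive
# group: hadoop
user::rwx
user:alice:rwx
group::r-x
group:sales_analysts:r-x
mask::rwx
other::---
"""

print(users_with_write_access(sample))
```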

### Adding access to a group

When you add project access to a group, you need to resynchronize the ACLs on the project’s datasets. This will ensure that the new group has access.

* Do the permission change on the DSS project

* Go to Project > Settings > Config > Security and click “Resync ACLs”.

In Ranger, add the permissions.

### Removing access from a group

When you remove project access from a group, you need to resynchronize the ACLs on the project’s datasets. This will ensure that the group loses existing access.

* Do the permission change on the DSS project

* Go to Project > Settings > Config > Security and click “Resync ACLs”.

In Ranger, remove the permissions.

### Interaction with externally-managed data

DSS only manages ACLs on the connections where managed datasets are written. DSS does not manage ACLs on “external” connections (this is controlled by the “Synchronize read ACL” and “Write ACL synchronization” settings in the HDFS connection).

It is the administrator’s responsibility to ensure that read ACLs on these datasets are properly set.

#### Existing Hive table

If externally-managed data has an existing Hive table, and no synchronization to the Hive metastore, you need to ensure that Hive-level permissions (Ranger) allow access to all relevant groups.

#### Synchronized Hive table

Even on read-only external data, you can ask DSS to synchronize the definition to the Hive metastore. In that case, you need to ensure that the HDFS-level permissions allow the Hive (and, if applicable, Impala) users to access the folder.
