# Connecting to data

The first task in Data Science Studio (DSS) is to define *datasets* that connect to your data sources.

A dataset is a series of records that share the same schema, much like a table in a SQL database.

For a broader explanation of the different kinds of datasets, see the DSS concepts page.
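The idea of a dataset as "records with the same schema" can be sketched in plain Python. This is only an illustration and does not use the DSS API: the column names, the `schema` list, and the `conforms` helper are all hypothetical.

```python
# Hypothetical illustration (not the DSS API): a dataset modeled as a list of
# records that all share one schema, much like rows in a SQL table.

# The schema: an ordered list of (column name, column type) pairs.
schema = [("order_id", int), ("customer", str), ("amount", float)]

# The records: each one carries exactly the columns named in the schema.
records = [
    {"order_id": 1, "customer": "alice", "amount": 19.99},
    {"order_id": 2, "customer": "bob", "amount": 5.00},
]


def conforms(record, schema):
    """Return True if the record has exactly the schema's columns and types."""
    if set(record) != {name for name, _ in schema}:
        return False
    return all(isinstance(record[name], col_type) for name, col_type in schema)


# Every record conforms, so this collection behaves like a single table.
all_conform = all(conforms(record, schema) for record in records)
```

A record with a missing column or a mismatched type would fail `conforms`, which is the property that separates a dataset from an arbitrary collection of documents.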

* Supported connections
  + Connectors
  + File formats
    - Standard formats
    - Hadoop/Spark specific formats

* SQL databases
  + Introduction
    - Supported databases
      * Full support
      * Tier 2 support
      * Other databases
    - Defining a connection
    - Advanced connection settings
      * Advanced JDBC properties
      * Custom JDBC URL
      * Fetch size
      * Truncate to clear data
      * Naming rules
  + Snowflake
    - Connection setup (Dataiku Custom or Dataiku Cloud Stacks)
    - Connection setup (Dataiku Online)
      * To set up the connection with global credentials
      * To set up the connection with per-user credentials
    - Authenticate using OAuth2
      * Common errors
    - Writing data into Snowflake
      * Requirements on the cloud storage connection
      * Explicit sync from cloud
    - Unloading data from Snowflake to Cloud
    - Extended push-down
    - Spark native integration
    - Snowpark integration
    - Switching Role and Warehouse
      * How to set it up
    - Limitations and known issues
      * Visual recipes
      * Coding recipes
      * Spark native integration
      * Breaking changes
    - Advanced install of the JDBC driver
      * Spark integration
  + Azure Synapse
    - Installing the JDBC driver
    - Write into Azure Synapse
      * Explicit sync from Azure Blob Storage
    - Unload data from Synapse to Azure Blob
    - Login using OAuth
      * Login as a single account
      * Login with per-user OAuth tokens
      * Common errors
  + Google BigQuery
    - Supported and unsupported features
    - The two drivers
    - Installing the JDBC driver
      * Built-in driver
      * Google-provided driver
    - Connecting to BigQuery
      * Using Service Account
      * Using OAuth2
      * Advanced setup (if using the Google-provided driver)
    - Writing data into BigQuery
      * Explicit sync from GCS
    - BigQuery native partitioning and clustering
  + Amazon Redshift
    - Setting up (Dataiku Custom or Dataiku Cloud Stacks)
      * Selecting the JDBC driver
      * Installing the dedicated driver
    - Setting up (Dataiku Online)
      * Limitations
    - Writing data into Redshift
      * Constraints on the S3 connection
      * Explicit sync from S3
      * Technical details about implementation
    - Unloading data from Redshift to S3
    - Reading external tables
    - Controlling distribution and sort clauses
    - Limitations
  + PostgreSQL
    - Installing the JDBC driver
    - Secure connections (SSL / TLS) support
      * Setup with certificate validation (recommended)
        + Importing the server certificate
        + Setting up the PostgreSQL connection
      * Setup without certificate validation (not recommended)
      * PostGIS integration
  + MySQL
    - Caveats
    - Installing the driver
    - Secure connections (SSL / TLS) support
      * Importing the server certificate and creating the client certificate
      * Setting up the MySQL connection
  + Microsoft SQL Server
    - Installing the JDBC driver
    - Azure SQL Data Warehouse / Synapse support
    - Kerberos authentication
    - User impersonation with Kerberos
    - Login using OAuth on Azure SQL Server
      * Login as a single account
      * Login with per-user OAuth tokens
      * Common errors
  + Oracle
    - Installing the JDBC driver
    - Advanced connection properties
      * Connect using Service Name
      * Kerberos authentication
      * User impersonation
  + Teradata
    - Installing the JDBC driver
    - Connecting using TD2 (default) authentication
      * Using per-user-credentials with TD2 authentication
    - Connecting using LDAP authentication
      * Using per-user-credentials with LDAP authentication
    - Connecting using Kerberos authentication
      * Using per-user-credentials with Kerberos authentication
    - Impersonation
      * Prerequisites
      * Setup (same DSS / Teradata users)
      * Setup (different users)
    - Controlling the primary index
    - Tracing additional query information
    - Limitations
      * Personal Connections
      * In-database charts
      * Sort recipe
      * Split recipe
      * Parallel build of partitioned datasets
    - Fast sync using TDCH
    - Notes
  + Pivotal Greenplum
    - Installing the JDBC driver
    - Controlling distribution
    - Setting distribute and sort clauses
    - Secure connections
  + Google AlloyDB
    - Installing the JDBC driver
    - Secure connections (SSL / TLS) support
      * Setup with certificate validation (recommended)
        + Importing the server certificate
        + Setting up the AlloyDB for PostgreSQL connection
      * Setup without certificate validation (not recommended)
      * PostGIS integration
  + AWS Athena
    - Supported
    - Not supported
    - Installing the JDBC driver
    - Connecting to Athena
  + Vertica
    - Installing the JDBC driver
    - Timezones support
  + SAP HANA
    - Caveats
  + IBM Netezza
    - Caveats
  + Exasol
    - Limitations
  + IBM DB2
    - Installing the JDBC driver
    - Creating a DB2 connection
    - Creating DB2 datasets
  + kdb+
    - Installing support
    - Creating a kdb+ connection

* Amazon S3
  + Create an S3 connection
    - Required S3 permissions
    - Transfer ownership to the bucket owner
  + Creating S3 datasets
  + Connections path handling
  + Location of managed datasets and folders
    - For a “free selection” connection
    - For a “path restriction” connection
  + Server-side encryption of files
    - Encryption Mode

* Azure Blob Storage
  + Creating an Azure connection
  + Connecting to Azure using OAuth2
    - Access using a single service account
    - Access using per-user OAuth tokens
    - Common errors
  + Creating Azure Blob Storage datasets
  + Connections path handling
  + Location of managed datasets and folders
    - For a “free selection” connection
    - For a “path restriction” connection

* Google Cloud Storage
  + Create a GCS connection
    - Using Service Account
    - Using OAuth2
  + Creating GCS datasets
  + Connections path handling
  + Location of managed datasets and folders
    - For a “free selection” connection
    - For a “path restriction” connection

* Upload your files
  + Storage location
  + Size limitations

* HDFS
  + Compatible filesystems

* Cassandra
  + Requirements
  + Configuring Cassandra cluster connections
  + Configuring Cassandra datasets
    - Data Science Studio managed datasets
    - External datasets
  + Dataset configuration parameters
  + Dataset and table schemas
  + Restrictions and caveats
    - Writing to external datasets
    - External partitioned datasets
    - Cassandra v1.2 compatibility

* MongoDB
  + Setting up the MongoDB connection

* Elasticsearch
  + Define an Elasticsearch connection
  + Managed Elasticsearch datasets
  + External Elasticsearch datasets
    - Partitioning
      * Field-based
      * Indices-based
  + Search view

* File formats
  + Delimiter-separated values (CSV / TSV)
    - Quoting and escaping styles
      * Excel-style
        + Example
      * Unix-style
        + Example
      * Escaping only
        + Example
      * No escaping, no quoting
    - Usage in datasets
    - Usage in recipes
  + Fixed width
  + Parquet
    - Requirements
    - Applicability
    - Limitations and issues
      * Case-sensitivity
      * Related to Hive
      * Related to Impala
      * Misc
  + Avro
    - Applicability
      * Reading Avro files
      * Reading Avro files / multiple versions
      * Writing Avro files
  + Hive ORCFile
    - Compatibility
    - Limitations
  + XML
    - Handling the structure
      * Selection of the data to load
      * JSON representation
        + Example
    - Using XPath to select data
      * Limitations
      * Selecting values explicitly
        + Example
  + JSON
    - Example
  + Excel
  + ESRI Shapefiles
    - Vecmath library
  + Delta Lake

* Managed folders
  + Creating a managed folder
  + Using a managed folder
    - Merge Folder Recipe
  + Local vs non-local
  + Usage in Python
  + Usage in R
  + Usage of a folder as a dataset
  + Clearing

* “Files in folder” dataset

* Metrics dataset

* Internal stats dataset

* “Editable” dataset

* kdb+
  + Installing support
  + Creating a kdb+ connection

* FTP
  + Creating an FTP connection
    - FTP connection parameters
  + Creating FTP datasets
  + Use the FTP dataset for writing

* SCP / SFTP (aka SSH)
  + Defining the SSH connection
    - SSH connection parameters
  + Creating SCP or SFTP datasets

* HTTP
  + Creating an HTTP dataset
    - Remote URL definition
  + Partitioned HTTP dataset
    - Example

* HTTP (with cache)

* Server filesystem
  + Filesystem connection
  + Create a filesystem dataset

* Dataset plugins

* Making relocatable managed datasets
  + Relocation of SQL datasets
  + Relocation of HDFS datasets

* Clearing non-managed Datasets

* Data ordering
  + Write ordering
  + Read-time ordering

* PI System / PIWebAPI server
  + Set up the authentication preset
  + Set up authentication per user
  + Attribute search Dataset
  + Event frames search Dataset
  + PIWebAPI Toolbox Dataset
  + Assets metrics downloader Recipe
  + Event frames downloader Recipe
  + Transpose & Synchronize Recipe
  + Advanced parameters
  + Data types
  + Time formats
