Databricks is a cloud-based software platform for data engineering, data science and machine learning. It provides a scalable environment for running high-performance data applications and supports large datasets and high-volume data processing.

Organisations use it to create, run and manage Apache Spark clusters in the cloud. It also provides collaboration tools built around shared notebooks, comparable to Jupyter Notebook and Apache Zeppelin (an open source web application that allows users to write interactive data analysis queries in languages such as SQL, Python, Scala and R).

The platform lets users run SQL queries against Spark SQL and Hive tables, as well as perform ETL operations on Databricks Delta (now Delta Lake), a storage layer that sits on top of cloud object storage such as Amazon S3 and supports transactional, high-performance reads and writes at scale.
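As a rough illustration of the SQL-plus-ETL workflow described above, the following Python sketch aggregates a table with Spark SQL and saves the result in the Delta format. The table name `raw.events`, the column `event_date` and both function names are hypothetical; `spark` is assumed to be an existing SparkSession, which Databricks notebooks predefine.

```python
def daily_totals_sql(source_table: str) -> str:
    """Build a Spark SQL query that counts events per day.

    The column name event_date is a hypothetical example.
    """
    return (
        "SELECT event_date, COUNT(*) AS events "
        f"FROM {source_table} "
        "GROUP BY event_date"
    )


def run_etl(spark, source_table: str, target_table: str) -> None:
    """Aggregate source_table with SQL and save the result as a Delta table.

    `spark` is an existing SparkSession (assumed to be provided by the caller).
    """
    totals = spark.sql(daily_totals_sql(source_table))
    # Writing in the "delta" format stores the result as a Delta table.
    totals.write.format("delta").mode("overwrite").saveAsTable(target_table)
```

In a notebook this would be called as `run_etl(spark, "raw.events", "analytics.daily_totals")`, with both table names replaced by real ones.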

The platform also allows users to run Apache Spark jobs in a distributed environment with support for multiple languages, including Scala, Java, Python and R. Jobs run on Databricks Runtime for Apache Spark, on clusters hosted in Google Cloud, AWS or Microsoft Azure.
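One way to picture what running a Spark job on a cluster involves is the job definition itself. The sketch below builds such a definition as a plain Python dictionary. The field names follow the general shape of a Databricks Jobs API create-job request as far as I understand it and should be checked against the current API reference; the runtime version, node type and file path are placeholder values, and `build_job_spec` is a hypothetical helper.

```python
import json


def build_job_spec(name: str, python_file: str, num_workers: int = 2) -> dict:
    """Describe a single-task Python job that runs on a new cluster."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                # The script that the cluster should execute.
                "spark_python_task": {"python_file": python_file},
                # Placeholder cluster settings; valid values depend on the
                # cloud provider (AWS, Azure or Google Cloud) and workspace.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",
                    "num_workers": num_workers,
                },
            }
        ],
    }


spec = build_job_spec("nightly-etl", "dbfs:/jobs/etl.py")
payload = json.dumps(spec, indent=2)  # JSON body that could be sent to the API
```

Keeping the definition as data makes it easy to review or version before it is submitted to a workspace.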

Databricks can be thought of in three main parts: the Databricks Unified Analytics Platform (DUAP), Databricks Streaming and notebooks. DUAP is a cloud-based data platform that provides easy access to Spark and integrates with other tools such as MongoDB, Amazon Redshift, Tableau and RStudio. It also includes an interactive analytics notebook called Databricks Notebook, which enables fast data exploration using SQL, Scala, Python and R.

Databricks Streaming, which builds on Spark's streaming APIs, allows users to process real-time streams from sources such as Apache Kafka, Apache Flume or HDFS. This means that, for example, you can send data from websites or sensors directly to a cluster for processing without having to worry about keeping multiple systems in sync.
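The sensor example can be sketched with Spark Structured Streaming, which is an assumption about the streaming API in use rather than something the text above specifies. The sketch reads events from a Kafka topic and appends them to a table; the function names, topic, broker address, table name and checkpoint directory are all hypothetical, `spark` is assumed to be an existing SparkSession, and the Kafka connector must be available on the cluster.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source: where to connect and what to read."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",  # only read events that arrive from now on
    }


def start_sensor_stream(spark, bootstrap_servers, topic, target_table, checkpoint_dir):
    """Start a streaming query that copies Kafka events into a table."""
    events = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
        # Kafka delivers values as binary; cast the payload to text.
        .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    )
    return (
        events.writeStream.format("delta")
        # The checkpoint lets the query resume where it stopped after a restart.
        .option("checkpointLocation", checkpoint_dir)
        .toTable(target_table)
    )
```

The checkpoint directory is what spares the user from tracking offsets by hand: the query records its own progress there.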