Announcing Public Preview of Volumes in Databricks Unity Catalog

Introducing the Public Preview of Volumes in Databricks Unity Catalog: More Power and Simplicity for Efficient Data Management

Introduction:

Welcome to the world of data management and AI at the Data and AI Summit 2023! We are thrilled to introduce a groundbreaking feature called Volumes in Databricks Unity Catalog. With Volumes, users can now explore, process, govern, and track lineage for any type of data, whether tabular or non-tabular: structured, semi-structured, or unstructured. And the best part? Volumes is now available in public preview on AWS, Azure, and GCP.

In this blog, we will delve into various use cases related to non-tabular data, detail the key capabilities of Volumes in Unity Catalog, showcase a practical example of Volumes in action, and provide step-by-step instructions on how to get started with Volumes.

The Databricks Lakehouse Platform is a powerhouse when it comes to storing and processing large volumes of data in different formats. While tables are commonly used for data governance, there are specific scenarios, such as machine learning and data science workloads, that require access to non-tabular data like text, image, audio, video, PDF, or XML files.

Some of the common use cases we have encountered include running machine learning on vast collections of unstructured data, storing and sharing data sets for model training and operational data, exploring and querying non-tabular data during the data science phase, and working with tools that do not directly support cloud object storage APIs.

Volumes in Unity Catalog allow you to build scalable and efficient file-based applications that process non-tabular data, regardless of its format, at cloud storage performance. You can now easily manage, discover, and govern non-tabular data alongside tabular data and models within Unity Catalog, providing a unified experience for data management and governance.

To provide a practical illustration of Volumes, let’s consider an example where we want to perform image classification using a dataset of cat and dog images. By leveraging Volumes, we can seamlessly integrate the dataset into Databricks for data science tasks. We will demonstrate the process of creating a Volume, granting access permissions, uploading the image files, extracting the archive, and finally, classifying the images using a machine learning model.

With powerful capabilities like governance, flexible storage configuration, cloud-scale processing, and user-friendly interfaces, Volumes in Unity Catalog revolutionize the way you work with non-tabular data. Say goodbye to the limitations of traditional data management systems and embrace the efficiency and scalability of Volumes.


Are you ready to take a deep dive into the world of non-tabular data management? Let’s get started with Volumes in Databricks Unity Catalog!

Full Article

Introducing Volumes in Databricks Unity Catalog at Data and AI Summit 2023

At the Data and AI Summit 2023, Databricks unveiled a new feature called Volumes in Unity Catalog. This feature allows users to explore, govern, process, and track lineage for non-tabular data. It supports a wide range of data types, including unstructured, semi-structured, and structured data, along with tabular data. The public preview of Volumes is now available on AWS, Azure, and GCP.

Use Cases for Non-Tabular Data Governance and Access

While Databricks Lakehouse Platform is primarily used for storing and processing data in tabular format, there are many scenarios that require access to non-tabular data for machine learning and data science workloads. Some common use cases include:

1. Running machine learning on large collections of unstructured data, such as images, audio, video, or PDF files.
2. Persisting and sharing training, test, and validation data sets used for model training.
3. Uploading and querying non-tabular data files during data exploration stages in data science.
4. Working with tools that don't natively support cloud object storage APIs and instead expect files on the local file system of cluster machines.
5. Storing and providing secure access to libraries, certificates, and configuration files of arbitrary formats before using them to configure cluster libraries or notebook dependencies.
6. Staging and pre-processing raw data files before loading them into tables in an ingestion pipeline.
7. Sharing large collections of files within or across workspaces, regions, clouds, and data platforms.

Overview of Volumes and How to Use Them

Volumes are a new object type in Unity Catalog that catalog collections of directories and files. They represent logical volumes of storage in a cloud object storage location and offer capabilities for accessing, storing, and managing data in any format. This includes structured, semi-structured, and unstructured data. Volumes provide a unified discovery and governance experience, allowing users to manage and track lineage for non-tabular data alongside tabular data and models in Unity Catalog.
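Files in a Volume are addressed through a path that mirrors the Unity Catalog three-level namespace: /Volumes/&lt;catalog&gt;/&lt;schema&gt;/&lt;volume&gt;/&lt;path&gt;. As a minimal sketch, the hypothetical helper below simply composes such a path; the catalog, schema, and volume names used in the demo call are illustrative, not from the article.

```python
# Hypothetical helper illustrating the Volume path convention:
# files in a Volume appear under /Volumes/<catalog>/<schema>/<volume>/<path>.
import posixpath

def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Compose the POSIX-style path under which a Volume's files appear."""
    return posixpath.join("/Volumes", catalog, schema, volume, *parts)

print(volume_path("main", "default", "cat_dog_images", "cats", "cat_01.jpg"))
# /Volumes/main/default/cat_dog_images/cats/cat_01.jpg
```

On Databricks, this path can be handed directly to ordinary file APIs, Spark readers, or shell commands.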

Practical Application Example: Image Classification

To demonstrate the practical application of Volumes, let’s consider an example of using machine learning for image classification. Suppose we have a dataset consisting of cat and dog images that we want to use for image classification. We can download these images to our local machine and then incorporate them into Databricks for data science purposes.

Using the Data Explorer user interface, we can create a new Volume within a Unity Catalog schema and upload the image files. We can also grant access permissions to our collaborators. Alternatively, we can use SQL commands in a notebook or the SQL editor to create our own Volume and manage its permissions.
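As a rough sketch of the SQL route, the hypothetical helpers below build the statements a notebook or the SQL editor would run; the catalog, schema, volume, and principal names are illustrative. On Databricks you would pass each statement to spark.sql(...), which is not executed here.

```python
# Illustrative SQL for creating a Volume and granting access to collaborators.
# All names (main, default, cat_dog_images, data-team) are placeholders.

def create_volume_sql(catalog: str, schema: str, volume: str) -> str:
    return f"CREATE VOLUME {catalog}.{schema}.{volume}"

def grant_volume_sql(catalog: str, schema: str, volume: str, principal: str) -> str:
    # READ VOLUME and WRITE VOLUME are the Volume-level privileges.
    return (f"GRANT READ VOLUME, WRITE VOLUME "
            f"ON VOLUME {catalog}.{schema}.{volume} TO `{principal}`")

print(create_volume_sql("main", "default", "cat_dog_images"))
print(grant_volume_sql("main", "default", "cat_dog_images", "data-team"))
```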


Once the images are uploaded to the Volume, we can extract the image archive using the unzip utility, specifying the path to our Volume. We can then access the images through a notebook and perform operations such as displaying them using libraries like PIL.
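The extraction step can also be done from Python with the standard-library zipfile module instead of the unzip utility. The sketch below is self-contained: a temporary archive stands in for a real /Volumes/... path, and the destination directory plays the role of the Volume.

```python
# Minimal sketch of extracting an uploaded archive; a temp directory stands in
# for a real Volume path such as /Volumes/main/default/cat_dog_images.
import os
import tempfile
import zipfile
from pathlib import Path

def extract_archive(archive: str, dest: str) -> list:
    """Extract `archive` into `dest` and return the archive's file names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()

# Self-contained demo: build a tiny archive, then unpack it.
tmp = tempfile.mkdtemp()
archive = os.path.join(tmp, "images.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("cats/cat_01.jpg", b"\xff\xd8fake-jpeg-bytes")
names = extract_archive(archive, os.path.join(tmp, "unpacked"))
print(names)  # ['cats/cat_01.jpg']
```

Once extracted under a Volume path, the images can be opened and displayed in a notebook with a library such as PIL, as the article describes.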

We can further classify the images using a pre-registered zero-shot image classification model in the MLflow Model Registry within Unity Catalog. We can load the model, perform the classification, and display the resulting predictions.
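A model registered in Unity Catalog is addressed by a three-level name. The hypothetical helper below only builds that model URI; on Databricks you would pass it to mlflow.pyfunc.load_model after pointing MLflow at the Unity Catalog registry with mlflow.set_registry_uri("databricks-uc"). The model name and version are placeholders.

```python
# Sketch of the model URI format for the MLflow Model Registry in Unity Catalog:
# models:/<catalog>.<schema>.<model>/<version>. Names below are illustrative.

def uc_model_uri(catalog: str, schema: str, model: str, version: int) -> str:
    return f"models:/{catalog}.{schema}.{model}/{version}"

print(uc_model_uri("main", "default", "zero_shot_classifier", 1))
# models:/main.default.zero_shot_classifier/1
```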

Key Capabilities of Volumes in Unity Catalog

Volumes in Unity Catalog offer several essential capabilities for managing non-tabular data:

1. Governance: Volumes are cataloged inside schemas alongside tables, models, and functions, following the core principles of the Unity Catalog object model. Data stewards can create Volumes, set permissions, and manage ownership.
2. Storage Configuration: Volumes can be configured as managed or external. Managed Volumes store files in the default storage location for the Unity Catalog schema, while external Volumes store files in an external storage location.
3. Cloud Storage Performance and Scale: Volumes are backed by cloud object storage, allowing high-traffic workloads and processing of large-scale data at cloud storage performance.
4. User Interface Integration: Volumes are seamlessly integrated across the Databricks Platform, providing a state-of-the-art user interface for managing Volume permissions, lifecycle, content, and more.
5. Lineage: Volumes support lineage tracking, enabling users to trace the data flow and dependencies for both tabular and non-tabular data.
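The two storage configurations above differ only in where the files live. As an illustration, the hypothetical helpers below build the corresponding SQL; the external LOCATION URI is a placeholder, not a real bucket.

```python
# Illustrative SQL for the two storage configurations of a Volume.

def managed_volume_sql(catalog: str, schema: str, volume: str) -> str:
    # Managed: files are stored in the schema's default storage location.
    return f"CREATE VOLUME {catalog}.{schema}.{volume}"

def external_volume_sql(catalog: str, schema: str, volume: str, location: str) -> str:
    # External: files are stored in an external location you specify.
    return f"CREATE EXTERNAL VOLUME {catalog}.{schema}.{volume} LOCATION '{location}'"

print(managed_volume_sql("main", "default", "raw_files"))
print(external_volume_sql("main", "default", "raw_files", "s3://example-bucket/raw"))
```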

Accessing and Working with Volumes

Volumes provide a dedicated path format that reflects the Unity Catalog hierarchy and defined permissions. This path can be used to reference files for various operations in Databricks, including Apache Spark, Spark SQL, Pandas, shell commands, library installs, operating system file utilities, and more.
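The practical payoff of this path format is that plain file APIs just work against it. The self-contained sketch below uses a temporary directory as a stand-in for a Volume path such as /Volumes/main/default/my_volume; on Databricks the same code would run unchanged against the real path.

```python
# Sketch: one Volume-style path used with ordinary Python file utilities.
# A temp directory stands in for /Volumes/main/default/my_volume (illustrative).
import csv
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for a Volume path
with open(os.path.join(root, "pets.csv"), "w", newline="") as f:
    csv.writer(f).writerows([["name", "kind"], ["Milo", "cat"], ["Rex", "dog"]])

# The same path works for directory listings and record-level reads.
print(sorted(os.listdir(root)))  # ['pets.csv']
with open(os.path.join(root, "pets.csv")) as f:
    rows = list(csv.reader(f))
print(rows[1])  # ['Milo', 'cat']
```

On Databricks, the identical pattern applies to Spark readers, Pandas, shell commands, and library installs pointed at a /Volumes path.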

Unlocking New Processing Capabilities with Volumes

Volumes provide an abstraction layer over cloud-specific APIs and Hadoop connectors, allowing users to process data managed by Unity Catalog using familiar tools and APIs. This simplifies data processing tasks and enhances productivity.

In conclusion, the introduction of Volumes in Databricks Unity Catalog at the Data and AI Summit 2023 brings exciting capabilities for managing, processing, and governing non-tabular data. With Volumes, users can unlock cloud storage performance and scale while enjoying a user-friendly interface and seamless integration with the Databricks Platform.

Summary

At the Data and AI Summit 2023, Databricks introduced Volumes in Unity Catalog, a feature that allows users to discover, govern, process, and track lineage for non-tabular data. The public preview of Volumes is now available on AWS, Azure, and GCP. This blog discusses common use cases for non-tabular data and provides an overview of the key capabilities of Volumes in Unity Catalog. It also showcases a working example of using Volumes for image classification and provides details on how to get started with Volumes. Volumes enable scalable file-based applications that can read and process large collections of non-tabular data, regardless of its format, at cloud-storage performance.


Frequently Asked Questions:

Q1: What is data science?

A1: Data science is an interdisciplinary field that combines statistical analysis, advanced computing techniques, and domain expertise to extract insights and knowledge from structured and unstructured data. It involves collecting, processing, and analyzing large volumes of data to uncover patterns, trends, and correlations that can be used to make informed business decisions or solve complex problems.

Q2: What are the key skills needed to become a data scientist?

A2: To excel in data science, proficiency in programming languages like Python or R is crucial as they are commonly used for data manipulation and analysis. Strong statistical and mathematical knowledge is essential for understanding and applying various analytical techniques. Additionally, skills in data visualization, machine learning, and problem-solving are highly valued in this field. Being able to communicate insights effectively is also important for presenting findings to stakeholders.

Q3: How does data science contribute to business decision-making?

A3: Data science plays a vital role in making informed business decisions by using data-driven insights. By analyzing large volumes of data, data scientists can identify trends, patterns, and correlations that may not be immediately apparent. These findings can help businesses optimize processes, enhance customer experiences, target marketing efforts effectively, predict trends, and make accurate forecasts. Data science allows companies to gain a competitive edge, improve efficiency, and drive growth.

Q4: What is the difference between data science and machine learning?

A4: While data science and machine learning are closely related, they are not interchangeable terms. Data science refers to the broader discipline that involves collecting, processing, and analyzing data to gain knowledge and insights. It encompasses various techniques and methodologies, including machine learning.

Machine learning, on the other hand, is a subset of data science that specifically focuses on developing algorithms and models that enable computers to learn from data and make predictions or decisions without being explicitly programmed. It is an application of artificial intelligence that uses statistical techniques to enable machines to automatically improve their performance over time.

Q5: How is data science used in different industries?

A5: Data science has found applications in various industries, transforming how businesses operate. In finance, data science is used for fraud detection, forecasting market trends, and optimizing investment strategies. In healthcare, it helps in diagnosing diseases, analyzing patient data, and predicting patient outcomes. Retail companies use data science for demand forecasting, customer segmentation, and recommendation systems. Furthermore, data science is utilized in transportation and logistics for predictive maintenance, route optimization, and managing supply chain operations. The widespread adoption of data science across sectors illustrates its significance and versatility in shaping modern businesses.