Orchestrating data for machine learning pipelines
By Otesanya David | March 25, 2022

Machine learning (ML) workloads require efficient infrastructure to yield rapid results. Model training relies heavily on large data sets, and funneling this data from storage to the training cluster is the first step of any ML workflow — one that significantly impacts the efficiency of model training.

Data and AI platform engineers have long grappled with these questions when managing data:

  • Data accessibility: How do we make training data accessible when it spans multiple sources and is stored remotely?
  • Data pipelining: How do we manage data as a pipeline that continuously feeds the training workflow without waiting?
  • Performance and GPU utilization: How do we achieve both low metadata latency and high data throughput to keep the GPUs busy?
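The pipelining question above boils down to a producer-consumer pattern: a background worker keeps fetching batches ahead of the training loop so the GPUs never stall on storage. The following is a minimal, framework-free sketch of that idea; all names (`prefetching_loader`, `buffer_size`) are illustrative, not part of any particular library.

```python
import queue
import threading

def prefetching_loader(batches, buffer_size=4):
    """Yield batches while a background thread fetches ahead into a bounded buffer."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()

    def producer():
        for batch in batches:
            q.put(batch)      # blocks when the buffer is full, bounding memory use
        q.put(sentinel)       # signal end of data

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item

# Usage: simulate batches arriving slowly from remote storage
batches = ({"step": i} for i in range(3))
for b in prefetching_loader(batches):
    print(b["step"])
```

Real training frameworks implement the same pattern with more machinery (multiple worker processes, pinned memory, prefetch depth tuning), but the core design choice is the same: overlap data fetching with computation behind a bounded buffer.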

This article discusses a new approach to orchestrating data for end-to-end machine learning pipelines that addresses the questions above. I will outline common challenges and pitfalls, then propose a technique, data orchestration, that optimizes the data pipeline for machine learning.

Common data challenges of model training

An end-to-end machine learning pipeline is a sequence of steps from data pre-processing and cleansing to model training to inference. Training is the most crucial and resource-intensive part of the entire workflow.
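The sequence of steps above can be sketched as a chain of stages, where each stage's output feeds the next. The functions below are toy stand-ins (a frequency counter in place of a real model) chosen only to show the shape of the pipeline, not any actual training code.

```python
from collections import Counter

def preprocess(raw_records):
    # Cleansing stage: drop empty records, normalize text
    return [r.strip().lower() for r in raw_records if r.strip()]

def train(examples):
    # Training stage (stand-in): "learn" token frequencies from the examples
    return Counter(word for ex in examples for word in ex.split())

def infer(model, text):
    # Inference stage (stand-in): score input by how often its tokens were seen
    return sum(model[word] for word in text.lower().split())

# The end-to-end pipeline: raw data -> preprocessing -> training -> inference
raw = ["Hello World ", "", "hello pipeline"]
model = train(preprocess(raw))
print(infer(model, "hello"))  # "hello" appeared twice in the training data
```

Because each stage consumes the previous stage's output, a slow or stalled stage starves everything downstream — which is why the training stage, the most resource-intensive one, dictates how the whole pipeline must be fed.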

The following diagram shows a typical ML pipeline. It begins with data collection, then data preparation, and finally model training. During the data collection phase, it usually takes data platform engineers a significant amount of time to make the data accessible to data engineers, who then prepare the data for data scientists to build and iterate on models.

[Diagram: a typical ML pipeline, from data collection through data preparation to model training. Image credit: Alluxio]

During the training phase, unprecedented volumes of data are processed to ensure a continuous feed of data to the GPUs that generate the models. The data must be managed in a way that supports the complexity of the ML workflow and the architecture that executes it. Each step in the data pipeline presents its own technical challenges.

Copyright © 2022 IDG Communications, Inc.

