Hyper-Converged Data Platform: Unification of pub-sub, compute and storage
Over the decade of the Big Data revolution, data platforms have evolved from batch-only systems to combined batch and real-time systems. The result is a set of data systems that are each specific to either batch or real-time processing; HDFS/MapReduce and Kafka/Storm are good examples. While a few efforts have been made to converge real-time and batch compute (most notably Apache Flink and Apache Spark), these have been piecemeal affairs that do not take the data storage architecture into account.
Meanwhile, efforts are being made to define a truly event-driven system: the so-called Kappa architecture, with a single source of truth (the log) and a single way to compute.
However, these efforts focus on fitting existing systems into this architecture. The result is an unoptimized architecture, primarily because of the legacy designs of those existing systems.
This talk explores the requirements of a converged, event-driven architecture. We see how the concept of stream storage becomes the fundamental building block of such an architecture. We then describe how a single compute platform can sit on top of this stream storage to form a converged data platform.
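As a rough illustration of the idea of one log and one way to compute, the sketch below uses an in-memory append-only list in place of stream storage and a single fold function for both historical replay and incremental processing. The names `StreamLog`, `apply_event`, and `run`, and the `(key, amount)` event shape, are assumptions made for this example; they do not come from the talk or from any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple

Event = Tuple[str, int]  # hypothetical event shape: (key, amount)


@dataclass
class StreamLog:
    """Append-only log standing in for stream storage (illustrative only)."""
    _events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read(self, from_offset: int = 0) -> Iterator[Event]:
        # Yield a snapshot of the events from the given offset onward.
        yield from self._events[from_offset:]


def apply_event(state: Dict[str, int], event: Event) -> Dict[str, int]:
    """The single compute step: keep a running sum per key."""
    key, amount = event
    state[key] = state.get(key, 0) + amount
    return state


def run(log: StreamLog,
        state: Optional[Dict[str, int]] = None,
        from_offset: int = 0) -> Tuple[Dict[str, int], int]:
    """Fold apply_event over the log; return the state and the next offset."""
    state = {} if state is None else dict(state)
    offset = from_offset
    for event in log.read(from_offset):
        state = apply_event(state, event)
        offset += 1
    return state, offset


log = StreamLog()
for e in [("a", 1), ("b", 2), ("a", 3)]:
    log.append(e)

# "Batch": replay the whole log from offset 0.
batch_state, next_offset = run(log)

# "Real-time": a new event arrives; continue from where the replay stopped.
log.append(("b", 5))
live_state, next_offset = run(log, batch_state, next_offset)

# A full replay with the same code reaches the same state.
replayed_state, _ = run(log)
```

In this sketch the only difference between batch and real-time processing is the starting offset, which is the property the Kappa-style design aims for.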