Friday, 16 March 2018

What’s new in Apache Spark? Low-latency streaming and Kubernetes

You’d be forgiven for passing by the announcement of Apache Spark 2.3. After all, it’s a point release, isn’t it? Sure, there will be some bug fixes, maybe an improvement or two to the MLLib framework, maybe an extra operator or something, but nothing all that major. That will be saved for Apache Spark 3.0, surely?

In fact, this is no mere point release. Apache Spark 2.3 ships with two major new features, one of which is perhaps the biggest (and often-requested) change to streaming operations since Spark Streaming was added to the project. The other is native integration with Kubernetes to execute Spark jobs in container clusters.

Apache Spark on Kubernetes

For several years now, Apache Spark has offered a trio of cluster deployment options: standalone mode, Apache Mesos, and Apache Hadoop YARN. In practice, this has meant that most enterprises find themselves running Apache Spark on YARN, which means running an Apache Hadoop stack.

Meanwhile, the past couple of years have brought the astonishing rise of Kubernetes, the open source container orchestration system. Drawing on Google’s 15 years of deploying applications at huge scale, Kubernetes is seeing rapid adopted across the industry. With Apache Spark 2.3, Kubernetes support is finally bundled into the project, albeit behind an “experimental” label. You will be able to run your Spark applications as pods and have them managed, monitored, and maintained by Kubernetes.

Kubernetes support in Apache Spark is likely to be game-changing as it moves out of experimental status. Apache Hadoop deployments can be temperamental beasts. Many enterprises will eliminate the need for YARN entirely by moving their Apache Spark deployments onto Kubernetes, either on-premises or in the cloud. One fewer system for our infrastructure teams to care for!

Also, if they are running their clusters in the cloud, they may be able to replace HDFS with a managed storage system such as Amazon S3, Azure Data Lake, or Google Cloud Storage, once again bringing cheers from the infrastructure team. It also raises questions about what the future holds for Apache Hadoop in this new container-based world. Now that many of the features we relied on Hadoop for in the past are being provided by Kubernetes, many of us will finally be able to leave Hadoop behind. 

Continuous Processing in Spark Structured Streaming

For as long as I’ve used Apache Spark, there has been a cloud hanging over Spark Streaming, in that the micro-batching approach that it uses for processing data means that it cannot guarantee low-latency responses (I discussed this here two years ago!).

For the vast majority of applications, this isn’t a problem, but when you really need that low-latency response (for example, ad-bidding platforms or bot fraud detection), you have to turn to something else, perhaps Apache Storm or Apache Flink, that can make these guarantees.

When Structured Streaming arrived in Apache Spark two years ago, people noticed that the micro-batching architecture of Spark Streaming was abstracted away from the new API. If the micro-batches could be hidden, then perhaps they could be swapped out for a different system?

The answer to that question has turned out to be a resounding yes. Apache Spark 2.3 brings Continuous Processing to Structured Streaming, which can give you low-latency responses on the order of around 1ms, instead of the 100ms you’re likely to get with micro-batching.

While this mode currently has “experimental” stamped all over it, and you can’t currently use certain Structured Streaming features such as SQL aggregations, I know that a few companies are already using it in production.

If nothing else, Continuous Processing will remove one of the big reasons companies choose Apache Storm, Apache Flink, or Apache Apex over Apache Spark, and shows that the developers behind the Spark project are still bringing major new features to the table. Death to micro-batches!

Faster Python on Spark

It almost goes without saying that Apache Spark 2.3 brings a bunch of needed bug fixes and minor improvements to the platform. Most of these aren’t so exciting to talk about, but one area worth mentioning is the ongoing work to improve the performance of Python.

As one of the leading languages used by data scientists, PySpark is a popular way of writing Apache Spark code… right up until you need to wrangle more performance out of the system. Because PySpark needs to copy data back and forth between the Python runtime and the JVM that Apache Spark runs on, there has always been a lag in performance between Java or Scala and Python code.

Using dataframes/datasets and code generation techniques, a lot of this lag has been removed, but if you’re using something like Pandas in your Python code, that data still has to cross the JVM/Python boundary. Apache Spark 2.3 includes a lot of new code that uses Apache Arrow and its language-independent memory format to reduce the overhead of accessing data from Python. They’ve also expanded Python’s SparkSQL UDF support to fully-vectorized execution, allowing the Python runtime (and perhaps native extensions) to process blocks of rows at once, instead of the slower row-at-a-time method previously provided.

Download Apache Spark 2.3

For a point release, Apache Spark 2.3 includes several compelling new features. If you evaluated Spark in the past and had to dismiss it due to low-latency requirements, you should check out whether Continuous Processing will meet your needs. As for the rest of us, let’s begin the march to Kubernetes and the day when we can send the Apache Hadoop cluster to the farm upstate. We will all be in a better place.

https://www.infoworld.com

No comments:

Post a Comment