Thursday, 3 May 2018

Why you should use Python for machine learning

What is it about Python—the language, the ecosystem, the development processes around them—that has made it into such a favorite for data science?

Python has long enjoyed growing popularity in many areas of software development—scripting and process automation, web development, general applications. More recently it has become a leading language in machine learning. In this article we’ll look at the four major reasons why Python has become a juggernaut in that field.

The first major reason is of a piece with why Python has become a general success story: the language makes things simple and keeps them simple.

When Python was first developed, a major goal of the language was to be easy to both write and read. Code is read far more often than it’s written, especially in environments where it changes hands from one team to another. If you’re inheriting a machine learning application from another developer, especially one that makes use of multiple third-party components or has a good deal of custom business logic, it helps to have it written in a language that adds as little extra cognitive overhead as possible. Well-written Python code tends to have that quality, often more so than equivalent code in more verbose languages.

Another key way Python’s language design is useful for machine learning is by providing high-level, object-based abstractions for tasks. Machine learning applications are the result of complex, multi-stage workflows. The more attention you can pay to the essence of what needs to be accomplished, and the less attention you must pay to the nitty-gritty of the implementation, the better. Python puts enough distance between you and the implementation details that you’re not overwhelmed by them at first glance.
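As a rough illustration of that kind of abstraction, the sketch below models a multi-stage workflow as a list of plain Python functions. The Pipeline class and the two stage functions are hypothetical names invented for this example; they are not part of any framework mentioned in this article.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical example: each stage is just a function from a list to a list.
Stage = Callable[[list], list]


@dataclass
class Pipeline:
    """Runs a sequence of stages in order, passing each result to the next."""
    stages: List[Stage] = field(default_factory=list)

    def run(self, data: list) -> list:
        for stage in self.stages:
            data = stage(data)
        return data


def drop_missing(values):
    """Remove placeholder entries (None) from the raw input."""
    return [v for v in values if v is not None]


def scale_to_unit_range(values):
    """Rescale values linearly so the smallest is 0.0 and the largest is 1.0."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [(v - low) / span for v in values]


pipeline = Pipeline([drop_missing, scale_to_unit_range])
result = pipeline.run([2.0, None, 4.0, 6.0])
print(result)
```

The code that describes the workflow (the last three lines) reads almost like a summary of the task, while the details of each stage stay out of the way.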

Python has the machine learning libraries
The second major reason Python has become a machine learning workhorse is the wealth of machine learning libraries and frameworks available for it. Beginning with the venerable Scikit-learn, nearly every big-name machine learning and deep learning framework (TensorFlow, CNTK, Apache Spark MLlib) is either a first-class citizen in the Python world or has a Python API. Some, like PyTorch, are written with Python specifically in mind, as the name implies, but without compromising performance.

Python’s library ecosystem allows many of these frameworks to be installed in one’s workspace with little more than a single command. Some of this was made possible only fairly recently, after Python adopted the wheel packaging format, which makes it easier to distribute the platform-specific binaries needed for many machine learning frameworks.

Note that this packaging system can still fall short, which has inspired some workarounds. Distributions like Anaconda have their own packaging mechanisms to deal with issues like binary dependencies from outside the Python ecosystem. But by and large the Python package ecosystem provides a level of convenience for working with machine learning that echoes the ease and convenience found throughout Python generally.

Python handles memory management for you
The abstraction provided by high-level languages like Python, and the jobs they’re used for, extend into many other realms. In Python, the finer details of memory management are concealed from the programmer, who consequently has more mental bandwidth to focus on the problem at hand.

Python’s built-in constructs and data abstractions—lists, sets, dictionaries, and tuples—are all memory-managed by the Python runtime. Java works much the same way, but Python is generally less verbose than Java and puts fewer procedural barriers between the user and the end results.
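To make that concrete, here is a minimal sketch of the runtime reclaiming an object once nothing refers to it. The Batch class is a hypothetical stand-in for any application object, and the immediate reclamation shown here is an assumption about CPython's reference counting; other Python runtimes may free the object later.

```python
import gc
import weakref


class Batch:
    """Hypothetical container standing in for any application object."""

    def __init__(self, rows):
        self.rows = rows


# Built-in lists, dicts, and sets nested inside an ordinary object.
batch = Batch(rows=[{"id": 1, "tags": {"a", "b"}}, {"id": 2, "tags": set()}])

# A weak reference lets us observe the object without keeping it alive.
probe = weakref.ref(batch)
alive_before = probe() is not None

del batch       # drop the only strong reference; there is no explicit free step
gc.collect()    # not needed by CPython in this case; included for other runtimes

alive_after = probe() is not None
print(alive_before, alive_after)
```

At no point does the program allocate or release memory by hand; the list, dictionaries, and sets are cleaned up along with the object that held them.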

Machine learning apps use Python’s memory-managed constructions more for the sake of organizing an application’s logic or data flow than for performing actual computation work. Most of the computational heavy lifting is handled by external libraries like NumPy (more on those below). But again, the abstraction provided by the language and runtime means that the memory management duties for such things are automatically handled several layers below the user’s actions.

That said, it pays to learn how Python manages memory internally. Python trades efficiency for ease of use, sometimes in ways that are not always obvious. For the best possible performance across the board, you’ll eventually want to “lift the hood” and work with the lower-level abstractions that are available.
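One small example of that trade-off, sketched under the assumption that you are running CPython: a list of floats stores a pointer to a separately allocated object for every number, while the standard library's array type packs the same numbers as raw C doubles. Note that sys.getsizeof reports only an object's own size, so the list's elements are added in by hand here.

```python
import sys
from array import array

values = [float(i) for i in range(10_000)]

# A list holds pointers to individually allocated float objects,
# so count the list itself plus every element it points to.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array("d") stores the same numbers as packed 8-byte C doubles.
packed = array("d", values)
packed_bytes = sys.getsizeof(packed)

print(f"list of floats: {list_bytes} bytes")
print(f"packed array:   {packed_bytes} bytes")
```

The exact figures vary by Python version and platform, which is why none are quoted here, but the packed form should come out considerably smaller. Numeric libraries rely on the same idea at much larger scale.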

For some examples of this, see the “Python Memory Management” section in the documentation for the Theano machine learning library. (Theano is no longer developed, but many of the principles discussed apply generally.) Another way to optimize performance is Cython, a compiler that translates Python code, along with Cython’s own typed superset of Python, into C extension modules. Cython allows direct access to C’s memory management and data constructs in ways that “vanilla” Python doesn’t.

Python’s speed is not an issue
“Convenient, but not fast” is how many people describe Python. For the most part, that’s correct. With Python, you’re generally trading some raw performance for ease of development. So if Python is not the fastest language available, why use it for computationally intensive work like machine learning?

The short answer: It’s not Python that is doing the computationally intensive work.

The vast majority of the actual computation work done in Python machine learning applications is performed by libraries, typically written in C, C++, or Java, that Python wraps and interacts with. The parts of the application that run in Python typically aren’t performance sensitive—they handle setup and teardown, command and control, coordination between components, logging and reporting, and so on.

Python’s use of external libraries can still lead to performance problems, though, if the application spends a lot of time moving back and forth between the Python and accelerated-library domains. Each such context switch incurs a performance hit, so a developer needs to minimize the number of round trips between Python and external APIs. That said, this issue isn’t unique to machine learning; it’s a common anti-pattern in Python and programming generally, so it isn’t hard to identify.
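The sketch below shows the shape of the problem using only the standard library, as a stand-in for a real accelerated library such as NumPy. Both functions compute the same dot product. The first runs one interpreted loop iteration per element, while the second hands the whole job to built-ins that CPython implements in C. The function names are hypothetical, and integer data is used so that the two results match exactly.

```python
import operator

a = list(range(1_000))
b = list(range(1_000))


def dot_many_crossings(xs, ys):
    """Multiply and accumulate one element at a time in interpreted Python."""
    total = 0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_one_crossing(xs, ys):
    """Let the C-implemented sum(), map(), and operator.mul do the looping."""
    return sum(map(operator.mul, xs, ys))


slow = dot_many_crossings(a, b)
fast = dot_one_crossing(a, b)
print(slow == fast)
```

The same principle applies to any wrapped library: prefer one call that processes a whole batch over many calls that each process a single item. How much time this saves depends on the workload, so it is worth measuring with a tool such as the timeit module rather than assuming.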

It is also possible to speed up pure-Python code, if need be, through a variety of tools: Cython (for converting Python to C), Numba (for JIT-accelerating math-centric code), PyPy (for JIT-accelerating Python code), and so on.
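As a hedged sketch of what that can look like with Numba, the function below is ordinary Python with a single decorator added. Numba is a third-party package, so this example assumes it may not be installed and falls back to a do-nothing decorator in that case; the function name is hypothetical.

```python
try:
    from numba import njit  # third-party JIT compiler; used only if installed
except ImportError:
    def njit(func):
        """Fallback: leave the function as ordinary interpreted Python."""
        return func


@njit
def sum_of_squares(n):
    """Math-centric loop of the kind a JIT compiler is designed to speed up."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(1_000)
print(result)
```

With Numba present, the first call compiles the function to machine code and later calls reuse the compiled version. Without it, the code runs unchanged, only slower. Whether the speedup is worth the extra dependency depends on how much time the application actually spends in loops like this one.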

Ultimately, it’s the whole package, not just any one feature, that makes Python appealing for machine learning: an easy-to-learn and easy-to-use language, an ecosystem of third-party libraries that cover a vast range of machine learning use cases, and performance to match the job at hand.

https://www.infoworld.com
