
Automated Machine Learning as it is


Automated Machine Learning (AML) is no longer the exclusive domain of academic researchers. It may have been twenty or even ten years ago, when AML as a theory had not yet left university and corporate laboratories. Business waited patiently for the moment when it could draw real benefits from it. That moment has now come.

Before going directly to the subject of the article, let’s define the terminology.


In his lecture course on machine learning, Konstantin Vorontsov gives the following definition:

“The theory of machine learning lies at the junction of applied statistics, numerical optimization methods and discrete analysis. Over the past 50 years, it has taken shape as an independent mathematical discipline. Machine learning techniques form the basis of the even newer discipline of data mining”.

Carrying this definition over to AML, we can describe it as the “theory of automatic (automated) machine learning”. The term “AML” is used in this sense throughout the article.

As happens with any new technology, business has put AML on solid practical ground. Let’s trace the logical chain that led from the first ideas and challenges to the formulation of real objectives and the creation of the first tools in this area.

Data Scientists: a resource in short supply

To apply the “best predictions” correctly, you need to collect large amounts of data and learn to analyze it properly. Naturally, companies create data science divisions, buy data science technologies, and hire data processing and research specialists (data scientists, DS) to realize the full potential of the accumulated information.

Hadoop and similar technologies already make it possible to build valuable “collections” from a variety of mundane data. However, the data scientists who can make sense of that data are not just a scarce resource, but an increasingly scarce one.

Many publications have written about this shortage, and a McKinsey report predicted a shortfall of people with analytical skills lasting at least until 2018.


The shortage is so severe that Harvard Business Review has suggested that companies either stop working in this area altogether or significantly lower their standards for job candidates.

The solution is automation

At IT Business Edge, Loraine Lawson wonders whether artificial intelligence will replace data scientists in the foreseeable future.

An article by Michael Fitzgerald in MIT Sloan Management Review, entitled “Data Scientist in a Can” (2014) and devoted to “analytics as a service”, argued that companies are already trying to automate this function. However, it did not draw a clear line between “outsourcing analytics” and “automating analytics”. On closer examination, these concepts are quite different.

The data mining website KDnuggets conducted its own survey, asking its readers when most “expert-level data scientist” tasks will be performed automatically, or at least when research processes in this area will be automated.

Respondents’ opinions were divided, but not sharply. Only 19% believed that this area will never be automated at the level of human intelligence, while 51% expected such processes to be automated within the next ten years. What has been achieved so far?

The realities of data analytics automation

Practical examples from various fields give grounds for “cautious optimism” about AML.

The Washington Post has written about successful attempts to automate anesthesia during surgery on human patients. MIT Technology Review described a machine learning algorithm that classifies and evaluates paintings (made by humans) more accurately than specially trained art historians.

A report by the consulting firm A.T. Kearney suggests that by 2020 “robo-advisors” will manage $2 trillion in companies’ investment portfolios. An article in The Atlantic noted that almost half of US jobs could be automated.

So, regardless of the scientific value and “inner beauty” of new algorithms, what matters most to business is their practical significance. That is, the goal is to increase net profit, grow sales or reduce costs, whatever it takes. Consider, for example, what is happening in the very modern field of automated vehicle control.

Although full vehicle automation is still only just visible on the horizon, the state of Nevada has for almost a year allowed Daimler Freightliner unmanned trucks on public highways.


AML – A History of Failures

However, AML is not all that new. More than twenty years ago, in 1995, Unica released the Pattern Recognition Workbench (PRW) software package, which used automated trial and error to optimize neural network models.

Three years later, Unica, in partnership with Group 1 Software, developed Model 1, a tool that automated the selection of predictive models for four different types of tasks. Little came of it: both companies were soon forced to sell their assets.

Unica was acquired by IBM. The original PRW product went through a series of reworkings and is now called IBM PredictiveInsight, little more than a collection of “wizards” within IBM Enterprise Marketing Management software.

Two more commercial attempts to create AML software, both dating from the late 1990s, are worth noting.

The first, MarketSwitch, was a marketing optimization solution with built-in AML capabilities. While promoting this product, MarketSwitch boasted that it had hired former specialists from the USSR and even promised to “dismiss all its SAS programmers”. Commercial success never came, however: in 2004 MarketSwitch was acquired by Experian, which repositioned the product as nothing more than a “decision engine” and replaced its AML capabilities with Experian’s own analytical services.

Around the same time, KXEN, a company founded in France in 1998, developed a machine learning engine built around a model selection technique called “structural risk minimization”. The product enjoyed only modest success, and the company was eventually bought by SAP (2013) for just $40 million.

As it now seems, these early AML efforts by Unica, MarketSwitch and KXEN had no significant impact on business for two reasons.

First, they “solved” the problem by defining it too narrowly. The search for a solution was limited to just a few algorithms. This minimized the engineering effort, but at the expense of the model quality and reliability of products that were highly specialized for their time.

Second, they positioned their products as nothing more than tools that would free businesses from the need for experienced analysts.

In other words, the industry did not yet understand why and how to use such products or what their real value was (and it still does not fully understand this today).

Recent AML History

Given the sad experience described above, over the past few years even the leading analytical software vendors (SAS and IBM SPSS) have cautiously limited themselves to adding automated modeling features to their high-end products.

All the software described above was commercial. However, Auto-WEKA, an open source project in the field of AML, is also worth mentioning. Its first version (2013) was a joint project of the University of British Columbia and the University of Freiburg.

In addition, there is Challenges in Machine Learning (CHALEARN), a non-profit organization supported by the National Science Foundation and a number of commercial sponsors. CHALEARN organizes an annual competition for AML products.

What should an AML platform look like?

The requirements for a modern AML platform fall into two categories: support for the machine learning process itself and support for enterprise computing. High-quality AML software must support the machine learning process from start to finish.

– AML software should provide interfaces to relational databases, Hadoop, text files, and common data formats, and should present results in clear and concise visualizations.

– There are hundreds of data processing algorithms. A recent benchmark study tested 179 of them in a single problem area alone. The best way to determine the right algorithm for a given problem and data set is to try the methods out and compare the results.

A data scientist tries a large number of methods and selects the one that works best on the specific data set. But so far there is no generally accepted method that formalizes this work and performs it well without human involvement.

– AML software must include best practices for preprocessing and cleaning data before training.

– Even with heuristics and self-tuning, a comprehensive experimental design may require thousands of model training and testing cycles. AML software should therefore be built for high computing performance and fast training.

– No manager will approve the deployment of a system without a clear understanding of how the model behaves and whether it meets certain requirements. AML software should provide tools that let experts and business users evaluate the results of a modeling experiment, check for anomalies, compare models and, in some cases, perhaps even override the automatic choice.

The best predictive model in the world is worth nothing if it never gets deployed. This is exactly what happened to the winner of the Netflix Prize competition, a complex solution known as Pragmatic Chaos. Today that project is, so to speak, in the dustbin of history. Netflix did pay out the prize, but then shelved the solution because it turned out to be too expensive to deploy and operate.
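The "try many methods and keep the one that works best" step from the list above can be illustrated with a small sketch. It is not the method of any product mentioned in this article. The data is synthetic, the three toy candidates stand in for real algorithms, and every name in it (make_data, fit_mean, fit_linear, fit_nearest, cross_val_mse, select_model) is hypothetical and chosen only for this example. The sketch scores each candidate by k-fold cross-validation and picks the one with the lowest mean squared error.

```python
import random
import statistics


def make_data(n=200, seed=0):
    """Synthetic regression data, y = 3x + 2 + noise (illustrative assumption)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0, 1) for x in xs]
    return xs, ys


# Each candidate is a function fit(train_x, train_y) that returns predict(x).
def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Ordinary least squares for a single input feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """1-nearest-neighbour: return the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def cross_val_mse(fit, xs, ys, k=5):
    """Mean squared error of a candidate, estimated by k-fold cross-validation."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        # Every k-th point, starting at `fold`, is held out for testing.
        train_x = [xs[i] for i in range(n) if i % k != fold]
        train_y = [ys[i] for i in range(n) if i % k != fold]
        predict = fit(train_x, train_y)
        for i in range(fold, n, k):
            squared_errors.append((predict(xs[i]) - ys[i]) ** 2)
    return statistics.fmean(squared_errors)


def select_model(candidates, xs, ys, k=5):
    """Score every candidate and return (name of the best one, all scores)."""
    scores = {name: cross_val_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}

if __name__ == "__main__":
    xs, ys = make_data()
    best, scores = select_model(CANDIDATES, xs, ys)
    for name, mse in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{name:8s} MSE = {mse:.3f}")
    print("selected:", best)
```

With three candidates and five folds, this sketch already runs 15 training cycles. Hundreds of algorithms, each with many possible settings, quickly push that number into the thousands, which is why computing performance appears among the requirements above.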

In addition …

To meet the needs of a modern enterprise, AML software must satisfy three additional requirements.

– AML software should be based on open source software. Development in openly available analytical languages, such as Python and R, moves much faster than in commercial tools. In addition, using open source simplifies integration with Big Data stacks and reduces the total cost of ownership (TCO).

– AML software must support a variety of user profiles, including experienced users, analysts, advanced business users, visualizers and others.

– AML software must scale with the enterprise along several dimensions: users, projects, models and data volumes.

In practical terms, this means that the software must support deployment to Hadoop, rely on standard database integration, and have low operating costs, whether in the cloud or on-premises.