Marketers and businesses that want to go beyond the machine learning (ML) tools discussed so far may need more customization to deploy their own models online. They may also want their developers to make fuller use of those models, which calls for a different set of tools.

Self-Managed and Hosted Tools

Cloudera Oryx

Oryx is designed primarily to run on Apache Hadoop. For those unfamiliar with it, Hadoop is an open-source software platform for storing data sets and running applications across clusters of commodity hardware. It can store vast amounts of data of any kind, provides a high level of processing power, and can handle a virtually limitless number of concurrent tasks or jobs.

Oryx 2 is a realization of the Lambda Architecture built on Apache Spark and Apache Kafka, specialized for real-time machine learning at large scale. The framework makes it easy to build new applications, and it also ships with packaged end-to-end applications for collaborative filtering, regression, clustering, and classification.

[Figure: Oryx flow]

Oryx comprises three tiers, each building on the one below:

  1. A generic Lambda Architecture tier that provides batch, speed, and serving layers and is not specific to ML.
  2. A specialization on top that provides ML abstractions for hyperparameter selection and related concerns.
  3. An end-to-end implementation of standard ML algorithms on top of that: alternating least squares for collaborative filtering, random decision forests, and k-means.

Oryx Implementation

Start the three layers with:

./oryx-run.sh batch
./oryx-run.sh speed
./oryx-run.sh serving

These layers do not all have to run on one machine, but they can, provided the configuration assigns distinct ports to the batch- and speed-layer Spark web UIs and to the serving-layer API. The serving layer can also be run on many machines at once.

Unless the configuration changes it, the batch layer's Spark UI runs on port 4040 of the machine where it was started. By default, the serving layer exposes a web-based console on port 8080.
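These ports are set in Oryx's HOCON-style configuration file. The fragment below is purely illustrative: the key names here are assumptions, so check the Oryx reference configuration for the actual ones.

```hocon
# Hypothetical oryx.conf fragment -- key names are illustrative, not authoritative
oryx {
  serving {
    api.port = 8080   # serving-layer REST API and web console
  }
  batch {
    ui.port = 4040    # batch-layer Spark web UI
  }
  speed {
    ui.port = 4041    # speed-layer Spark web UI
  }
}
```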

Example of Use

A sample of the GroupLens 100K data set is provided in a file. The data set must first be converted to CSV:

[Figure: GroupLens conversion]
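The original conversion snippet is not reproduced above. As a minimal sketch: the GroupLens 100K ratings file (u.data) is tab-separated with columns userID, itemID, rating, timestamp, and the assumption here is that the serving layer wants userID,itemID,rating CSV with the timestamp dropped.

```python
import csv

def grouplens_to_csv(src_path, dst_path):
    """Convert tab-separated GroupLens u.data rows to user,item,rating CSV."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            user, item, rating = row[0], row[1], row[2]  # drop the timestamp column
            writer.writerow([user, item, rating])
```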

Provide the input to the serving layer with any local command-line tool, such as curl:

[Figure: GroupLens curl command]
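The exact curl invocation is not reproduced above. A sketch of the same POST using Python's standard library instead of curl; the /ingest endpoint name and text/csv content type are assumptions based on the Oryx serving API, so verify them against your deployment.

```python
from urllib.request import Request, urlopen

def build_ingest_request(host, csv_body, endpoint="/ingest", port=8080):
    """Build a text/csv POST aimed at the serving layer's (assumed) ingest endpoint."""
    return Request(
        f"http://{host}:{port}{endpoint}",
        data=csv_body.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )

def post_csv(host, csv_body):
    """Send the request; returns the HTTP status code from the serving layer."""
    with urlopen(build_ingest_request(host, csv_body)) as resp:
        return resp.status
```

This is the programmatic equivalent of `curl -X POST -H "Content-Type: text/csv" --data-binary @data.csv http://localhost:8080/ingest`.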

If you tail the input topic, you will see a substantial amount of CSV data flow to the topic:

[Figure: GroupLens CSV data]

After a few moments, the batch layer will trigger a new computation. The sample configuration starts one every five minutes.

The data is first delivered to HDFS. The sample configuration writes it to the directory hdfs:///user/example/Oryx/data/. Inside are directories named by timestamp, each containing Hadoop part files with the input stored as SequenceFiles of Text. Although they are not pure text, printing them should reveal recognizable data, since the payload is in fact text.

[Figure: GroupLens text data]

Soon afterward, a model computation begins, visible as a series of new distributed jobs in the batch layer. In the sample configuration, the Spark UI is started at http://your-batch-layer:4040.

The finished model is persisted as a combination of PMML and supporting data in a subdirectory of hdfs:///user/example/Oryx/model/. For example, the model PMML files contain elements such as the following:

[Figure: GroupLens PMML]
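Since PMML is just XML, the model files can be inspected programmatically. A minimal sketch using the standard library; the sample document below is invented for illustration and is not an actual Oryx model.

```python
import xml.etree.ElementTree as ET

# Invented, minimal stand-in for a real PMML model document.
SAMPLE_PMML = """\
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <Header/>
  <Extension name="features" value="10"/>
</PMML>"""

def pmml_extensions(doc):
    """Return {name: value} for the top-level <Extension> elements of a PMML doc."""
    ns = {"p": "http://www.dmg.org/PMML-4_2"}
    root = ET.fromstring(doc)
    return {e.get("name"): e.get("value") for e in root.findall("p:Extension", ns)}
```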

The accompanying Y/ and X/ subdirectories contain feature vectors such as:

[Figure: GroupLens feature vectors]

If you tail the update topic, you will see these values published to the topic. The serving layer soon picks up the publication, and its /ready endpoint begins returning status 200 OK:

[Figure: GroupLens publish]
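In a deployment script you might wait for that 200 OK before sending queries. A small sketch with the HTTP fetch injected as a callable, so the polling logic can be exercised without a live serving layer; in practice `fetch` would wrap a GET of http://your-serving-layer:8080/ready.

```python
import time

def wait_until_ready(fetch, attempts=30, delay=2.0):
    """Poll until fetch() reports HTTP 200; return True if it did within `attempts`.

    `fetch` is any zero-argument callable returning an HTTP status code,
    for example one wrapping urllib.request.urlopen on the /ready endpoint.
    """
    for _ in range(attempts):
        if fetch() == 200:
            return True
        time.sleep(delay)
    return False
```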

Apache Mahout

Apache Mahout’s primary features include the following:

Mahout’s core algorithms for classification, clustering, and batch-based collaborative filtering are implemented on Apache Hadoop using the map/reduce paradigm. The project is not restricted to Hadoop-based implementations, however: contributions that run on a single node or on a non-Hadoop cluster are welcome. For example, the “Taste” collaborative-filtering recommender portion of Mahout began as a separate project and runs stand-alone without Hadoop.
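Mahout’s Taste API is Java, so as a language-neutral illustration of the idea behind user-based collaborative filtering (not Mahout code), here is a tiny recommender: score the items a user has not rated by the similarity-weighted ratings of other users.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def recommend(user, ratings, top_n=3):
    """Rank unseen items by similarity-weighted ratings from other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```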

An environment consisting of a back-end-independent algebraic optimizer and an algebraic Scala DSL that unifies in-memory and distributed algebraic operators. The algebraic back ends currently supported include Apache Spark, Apache Flink, and H2O, while the older MapReduce algorithms are being phased out.


H2O

H2O is an open-source, scalable ML and predictive analytics platform that makes it easy to build ML models on large data sets. H2O also makes it straightforward to productionize those models in an enterprise setting.

The platform’s code is written in Java. Within H2O, a distributed key/value store is used to access and reference models, data, objects, and other resources across all machines and nodes.

Algorithms are implemented on top of H2O’s distributed Map/Reduce framework and use the Java Fork/Join framework for multi-threading. Data is read in parallel, distributed across the cluster, and stored in memory in a compressed columnar format. H2O’s data parser also has built-in intelligence to guess the schema of an incoming data set, and the platform supports data ingestion from multiple sources in a variety of formats.
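The schema-guessing idea can be illustrated with a simple sketch (this is not H2O’s actual implementation): inspect the sample values in each column and infer the narrowest type that fits.

```python
def guess_type(values):
    """Guess a column type from sample string values: 'int', 'float', or 'str'."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                caster(v)  # raises ValueError if any value doesn't fit this type
            return name
        except ValueError:
            continue
    return "str"

def guess_schema(header, rows):
    """Map each column name to the narrowest type that fits its sample values."""
    columns = list(zip(*rows))  # transpose rows into per-column value tuples
    return {name: guess_type(col) for name, col in zip(header, columns)}
```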

H2O’s REST API exposes all of the platform’s capabilities to external programs and scripts via JSON over HTTP. The H2O web interface, R bindings, and Python bindings all use this REST API.

Accord.NET Framework

For developers and marketers already working in the .NET environment, the Accord.NET Framework stands out as a robust open-source option for deploying models.

Accord.NET is a .NET machine learning framework combined with audio and image processing libraries written in C#. It is a comprehensive framework for building production-grade computer vision, computer audition, signal processing, and statistics applications.


Spark MLlib

MLlib, Apache Spark’s popular ML library, consists of common learning algorithms and utilities, including regression, clustering, collaborative filtering, dimensionality reduction, and classification.

The library fits naturally into Spark’s APIs and interoperates with NumPy in Python and with R libraries. It can also consume any Hadoop data source, making it easy to plug into Hadoop workflows.

Fully Managed Tools


Yhat

Yhat is a Y Combinator-backed company delivering an end-to-end data science platform for developing, deploying, and managing real-time decision APIs.

The platform removes the IT obstacles that come with cloud-based data science, such as server setup and configuration. With it, data scientists can turn static insights into production decision-making APIs that work with any customer- or employee-facing application. Yhat is also behind Rodeo, an open-source integrated development environment (IDE) for Python.

Its product ScienceOps aims to give data scientists and developers the ability to deploy models quickly and seamlessly into any environment. These models are used for lead scoring, image recognition, credit analysis, and many other predictive applications.

[Figure: Yhat data]

Data scientists and developers use Python or R to deploy models quickly and easily to a Yhat instance. Developers can then use a REST API to embed those models in web applications, enabling real-time ML; one excellent use case is real-time submission and approval of credit applications.

Below is an example of a model deployed through Yhat.

To use Yhat, you build your model in a Python or R session, fill in the username and API key fields, and then run the script.

[Figure: Yhat model]

The Yhat platform comes with a basic Beer Recommender which provides a simple use case of the platform to generate accurate predictions based on the specific data that a user enters.
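A minimal sketch of what such a deployment script might look like in Python. The recommendation logic is kept as a plain, testable function; the client calls in `deploy_model` reflect the historical yhat Python client (`Yhat`, `YhatModel`, `deploy`), and those names, the URL, and the similarity data are assumptions, so adjust them to the client version you actually have.

```python
def similar_beers(beer, similarity):
    """Plain recommendation logic: beers ranked by similarity to the given one."""
    scores = similarity.get(beer, {})
    return sorted(scores, key=scores.get, reverse=True)

def deploy_model(username, apikey):
    """Wrap the logic in a model class and deploy it via the yhat client.

    Yhat/YhatModel/deploy are assumed from the old yhat client; the
    endpoint URL and similarity table below are illustrative only.
    """
    from yhat import Yhat, YhatModel  # hypothetical: historical yhat client

    similarity = {"Sierra Nevada Pale Ale": {"Dale's Pale Ale": 0.9}}

    class BeerRecommender(YhatModel):
        def execute(self, data):
            return {"beers": similar_beers(data["beer"], similarity)}

    yh = Yhat(username, apikey, "https://cloud.yhathq.com/")
    yh.deploy("BeerRecommender", BeerRecommender, globals())
```

Keeping the scoring logic separate from the deployment wrapper means the same function can be unit-tested locally and reused behind the REST API.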

Now that we’ve looked at how marketers with an advanced understanding of ML can develop and deploy models online, it’s time to consider how data scientists take modeling to the next level in the next chapter of this guide.