Diving deeper into the topics surrounding machine learning, we’re confronted with a copious amount of jargon. Understanding how professionals in the space discuss these topics helps us become familiar with the terms we’ll run into along the way.

Our goal is to provide an understanding of the various topics without getting too deep into the technical details. We know your time is important, so we’re making sure that the time spent learning the language of machine learning will pay off when we get to practical applications and use cases.

Revisiting Supervised and Unsupervised Machine Learning

We touched on the concepts of supervised and unsupervised learning in Chapter 1; however, these topics are important, and they deserve a more in-depth study.

Supervised Machine Learning

As previously discussed, supervised machine learning involves human interaction elements to manage the machine learning process. Supervised machine learning makes up most of the machine learning in use.

The easiest way to understand supervised machine learning is to think of it involving an input variable (x) and an output variable (y). You use an algorithm to learn a mapping function that connects the input to the output. In this scenario, humans are providing the input, the desired output, and the algorithm.

Let’s look at supervised learning in terms of two types of problems:

Classification – Classification problems use categories as the output variable. Example categories would be demographic data such as sex or marital status. A common model for these types of problems is the support vector machine. Despite its odd name, a support vector machine describes a linear decision boundary between classes that maximizes the margin, which is the width of the gap separating the classes.

Regression – Regression problems are those where the output variable is a real number. A common form of these problems is linear regression. Linear regression models determine the impact of a number of independent variables on a dependent variable (such as sales) by seeking a “best fit” that minimizes squared error. Other regression models combine linear models for more complex scenarios.
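To make the “best fit” idea concrete, here is a minimal sketch of a linear regression with a single independent variable, using the standard closed-form least-squares formulas. The function name `fit_line` and the advertising and sales figures are hypothetical, invented only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (x) and resulting sales (y)
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales = {slope:.2f} * ad_spend + {intercept:.2f}")
```

Because the made-up data lies exactly on a line, the fitted slope and intercept recover that line; real data would leave some squared error behind.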

Supervised learning, one of the two basic types of machine learning, uses labeled data to build models that make predictions. “Labeled” means that the data represents something in the past for which the outcome is known (for example, an item purchased).

Ensemble Methods – These learning algorithms construct a set of classifiers and then classify new data points by taking a weighted vote of their predictions (the original ensemble method was known as Bayesian averaging). This mathematical concept estimates the mean of a population using outside information, including pre-existing beliefs, much like what you’d do to find a weighted average. Ensemble methods reduce variance in individual models by combining a number of them together and averaging predictions.
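The weighted vote can be sketched in a few lines. This is an illustration only: the function `weighted_vote`, the labels, and the weights are all hypothetical, and a real ensemble would learn its weights from data.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label whose supporting classifiers carry the most total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions from three classifiers, with made-up weights
predictions = ["spam", "not spam", "spam"]
weights = [0.2, 0.5, 0.4]

winner = weighted_vote(predictions, weights)
print(winner)
```

Here the two weaker “spam” votes together outweigh the single stronger “not spam” vote, which is the point of combining models rather than trusting one.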

Logistic Regression – This regression model comes into play when the dependent variable is categorical. Given this categorical dependent variable, logistic regression gets used as a classifier. This model maps inputs to a value between 0 and 1, which is read as the probability of one of two outcomes (such as true or false, or pass or fail). This two-outcome model is also called binomial or binary logistic regression.
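A small sketch shows how a logistic model turns a number into a pass or fail answer. The weight, bias, and threshold below are hypothetical values chosen for illustration; in practice they would be learned from labeled data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_pass(hours_studied, weight=1.5, bias=-4.0, threshold=0.5):
    """Return (probability of passing, True if predicted to pass)."""
    probability = sigmoid(weight * hours_studied + bias)
    return probability, probability >= threshold


for hours in (1, 4):
    probability, passed = predict_pass(hours)
    print(hours, round(probability, 3), "pass" if passed else "fail")
```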

Unsupervised Machine Learning

As opposed to supervised learning, unsupervised learning involves only input data (x), with no corresponding output variable. In this model, a correct answer doesn’t exist, and a “teacher” is not needed for the “learner.”

You’ll find two types of unsupervised machine learning: clustering and association.

Clustering – This type describes techniques that attempt to divide data into groups clustered close together. An example of clustering is grouping customers by their purchasing behavior.

Association – This type describes techniques that create rules that explore the connections among data. An example is helpful here: We might say that people who buy X product also often buy Y product.
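The “people who buy X also often buy Y” rule can be measured directly. The sketch below computes the confidence of such a rule, meaning the share of baskets containing X that also contain Y. The function name `rule_confidence` and the shopping baskets are hypothetical.

```python
def rule_confidence(baskets, x, y):
    """Fraction of baskets containing x that also contain y."""
    with_x = [basket for basket in baskets if x in basket]
    if not with_x:
        return 0.0
    with_both = [basket for basket in with_x if y in basket]
    return len(with_both) / len(with_x)


# Hypothetical purchase data: each set is one customer's basket
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

confidence = rule_confidence(baskets, "bread", "butter")
print(f"bread -> butter confidence: {confidence:.2f}")
```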

Semi-supervised Machine Learning

But wait … you’ll also find a third type of machine learning technique: semi-supervised machine learning. Think of this type as a hybrid between the two previously mentioned models. In most cases, this type of learning happens when you’ve got a large data set for (x), but only some of that data has a known output (y) to learn from.

Semi-supervised machine learning can be used with regression and classification models, and you can also use it to create predictions.


Decision trees build a series of branches from a root node, splitting nodes into branches based on the “purity” of the resulting branches. You use decision trees to classify instances: One starts at the root of the tree. By taking appropriate branches according to the attribute or question asked at each branch node, one eventually comes to a leaf node. The label on that leaf node is the class for that instance.

This modeling is the most intuitive type of modeling. You’ve likely used some version of it in your school or professional life.
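Classifying an instance with a decision tree amounts to walking from the root to a leaf. The sketch below assumes a tiny hand-built tree stored as nested dictionaries; the attributes, values, and labels are hypothetical, and the tree is written by hand rather than learned from data.

```python
# Hypothetical tree: branch nodes ask about an attribute, leaf nodes carry a label
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {
                True: {"label": "stay in"},
                False: {"label": "play"},
            },
        },
        "rainy": {"label": "stay in"},
    },
}


def classify(node, instance):
    """Follow the branch matching the instance at each node until a leaf is reached."""
    while "label" not in node:
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


print(classify(tree, {"outlook": "sunny", "windy": False}))
```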

Backed-up Error Estimate – In order to prune a decision tree and keep the “purity” of each branch of the tree, you must decide whether the estimated error in classification is greater if the branch is present or pruned. A system to measure this takes the previously computed estimated errors associated with the child nodes of the branch, multiplies them by the estimated frequencies with which the current branch will classify data to each child node, and sums the resulting products. Training data gets used to estimate how often instances are classified as belonging to each child node. This sum is the backed-up error estimate for the branch node.
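The arithmetic is a weighted sum, as this minimal sketch shows. The error values and frequencies are made up, and the pruning rule in `should_prune` (prune when the error as a single leaf is no worse than the backed-up estimate) is an assumption added for illustration.

```python
def backed_up_error(child_errors, child_frequencies):
    """Sum of each child's estimated error times the share of data sent to it."""
    return sum(error * freq for error, freq in zip(child_errors, child_frequencies))


def should_prune(leaf_error, child_errors, child_frequencies):
    """Assumed rule: prune if turning the branch into a leaf is no worse."""
    return leaf_error <= backed_up_error(child_errors, child_frequencies)


# Hypothetical estimates for a branch node with two children
child_errors = [0.2, 0.4]
child_frequencies = [0.75, 0.25]  # estimated from training data; sums to 1

estimate = backed_up_error(child_errors, child_frequencies)
print(round(estimate, 3))
```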

Naive Bayes – Naive Bayes classifiers are based on applying Bayes’ theorem with naive independence assumptions between the features. Bayesian inference focuses on the likelihood that something is true given that something else is true. For example, if you’re given the height and weight of an individual and are asked to determine whether that person is male or female, naive Bayes can help you to make assumptions based on the data.
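The height and weight example can be sketched as a small Gaussian naive Bayes calculation. Every number below (the means, standard deviations, and equal priors) is invented for illustration and not drawn from real measurements. The “naive” part is the loop that simply multiplies the per-feature likelihoods as if height and weight were independent.

```python
import math


def gaussian_pdf(x, mean, std):
    """Likelihood of x under a normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


# Hypothetical (mean, standard deviation) per class for height in cm and weight in kg
stats = {
    "male": {"height": (178.0, 7.0), "weight": (80.0, 10.0)},
    "female": {"height": (165.0, 7.0), "weight": (65.0, 10.0)},
}
priors = {"male": 0.5, "female": 0.5}


def posterior(features):
    """Apply Bayes' theorem, treating the features as independent."""
    scores = {}
    for label, feature_stats in stats.items():
        score = priors[label]
        for name, value in features.items():
            mean, std = feature_stats[name]
            score *= gaussian_pdf(value, mean, std)
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


result = posterior({"height": 180.0, "weight": 82.0})
print(max(result, key=result.get))
```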

Reinforced Learning

As discussed in the previous chapter, reinforced learning is a machine learning evolution that involves neural network development. Reinforced learning combines multiple layers of neural networks into complex models that are useful for building intelligent systems.

Neural Networks – A neural network is a collection of artificial neurons, organized into layers, that are linked by directed weighted connections. Each neural network layer feeds the next. When the network runs, its input units are clamped to desired values.

Clamping – This action takes place when a unit, also called a “neuron,” in a neural network has its value forcibly set and fixed to some externally provided value.

Activation Function – This function describes the output behavior of a neuron. In most networks, the activation function is applied to the weighted sum of the neuron’s inputs.
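As a rough sketch, a single neuron computes a weighted sum of its inputs and hands the result to an activation function. The two activation functions below (ReLU and sigmoid) are common choices that we picked for illustration; the inputs, weights, and bias values are hypothetical.

```python
import math


def relu(z):
    """Pass positive values through and output zero otherwise."""
    return max(0.0, z)


def sigmoid(z):
    """Squash any real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


inputs = [1.0, 2.0]
weights = [0.5, -1.0]

relu_out = neuron_output(inputs, weights, 0.0, relu)
sigmoid_out = neuron_output(inputs, weights, 1.5, sigmoid)
print(relu_out, sigmoid_out)
```

The same weighted sum produces very different outputs depending on the activation function, which is why the choice matters when a network is designed.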

Asynchronous vs. Synchronous – Some neural networks have layers of computations that “fire” at the same time, with their nodes sharing a common clock. These networks are synchronous.

Other networks have levels that fire at different times; these networks are asynchronous. Before you stop us, let’s clarify something for you: We’re not saying that there’s not a pattern to how the levels handle input and output data; we’re only saying that the levels aren’t firing in a precisely timed way.

Mathematical Concepts

Dimensionality Reduction – This concept uses linear algebra to find correlations among the variables in a data set so that the data can be represented with fewer dimensions.

Principal Component Analysis (PCA) – This process identifies uncorrelated variables called principal components from a large set of data. The goal of principal component analysis is to explain the maximum amount of variance with the fewest number of principal components. This process is often used in utilities that work with large data sets.
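For two variables, the share of variance explained by the first principal component can be worked out by hand from the covariance matrix. The sketch below is limited to that two-dimensional case, where the eigenvalues of a 2x2 symmetric matrix have a closed form; the function name and sample points are hypothetical. Larger problems would normally rely on a linear algebra library.

```python
import math


def explained_variance_2d(points):
    """Share of total variance captured by the first principal component (2-D only)."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    var_x = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    var_y = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    cov_xy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of the covariance matrix [[var_x, cov_xy], [cov_xy, var_y]]
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    first, second = half_trace + spread, half_trace - spread
    return first / (first + second)


# Perfectly correlated points: one component explains everything
line_points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
# Uncorrelated points with equal spread: no direction is favored
square_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

ratio_line = explained_variance_2d(line_points)
ratio_square = explained_variance_2d(square_points)
print(round(ratio_line, 3), round(ratio_square, 3))
```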

Singular Value Decomposition (SVD) – This technique combines information from several vectors and forms basis vectors that can explain most of the variance in the data.

Graph Analysis – This process involves using “edges” connected to other numerical data points to analyze networks. The data points are known as nodes, and the edges are the ways in which they get connected. Facebook’s EdgeRank, which was replaced by a more advanced machine learning algorithm, got its name from graph theory.
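Nodes and edges map naturally onto a simple data structure. This sketch stores an undirected graph as an adjacency mapping and counts each node’s connections (its degree), one of the most basic measures in graph analysis. The names and friendships are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected edges: each pair is a connection between two nodes
edges = [("ana", "ben"), ("ana", "cy"), ("ben", "cy"), ("cy", "dee")]


def build_adjacency(edge_list):
    """Map each node to the set of nodes it is directly connected to."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


adjacency = build_adjacency(edges)
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
most_connected = max(degree, key=degree.get)
print(most_connected, degree[most_connected])
```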

Similarity Measures – Sometimes known as a similarity function, a similarity measure quantifies the similarity between two objects. Cosine similarity is a commonly used similarity measure. This measurement is used in information retrieval to score the similarity between documents. Clustering uses similarity measures to determine the distance between data points. Shorter distances are equivalent to greater similarity.
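Cosine similarity is the dot product of two vectors divided by the product of their lengths. In this minimal sketch, each hypothetical document is represented as a vector of made-up word counts.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical word-count vectors for three documents
doc_a = [1, 1, 0]
doc_b = [2, 2, 0]  # same mix of words as doc_a, just a longer document
doc_c = [0, 0, 3]  # shares no words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Because the measure depends on direction rather than length, a document and a longer document with the same proportions of words score as highly similar.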