A Comprehensive Guide to Machine Learning
Our Vector Report served as a delicious appetizer for the world of Machine Learning and how it applies to SEO and search. As we’ve noted, the core of modern search builds upon models based on Machine Learning. As the field has expanded, both search engines and search practitioners have worked to incorporate more robust Machine Learning processes into their methodologies and technologies.
Most of our conversation will center on Google, its expansion of Machine Learning technologies, and where those technologies might be going next.
Remember when we talked about the original PageRank algorithm and how this algorithm can be treated as unsupervised learning? We could make arguments against the early Google algorithm being what we consider Machine Learning today, but the reality is that the algorithm aimed to make sense of patterns in unstructured data around the web and put those patterns into a coherent output — we know them as SERPs. Detractors from this view will point out that the algorithm is only one part of the overall information retrieval structure and not by definition pure Machine Learning.
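To make the link-pattern idea concrete, here is a minimal sketch of PageRank computed by power iteration. The four-page toy web, the damping factor of 0.85, and the function name are our own illustrative assumptions; nothing here reflects Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeatedly redistributing rank.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy web: three pages link to "home", which links to "blog".
toy_web = {
    "home": ["blog"],
    "blog": ["home"],
    "about": ["home"],
    "contact": ["home"],
}
scores = pagerank(toy_web)
best_page = max(scores, key=scores.get)
```

In this toy web, "home" ends up with the highest score because it collects links from every other page. No labels were supplied; the ordering comes purely from the link structure, which is why the algorithm can be read as unsupervised.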
What can’t be argued is that Quality Score, arguably the economic engine of Google, is a Machine Learning–based program. This score uses a large number of signals to predict future outcomes. For well-established accounts with history, the largest signal in this mix is click-through rate (CTR). A myriad of other signals make up the rest of the Quality Score mix, but based on the efficiency of Quality Score and its staying power, SEO sleuths could see the writing on the wall: Google would eventually take a “smarter” approach to its overall algorithm.
Enter RankBrain. RankBrain is a Machine Learning AI system whose existence Google confirmed in October 2015. RankBrain isn’t the algorithm but one part of it, a part that Google has stated is the third most important ranking factor today, behind links and content.
RankBrain’s real power is understanding never-before-seen queries and delivering results based on its understanding of the term. RankBrain maps search queries into “distributed representations,” in which linguistically similar queries sit close to each other. Put simply, RankBrain is trying to predict what people “mean” in their search queries by looking at their intent.
Google’s RankBrain can “self-learn” and form new word associations over time. This ability lets it work faster than a human could “train” the program to recognize linguistic similarities.
The principles of RankBrain are similar to those of the word2vec tool, so while RankBrain is a bit of a closed box beyond what Google has disclosed, we can get a better idea of its structure by looking at this comparable AI tool.
As an open-source toolkit, word2vec uses a text corpus to produce vector representations of words and phrases, which makes it possible to measure the distance between them. The closer two vectors are, the more similar the meanings of the words they represent.
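Here is a small sketch of what "closer vectors mean closer meanings" looks like in practice, using cosine similarity as the closeness measure. The three-dimensional vectors below are hand-made for illustration only; real word2vec vectors are learned from a corpus and typically have hundreds of dimensions, and the helper names are our own.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made vectors; not output from a trained model.
toy_vectors = {
    "sneakers": [0.9, 0.8, 0.1],
    "trainers": [0.85, 0.75, 0.2],
    "mortgage": [0.1, 0.2, 0.95],
}


def most_similar(word, vectors):
    """Return the other word whose vector points in the closest direction."""
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


closest = most_similar("sneakers", toy_vectors)
```

With these toy values, "sneakers" lands next to "trainers" and far from "mortgage." A system that has never seen one of these terms in a query can still lean on its neighbors in vector space, which is the intuition behind handling brand-new queries.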
According to Google, no one way exists to optimize content for RankBrain specifically. Further, the meat and potatoes of RankBrain’s effects will be found in longer-tail terms, where you’re less likely to be fishing. The main intent for RankBrain is to help with brand-new queries.
However, the one place where RankBrain has the largest effect is content creation. The idea of keyword density having a major effect on search died long ago, and RankBrain is the final death knell, putting the focus on comprehensive content that provides immense value to readers. This guide is our own example of that kind of overwhelmingly informative piece of content on a subject.
With RankBrain launched, and given the role that Quality Score plays in its advertising system, Google is clearly moving quickly toward innovation in Machine Learning to keep its market lead in the world of search. The next step for SEO practitioners is to look at what the company’s patents tell us about its future intentions in this area.
Filed in 2012, one such patent, for classifying data using a hierarchical taxonomy, describes a method and system for classifying documents. The use case that Bill Slawski attributed to the model at Google is customer support: a system that helps classify customer issues and route them to the correct channels. In this use case, Google would group related issues together. For example, payment processing would be a subcategory under which various issues would be classified. From this level of classification, the system would build a hierarchy of importance for the issues (also referred to as documents).
This same concept could be applied to ranking algorithms, and the patent gives us insight into how its ranking module works in the customer service use case:
In some implementations, the ranking module ranks document clusters according to one or more metrics. For example, the ranking module may rank the clusters according to the document quantity in each cluster. A cluster with many documents may represent a relatively significant topic, such as a product issue.
Need another example? The ranking module may rank the clusters according to an estimated time to resolution of an issue represented by the cluster. Issues represented by a cluster “software update,” for example, may typically be resolved faster than issues represented by a cluster “hardware malfunction.” Other possible metrics include a label assigned to the cluster, a number of documents in a cluster, a designated importance of subject matter associated with a cluster, identities of authors of documents in a cluster, or a number of people who viewed documents in a cluster.
One more example for you: A cluster that represents an issue that historically has taken a longer time to resolve may be ranked higher than a cluster representing an issue with a shorter historical time to resolution.
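The ranking rules quoted above can be sketched in a few lines. The cluster labels come from the examples in this chapter, but the document counts, the resolution times, and the class and function names are hypothetical values we made up for illustration; the patent does not specify an implementation.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    label: str
    document_count: int
    avg_days_to_resolve: float  # historical average; hypothetical values


# Made-up numbers, used only to show how the two orderings differ.
clusters = [
    Cluster("software update", document_count=120, avg_days_to_resolve=1.5),
    Cluster("hardware malfunction", document_count=45, avg_days_to_resolve=9.0),
    Cluster("payment processing", document_count=80, avg_days_to_resolve=3.0),
]


def rank_by_document_count(items):
    """Clusters with more documents rank higher."""
    return sorted(items, key=lambda c: c.document_count, reverse=True)


def rank_by_time_to_resolution(items):
    """Clusters that historically take longer to resolve rank higher."""
    return sorted(items, key=lambda c: c.avg_days_to_resolve, reverse=True)


by_count = [c.label for c in rank_by_document_count(clusters)]
by_time = [c.label for c in rank_by_time_to_resolution(clusters)]
```

The same clusters come out in a different order depending on the metric: "software update" leads by document count, while "hardware malfunction" leads by time to resolution. Swapping in a different metric only means changing the sort key.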
This patent shows that Google has done internal work to classify and rank content based on the learned value of documents. Bounce rates and click-through rates (CTR) could take the place of “estimated time to resolution” as data points, which would quickly make this an effective addition to the ranking algorithm.
Topics and content that match what the system perceives to be high-quality content, based on past user-based signals, could float to the top of the results. In many ways, Google could use this type of document-based understanding as a citation-based replacement for links. Brands and websites would get value based on how much internet users talk about them online.
Since the system is constantly “learning” from its generated outputs and adding new “training” from these learning experiences, a reinforcement learning system like this one, focused on organic search, would simply need time to become efficient enough to be added to the ranking algorithm.
Since Google has made Machine Learning a priority, organic search tool providers must do the same to keep their customers armed with the right technology for the changing market.
Searchmetrics created a tool called “The Searchmetrics Content Experience.” This tool uses reinforcement learning to help SEO managers and content creators within an organization collaborate and create better content.
In this tool set, marketers are able to research topics, optimize content, and measure results.
Searchmetrics uses reinforcement learning to deliver related topics in its Topic Explorer tool, and it uses NLP to create a group of nodes and edges around related terms.
The optimization tools in the tool set deliver help and support to writers through interactive parameters and live content scoring. Writers learn in real time how to create the most effective piece of content possible. These efforts are all based on reinforcement learning around 250 billion continually updated data points in the Searchmetrics index.
What Searchmetrics has effectively created is a tool that has SEO managers thinking, collaborating, and creating content around concepts that are similar to what RankBrain is doing to answer questions for new queries.
The rise of RankBrain, the feelers we’ve gotten from Google on future Machine Learning plans, and the tools being created in the SEO space make one point clear: Comprehensive content is future-proof SEO.
Content is still commoditized, and getting content at scale is difficult and expensive. Great content creates great links, which have a real impact on today’s algorithm as the number-one factor in search rankings. Great content also yields value with systems such as RankBrain, which are likely to surface comprehensive articles for intensely specific queries.
Based on the other concepts we looked at in this chapter, content that garners quality signals from search visitors is also likely to have an increasing impact in the future, especially as Google becomes able to associate those data points with content, learn the meaning of that correlated data, and apply it to ranking results.