In Google’s I/O Future of Search talk, Ben Gomes said one of the largest initiatives for the future of search is leveraging machine learning to achieve natural language processing (NLP) and reduce the friction of language.
While there are many methods to achieve semantic understanding, TF-IDF is one of the more public-facing and long-standing methods that search engines use to better understand the intent and numerous meanings behind a query.
WHAT IS TF-IDF?
Term Frequency-Inverse Document Frequency is a weighting method that calculates the weight (or importance) of a word in a document relative to a corpus of documents that are relevant to that word.
As it relates to search, engines score a term based on how often it appears on a page versus how often it appears across the other pages in the corpus. They then rank it against the other terms that frequently appear across those pages. The higher the TF-IDF score, the more often the term appears on that page and the rarer it is across the rest of the corpus.
- Term Frequency: How often the word appears in the document
- Inverse Document Frequency: A weighting that discounts words that appear in many documents (stop words such as “and”, “the” and “or” are typically disregarded) and prioritizes distinctive words that appear in fewer documents.
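To make the two parts concrete, here is a minimal Python sketch of one common textbook form of the score: term frequency (count divided by document length) multiplied by the natural log of total documents over documents containing the term. The three-sentence corpus, the naive tokenizer and the function names are assumptions made up for illustration; this is not how any particular search engine or tool computes its scores.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(corpus):
    """Return one {term: score} dict per document in the corpus."""
    docs = [tokenize(text) for text in corpus]
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # TF (share of the document) times IDF (log of corpus size over document frequency).
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical mini-corpus, invented for this example.
corpus = [
    "the search engine ranks the page",
    "the corpus contains the page and the query",
    "the stop words are the most common words",
]
scores = tf_idf(corpus)

# Print the terms of the first document, highest score first.
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus, “the” appears in every document, so its inverse document frequency is log(3/3) = 0 and its score drops to zero no matter how often it is used. “Engine” appears in only one document and scores higher than “page”, which appears in two.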
The term frequency-inverse document frequency score is just one of the numerous topic modeling methods search engines may use to determine which words and phrases are important to a particular topic.
Google’s John Mueller has noted that the corpus TF-IDF pulls from is the entire web (thus making the frequency of a term less specific) and that it is a method that is used in ALL information retrieval, not just search. He recommends that businesses focus on users and on creating useful content (what counts as useful is still to be determined).
WHY TF-IDF ANALYSIS IS STILL RELEVANT
Even with the conversation that TF-IDF is one of many topic modeling methods (and not a very important one), this content creator still thinks there is value in understanding and knowing the common words that appear across the web for any given topic.
In content optimization basic keyword insertions are not cutting it. Content needs to go beyond the keyword to the topical universe the words reside in. Term Frequency is just one of the many tools we have to help SEOs get a clear view of that universe. Check out Mike King’s SearchLove presentation to learn more about search engines and natural language processing.
Content creators can use TF-IDF to understand which pages are relevant to the topic they are trying to create or optimize. TF-IDF also allows writers to examine the common words and language used to describe a concept or service. This is not about simple keyword insertion or trying to game search; it’s bigger than that. TF-IDF and other topic modeling methods allow us to look at the universe of a topic and build a vocabulary to fully describe that topic.
HOW TO USE TF-IDF
So how can you use TF-IDF as a content optimization and keyword expansion tool? Easy. Thankfully, there are a slew of tools on the market that do the work of parsing pages and retrieving the TF-IDF score of common words. We review the pros and cons of the tools below, so be sure to scroll down.
For the most part, each tool allows a user to enter a term, select a search engine (Google is the default) and get results of common words and phrases (and their scores), based on the corpus of documents that include that term. For the purpose of this walk-through, we used SearchMetrics Content Experience, an enterprise-level content tool.
1. Set Up Brief & Target Keyword
To start, create a brief under your project and identify the topic. We created a brief with the topic TF-IDF to analyze this blog post for the target phrase TF-IDF. (Why not kill two birds with one stone?)
Once your brief is created and available (SearchMetrics will notify you), you can start to analyze your content.
Your Content Editor is your dashboard for each individual brief. You post your content/text and analyze a host of different factors including:
- Keyword Usage
- Content Elements
- Competitor URLs
Content Experience is a robust tool that allows for multiple projects, dashboards and briefs for a team of content creators.
We added this blog post to our brief and already have suggestions based on TF-IDF scores on how to improve.
2. Review the Keywords
Based on the topic of your brief and the content in the editor, Content Experience parses the documents of the ranking URLs to see which terms they are using on their pages that you are not. These are Must-Have Keywords. This list shows you how often you are using a term vs. how often you should.
It also provides you with alternative terms to incorporate to satisfy the weighting. For example, our post was using “term frequency” 3 times out of a recommended 7. We could incorporate suggestions such as “TF” (short for “Term Frequency”) or “Term Frequency and Weighting”.
This also shows you which terms you are over-indexing on. For example, we are using “IDF” and “TF” too frequently and want to scale back on those terms in favor of some of the under-utilized ones. However, it appears that we are using “corpus”, “search engine” and “stop words” at the right frequency.
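If you want to sanity-check this kind of report by hand, the comparison can be sketched in a few lines of Python. The draft text, the target counts for “TF” and “corpus”, and the function names below are hypothetical, chosen only for illustration (the target of 7 for “term frequency” mirrors the recommendation mentioned above). This is a rough approximation, not how Content Experience calculates its recommendations.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of a phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def usage_report(text, targets):
    """Compare actual phrase counts in the text against target counts."""
    report = {}
    for phrase, target in targets.items():
        actual = count_phrase(text, phrase)
        if actual < target:
            status = "under"
        elif actual > target:
            status = "over"
        else:
            status = "on target"
        report[phrase] = (actual, target, status)
    return report


# Hypothetical draft copy and target counts, invented for this example.
draft = (
    "Term frequency measures how often a word appears. "
    "TF is short for term frequency. TF and IDF combine into TF-IDF. "
    "A corpus is the set of documents being compared."
)
targets = {"term frequency": 7, "TF": 1, "corpus": 1}
report = usage_report(draft, targets)

for phrase, (actual, target, status) in report.items():
    print(f"{phrase}: used {actual}, target {target} ({status})")
```

Note that the whole-word match also counts the “TF” inside “TF-IDF”, because the hyphen acts as a word boundary. Whether that is the behavior you want depends on how your tool of choice treats hyphenated terms.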
Recommended and Additional Keywords are lists of terms that your competitors are using but that are not as crucial as Must-Have Keywords. Include Recommended terms where they make sense, and Additional terms only where no other word fits.
3. Optimize Your Brief
Based on the recommendations, we updated several of the mentions of TF-IDF and other related terms to improve the weighting of the post.
Changes made included:
- Changing all instances of TF*IDF to TF-IDF (the hyphen made a huge difference)
- Adding “search” wherever “engine” was mentioned by itself
- Using the full phrase (term frequency-inverse document frequency) where TF-IDF was overused
As a result, the Content Score improved from 79% to 89% and the Keyword Coverage improved from 49% to 61%. We did this while maintaining sentence structure and losing only 1 point in Readability. Based on the recommendation within the Content Experience brief, a score of 75% was sufficient to be competitive, so we concluded that the post was highly optimized at 89%. Additionally, since this is a technical topic, losing a point off Readability was not a great loss, since other metrics like sentence structure and word count remained fairly unchanged.
4. Review & Compare the Competitor Pages
The competitive edge that TF-IDF gives cannot be overstated. By looking at term frequency against ranking pages, SEOs can see which competitors are over- or under-optimizing for a particular keyword and where they may be able to gain a competitive advantage.
In Content Experience, you can view competitor pages based on your brief topic. For TF-IDF, there were 18 pages, many reputable and authoritative sites. So how does an SEO worth their salt leverage this information?
They click through and perform a manual analysis of the ranking pages! They review these pages structurally and thematically to see if there is anything they are missing. Some things to look for include:
- The length of content (which Content Experience provides)
- How detailed are the sections?
- What information can your company provide that your competitors cannot?
For terms that your page does not use frequently, glean how ranking pages are using them.
- Do they have entire, dedicated sections?
- Is there a nomenclature that they are following?
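One lightweight way to start that review is to pull out every sentence on a ranking page that mentions the term, so you can read the usage in context. The sketch below assumes you have already saved the page copy as plain text. The URLs, the sample text and the function name are placeholders invented for illustration, and the sentence splitting is deliberately naive.

```python
import re


def sentences_with_term(text, term):
    """Return the sentences in the text that mention the term (case-insensitive)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Placeholder URLs and page copy, invented for this example.
pages = {
    "https://example.com/a": "Stop words are filtered first. Then scores are computed.",
    "https://example.com/b": "We rank pages by score. Common stop words add little signal!",
}
usage = {url: sentences_with_term(text, "stop words") for url, text in pages.items()}

for url, sentences in usage.items():
    print(url)
    for sentence in sentences:
        print("  -", sentence)
```

Reading the matched sentences side by side makes it easier to spot whether competitors give the term its own section or follow a shared nomenclature.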
Ryte’s tool also allows you to compare unpublished content to the targets and competitors. It looks at the text you add to their Content Editor and compares it against the terms related to your target keyword. The yellow dot shows how your content fares against the TF-IDF target for each phrase. The targets are based on the corpus of other pages.
We can also see that Onely mostly evenly distributes frequency across terms, where some of the other pages over-index on the more relevant phrases.
Keep in mind that this information is directional and no amount of lever pulling will ensure that your page will rank. Examine all of the relevant pages and read their copy. Take stock of how comprehensive, well-structured and well-linked their content is and devise a plan to optimize your page to be on an even playing field with the top rankers.
5. Discover Your Topic Universe
Another great feature of Content Experience is its Topic Explorer. It visually displays related topics in an interactive relationship chart. Click through terms to expand them and add them to your current brief.
If you add additional topics, be sure to go back through the process to optimize your content for the right term frequencies. You can view these topics by:
- Semantic Association
- Search Intent
As you edit your content, using a tool to leverage TF-IDF can help you build a page that goes beyond just keywords. Your content will take into consideration all of the related terms in your topic’s universe and be more robust and better poised to compete in the universe of pages.
Here are some great tools that are helping to bridge that gap:
SearchMetrics Content Experience is our preferred enterprise tool. It allows for agile content development, optimization, streamlined workflow, and approval processes. TF-IDF is baked into the framework and it takes the guesswork out of content optimization.
OnPage offers a suite of optimization tools including their Content Success tools. Their unique TF-IDF offering allows users to see keyword recommendations at a detailed level, compare pages at every step as well as copy content into their Optimize feature and see which terms your content is lacking.
A program download, Website Auditor comes with a set of programs including Link Assistant, Rank Tracker and SEO SpyGlass. Website Auditor has a 7-day trial and provides keyword input and target pages to give users a TF-IDF analysis complete with scoring, ranking pages and content recommendations.
A free tool for 3 cases, SEObility provides its analysis in chart format and gives SERP details. It also allows users to view rankings from desktop and mobile.
How are you using TF-IDF in your content optimization / creation efforts?