Similarity-Based Search in Recommendation Systems
Recommendation systems are an integral part of online services, driving customer engagement, increasing sales, and providing users with personalized content. Whether it's suggesting movies or shows on a media streaming platform, recommending products on an online retailer's site, or presenting users with potential matches on a dating platform, these systems often revolve around one core concept: similarity. Understanding and calculating similarity is fundamental to any recommendation engine, and modern graph databases offer unique advantages in handling the complexity of similarity searches.
In this post, we will explore the importance of similarity in recommendation systems, the technical challenges that arise in measuring it, and why graph databases are uniquely equipped to tackle these challenges. Using examples from everyday life, we’ll break down how similarity is calculated, why dimensionality complicates things, and how graph-based approaches can revolutionize recommendation systems.
What is Similarity in Recommendation Systems?
At its core, similarity in recommendation systems is about identifying items that are related or "similar" to an item that a user has engaged with.
There was a common meme about Amazon's recommendations a few years ago where if you bought, say, a toilet seat, it would start recommending you additional toilet seats (such as different brands, colors, or even the same product from different sellers).
While not exactly the most effective upselling strategy, this is a very good example of similarity search. Other bathroom products really are very similar to the one you just purchased, and these similarities occur in many dimensions: form, function, branding, product naming, pricing, affinity with other purchases, and so on.
Of course, today, Amazon is much more likely to show you related products like faucets, cleaning wipes, or a bidet. These are items that are similar, not because they're in the same product category, but because they share some affinity (i.e. often seen together) with the first product.
Beyond retail, the concept of similarity plays a role in nearly every industry, from the more obvious examples such as online dating and media streaming, to some more obscure ones such as fraud detection and even tumor detection in medical imaging.
In general, this similarity metric is the key capability that allows machine learning systems to determine what actions are likely to increase the probability of a desirable outcome, and what actions might decrease the likelihood of an undesirable one.
Measuring Similarity and Clustering
The challenge in recommendation systems is determining how to measure similarity in a meaningful way. For humans, it's easy to compare two objects based on observable traits like color, shape, or texture. However, for computers, this process is more complicated. Computers rely on digitized data fields (text or numeric values) but lack the inherent understanding of the context behind those values, the semantic meaning, that is required to properly weight each field when considering similarity. For example, while a human can intuitively understand that black and dark brown are closer in color than black and light blue, a computer might require specialized algorithms to tell it how to quantify that similarity.
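To make that concrete, here is a minimal sketch of how a program might quantify the color intuition by treating colors as points in RGB space and measuring the straight-line distance between them. The specific color values are illustrative, not taken from any real catalog:

```python
import math

# Illustrative RGB values (assumed for the example, not from any product data).
colors = {
    "black":      (0, 0, 0),
    "dark brown": (60, 40, 20),
    "light blue": (173, 216, 230),
}

def rgb_distance(a, b):
    """Euclidean distance between two RGB triples: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(rgb_distance(colors["black"], colors["dark brown"]))  # ~75: fairly close
print(rgb_distance(colors["black"], colors["light blue"]))  # ~360: far apart
```

The point is not the specific formula; it is that someone had to decide how to turn "color" into numbers and how much a unit of color difference should matter relative to, say, a dollar of price difference.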
This is where clustering comes into play.
Clustering algorithms help identify groupings of items where items in a group are more similar to other items in the group than they are to items in other groups. Clustering algorithms do this by analyzing shared attributes across different dimensions. For example, if a product’s color is "black," a clustering algorithm might reveal that most black items tend to fall within a certain size range, price bracket, or category. By identifying these groupings, recommendation systems can refine their suggestions to show users items that are likely to be relevant to their preferences along those dimensions.
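As a small illustration, here is a hedged sketch using scikit-learn's KMeans on made-up product feature vectors. Real systems use far richer features and careful scaling, but the grouping idea is the same:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical product feature vectors: [price, size, is_black].
# The values are invented purely to illustrate the idea.
products = np.array([
    [19.0, 2, 1],
    [21.0, 2, 1],
    [18.5, 3, 1],
    [250.0, 8, 0],
    [240.0, 9, 0],
])

# Group products so that items within a cluster are more similar to each
# other than to items in other clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(products)
print(labels)  # e.g. [0 0 0 1 1]: small, cheap black items vs. large, expensive ones
```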
(The Curse of) Dimensionality
As we add more dimensions to measure similarity—such as color, size, price, brand, etc.—we run into a problem known as the curse of dimensionality. This refers to the phenomenon where, as the number of dimensions increases, the distances between objects in the data space become increasingly uniform: the nearest and farthest neighbors of any given point end up almost equally far away. This makes it difficult to effectively differentiate between objects, even if they seem similar along several dimensions.
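You can observe this effect directly. The following sketch, using random data purely for illustration, compares the nearest and farthest distances from a query point as the number of dimensions grows; the ratio creeps toward 1, meaning "closest" and "farthest" become nearly indistinguishable:

```python
import numpy as np

rng = np.random.default_rng(0)

# As dimensionality grows, the nearest and farthest neighbors of a query
# point end up at nearly the same distance, so "closest" loses meaning.
for dim in (2, 10, 100, 1000):
    points = rng.random((1000, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    print(dim, round(dists.min() / dists.max(), 3))
# Typical output: the min/max distance ratio climbs toward 1.0 as dim increases.
```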
High-dimensional data is also expensive to process, both in computation and in storage. Traditional database structures struggle with the complexity of managing searches over high-dimensional relationships: these queries quickly turn into full heap scans of the raw data, where the database gives up on using an index and simply brute-force reads the entire dataset. Naturally, this results in poor performance, since your data system now goes through the entire haystack of data to find a few needles, instead of accessing those needles directly (which is what the index is supposed to help with).
And this is the part where we talk about graphs.
How Graph Databases Solve the Dimensionality Problem (for Similarity Measurement)
When it comes to distance, graphs offer an alternative approach besides distance-in-space: they let you measure distance using the number of edges you'd have to traverse—or relationships you'd have to cross—in order to get from one item to another.
If you've ever played Six Degrees of Kevin Bacon, you've been dabbling in searching for shortest paths on a graph!
More generally, distance between two points on a graph is either the number of edges forming the shortest path in an unweighted graph, or the sum of the weights of those edges in a weighted graph.
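As a quick illustration, here is how that distance can be computed with the networkx library on a toy graph; the nodes and weights are made up for the example:

```python
import networkx as nx

# A tiny collaboration graph (edges are purely illustrative).
G = nx.Graph()
G.add_edges_from([
    ("Kevin Bacon", "Actor A"),
    ("Actor A", "Actor B"),
    ("Actor B", "Actor C"),
])

# Unweighted graph distance: the number of edges on the shortest path.
print(nx.shortest_path_length(G, "Actor C", "Kevin Bacon"))  # 3

# With edge weights, the distance is the sum of weights along the cheapest path.
H = nx.Graph()
H.add_weighted_edges_from([("a", "b", 2.0), ("b", "c", 0.5), ("a", "c", 5.0)])
print(nx.shortest_path_length(H, "a", "c", weight="weight"))  # 2.5, via b
```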
For a concrete example, a product might be linked to several attributes such as color, size, and price. These relationship edges allow the recommendation system to traverse the graph and find all other items that share similar attributes.
Consider searching for a product that is black, medium-sized, and costs between $15 and $20. In a graph database, the system can quickly traverse nodes connected to the three nodes representing these specific attributes and return a list of products that match the query—that are directly connected to all three nodes. The processing is fast, since you only need to cover the product nodes that are directly connected to each of the three nodes. It is also accurate and exhaustive, since you’re sure to get all products that are connected to those nodes. And finally, it is minimal—this search doesn’t require scanning the entire dataset—only the relevant nodes that are actually connected to the properties we care about.
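A minimal sketch of that traversal, with the graph boiled down to attribute nodes and the sets of products directly connected to them (all names and data are illustrative), looks like this:

```python
# A toy property graph: each attribute node keeps the set of products
# directly connected to it.
attribute_to_products = {
    "color:black": {"p1", "p2", "p5", "p9"},
    "size:medium": {"p2", "p3", "p5"},
    "price:15-20": {"p2", "p5", "p7"},
}

def products_matching(*attributes):
    """Visit only the product nodes attached to the given attribute nodes
    and keep those connected to all of them; no full-dataset scan."""
    connected_sets = [attribute_to_products[a] for a in attributes]
    return set.intersection(*connected_sets)

print(products_matching("color:black", "size:medium", "price:15-20"))  # {'p2', 'p5'}
```

Only the products attached to the three attribute nodes are ever touched; the rest of the catalog is simply never read.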
Example: Birds of a Feather Shop Together
In addition to showing items that are similar to the one a user is currently viewing, recommendation systems can also leverage similarity between users.
For example, Amazon might notice that you recently bought these five items:
Toilet paper
A screwdriver
Dish soap
Dishware
A set of assorted kitchen items
Instead of simply looking for other users who purchased these exact items, the system can find users who shopped in the same categories: users who purchased items from the kitchen, toiletries, tools, and cleaning categories. Once those users are identified, the system looks at the other items they have purchased and broadens the recommendation search space to include those items, as well as other items in the same categories.
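Here is a toy sketch of that two-step expansion, with all users, items, and categories invented for the example: find users who shopped in your categories, then surface what they bought that you haven't:

```python
# Invented purchase data keyed by user.
purchases = {
    "you":   {"toilet paper", "screwdriver", "dish soap", "dishware", "kitchen set"},
    "user2": {"dish soap", "sponges", "oven mitts"},
    "user3": {"screwdriver", "drill", "toolbox"},
    "user4": {"novel", "bookmark"},
}
categories = {
    "toilet paper": "toiletries", "screwdriver": "tools", "dish soap": "cleaning",
    "dishware": "kitchen", "kitchen set": "kitchen", "sponges": "cleaning",
    "oven mitts": "kitchen", "drill": "tools", "toolbox": "tools",
    "novel": "books", "bookmark": "books",
}

def recommend(user):
    my_categories = {categories[item] for item in purchases[user]}
    # Step 1: find users who shopped in the same categories (similar users).
    similar = [u for u, items in purchases.items()
               if u != user and my_categories & {categories[i] for i in items}]
    # Step 2: recommend what they bought that the current user has not.
    return set().union(*(purchases[u] for u in similar)) - purchases[user]

print(recommend("you"))  # e.g. {'sponges', 'oven mitts', 'drill', 'toolbox'}
```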
The actual recommendation algorithms used on these online stores can get much more sophisticated than this, but one of the key mechanisms they rely on is the ability to measure affinity, which is like a suggestion of a relationship, even if an explicit relationship may not exist (or be known to the system).
Whether this approach is applied to online shopping, streaming media content, or even online dating, this relevancy expansion helps provide users with a more diverse selection of offerings, improves their overall experience, and potentially even encourages them to spend more time on the platform.
However, there is a common challenge in building this kind of system: 2nd, 3rd, and higher-degree connections are expensive to measure in traditional data systems like relational databases. They require a sequence of expensive (and often recursive) JOIN operations, whose cost grows with the size of the input set (the space of possibilities, or all products for sale on an online retailer's platform, for example), rather than the size of the result (the number of recommendations you want to make). This means that as the platform grows, the cost of each individual recommendation increases, until it is no longer sustainable to support. This characteristic is known as a diseconomy of scale.
This is where graph systems come in handy. Whether as a database or as a compute platform, graphs allow us to traverse these connections much more efficiently, looking only at data that is actually related to the input data set (an item purchased, in our example).
Of course, not all graph data systems do this well. In fact, many graph systems are themselves built on top of relational data systems, which means they suffer the same performance challenge. But they don't have to be built that way: Graphium Labs' HyperGraph, for example, is specifically built to avoid having to scan data that isn't directly relevant to the operation performed.
Fast Updates and Real-Time Recommendations vs. Batch Training
Another advantage of graph databases is their ability to handle real-time updates.
Since Deep Learning/Machine Learning models can be expensive to train, we often rebuild them nightly or weekly, replacing the whole model whenever a new version is ready. While this allows us to leverage very advanced recommendation algorithms, it also means that the data behind the recommendations is already stale by the time the model is deployed to production. This staleness means there is a lag between the time new products are added or viral trends emerge and when they are reflected in the recommendations. This can get costly: losing out on viral trends in social media, failing to recommend a trending or hot product, or missing a new type of fraud or attack as it's happening all have real bottom-line consequences for a business. A common mitigation is building additional filtering and enrichment systems on top of the recommendation system, which increases cost and complexity, and makes cause and effect much harder to reason about when things go wrong.
In contrast, graph databases are capable of updating individual entities and relationships in real time. As soon as a new item is added, it becomes immediately connected to other items that share its attributes. This capability is particularly valuable for fast-moving markets, allowing the system to capture viral trends and recommend relevant products instantly.
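A small sketch of what this looks like, using networkx and invented product and attribute names: the moment the new product's edges are written, a traversal can reach it, with no retraining step in between:

```python
import networkx as nx

# A running product graph; shared attribute nodes link related products.
G = nx.Graph()
G.add_edges_from([("p1", "color:black"), ("p1", "category:bathroom"),
                  ("p2", "color:black")])

# A new product arrives: insert it and its attribute edges in real time.
G.add_edges_from([("p_new", "color:black"), ("p_new", "category:bathroom")])

# It is immediately reachable: products sharing any attribute with p_new.
related = {p for attr in G["p_new"] for p in G[attr] if p != "p_new"}
print(related)  # {'p1', 'p2'}; no batch retraining required
```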
Sparse and Dense Data Patterns
Graph databases also excel at handling both sparse and dense data patterns. In a densely populated area like New York City (population: 8.3m), a dating app might be able to recommend 30 highly similar profiles within a narrow similarity threshold. However, in a less populated area, such as Fairbanks, Alaska (population: 32k), the same threshold might only yield five matches.
When this happens, you have to choose between rerunning the search query with expanded thresholds (e.g. allowing up to 20% variation instead of the default 5%) or showing only partial recommendations (e.g. 5 results instead of 30). Rerunning queries increases both operational cost and service latency, while returning partial results risks a subpar user experience and leaves engagement on the table (i.e. missed opportunities for desirable outcomes, such as additional purchases or further user engagement).
Another common approach is to build a second model that decides the correct similarity threshold for different "regions" of data (in the online dating case the meaning of "region" is literal, but in other domains it can be harder to pinpoint). Unfortunately, this approach suffers the same challenges discussed in the previous section: it can be slow to adapt to rapid changes or new and unfamiliar patterns, and it adds a significant amount of complexity to the system.
On the other hand, graph databases offer the flexibility to adjust the search radius dynamically, expanding the graph traversal to find additional recommendations without sacrificing performance. You simply keep following more connections until you've collected the desired number of results.
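A minimal sketch of that idea, as a breadth-first traversal over a toy adjacency list (the data is invented), might look like this:

```python
from collections import deque

def recommend_at_least(graph, start, k):
    """Breadth-first traversal that keeps following connections outward
    until at least k candidate nodes have been collected (or the graph is
    exhausted), instead of rerunning a query with a wider threshold."""
    seen, results = {start}, []
    queue = deque([start])
    while queue and len(results) < k:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                results.append(neighbor)
                queue.append(neighbor)
    return results[:k]

# Toy adjacency list; when nearby matches run out, the traversal simply
# reaches one hop further instead of issuing a new, broader query.
graph = {"you": ["a", "b"], "a": ["c", "d"], "b": ["e"], "c": ["f"]}
print(recommend_at_least(graph, "you", 5))  # ['a', 'b', 'c', 'd', 'e']
```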
This adaptability ensures that users in any environment—whether they’re searching for products or potential matches—always have a full set of recommendations to explore. In the fraud detection case, this capability helps reduce the chances of missing actions that are still detectable to a human observer, but just different enough that a computer system doesn't flag them as suspicious. In both the positive (recommendation) and negative (fraud or risk detection) cases, the ability to efficiently find similarities and follow up on 2nd and higher degree relationships is a real competitive advantage.
Conclusion: Graph Databases for Recommendation Systems
Graph databases excel at solving the similarity problem directly, because they:
Allow us to provide results quickly
Allow us to arbitrarily get as many results as we need
Allow us to continuously update the data in the graph, ensuring we always serve fresh and current results
Allow us to use the same algorithms for different categories of products or users
Allow us to use the same algorithms for small data and large data
Are not sensitive to degradations due to high dimensionality
Side-step the issue of teaching a computer how to subjectively weight feature dimensions when comparing different sets of objects (a problem akin to teaching "real meaning" to a computer, something even LLMs don't really solve)
In other words, graph data systems offer a robust way to approach the similarity or recommendation problem. Whether your business needs product suggestions, profile matching, or flagging suspicious behavior, it would be worthwhile to look into how graph-based solutions compare with the typical approach of machine learning over relational datasets.