The R&D lab for impossible architecture
Next-Gen LLM Runtime and Semantic Search
Graphium Labs
Our thesis
Graphium Labs is a research and development lab focused on rethinking the foundations of software infrastructure and data systems in order to achieve step-change gains in efficiency, scalability, and capability.
We challenge the core assumptions that modern systems are built on, not to iterate on existing designs, but to build something better where those designs break down under pressure. Much of today's infrastructure is an accumulation of incremental fixes made over years and decades, carrying forward choices made under circumstances that no longer hold, and those choices contribute to today's poor scaling and efficiency characteristics.
Our work centers on identifying these structural limits and designing fundamentally new approaches to computation, data, and system architecture. Our guiding philosophy is to ask what these systems would look like if we invented them today, with only today's and tomorrow's constraints to solve for. In execution, we then focus on the best ways to bridge the gap between what exists and what should exist, in an economically viable manner. The result is a set of fundamentally new approaches to today's and tomorrow's problems, built with the resources available today, and with the efficiency, scalability, and capability required of the next era of computing in AI and information systems.
Lab Notes - What We’re Working On
-
Our premise is simple: Search should return good results based on what your query means. It should behave more like a research librarian that is well-versed in the domain of your query, and a lot less like a ranking or recommendation engine.
Unfortunately, today’s state-of-the-art search engines are optimized for returning one or two results you’re likely to engage with, rather than results that actually match the intent of the query. This is the direct result of building search on top of recommendation (including vector similarity search) and text-matching engines (such as Solr), with very little in the way of true semantic processing capability.
We believe that (1) a good search system should return results based on what you meant (semantics), not just what you typed (text matching), and certainly not just based on what drives engagement (recommendation/advertising), and (2) that this is possible to achieve with today’s hardware. That is, the limit is not one of processing capacity, but rather the combination of a lack of strong semantic processing capabilities and building on the wrong kind of search infrastructure.
Instead of surface similarity or text matching, next-gen relevance search should rely on true semantic parsing. And instead of building on top of recommendation and top-k retrieval engines, which are designed to return a small number of high-engagement results on the first page with no guarantee of exhaustiveness across the remaining pages, next-gen search should be capable of returning the full set of relevant information for any ad-hoc query, while minimizing the noise the user has to filter through manually afterwards. This is a form of query-defined (ad-hoc) category retrieval, as opposed to the top-k, recommendation-based search that is common today.
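As a toy illustration of the difference, compare a top-k cutoff against retrieval by relevance threshold. The corpus, scores, and threshold below are invented; a real system would score documents with a relevance model rather than look them up in a table.

```python
# Hypothetical scored corpus: (doc_id, relevance). All values are illustrative.
def top_k(scored_docs, k):
    """Recommendation-style: return only the k highest-scoring documents."""
    return sorted(scored_docs, key=lambda d: d[1], reverse=True)[:k]

def category_retrieval(scored_docs, threshold):
    """Query-defined category retrieval: return every document whose
    relevance clears the threshold, i.e. the full relevant set."""
    return sorted((d for d in scored_docs if d[1] >= threshold),
                  key=lambda d: d[1], reverse=True)

scores = [("doc_a", 0.95), ("doc_b", 0.91), ("doc_c", 0.88),
          ("doc_d", 0.52), ("doc_e", 0.11)]

# Top-2 silently drops doc_c even though it is clearly relevant.
first_page = [d for d, _ in top_k(scores, 2)]

# Threshold retrieval returns the complete relevant set and excludes the noise.
full_set = [d for d, _ in category_retrieval(scores, 0.8)]
```

The point is not the scoring function, which is the hard part, but the retrieval contract: exhaustive within the query's category rather than optimized for a few engaging hits.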
To address both of these limits, we designed a special neural semantic graph processor that takes text input through several stages of syntactic and semantic processing and then builds a novel neural semantic graph signature that represents meaning. The neural semantic signature is indexable and comparable, allowing us to leverage traditional information retrieval data architectures, which scale significantly better than vector and graph search. This enables a semantic search system that is both significantly better at identifying and matching meaning and significantly more scalable and efficient than the alternatives.
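The actual signature format is not public, so the following is only a generic sketch of why an indexable signature matters: if a signature can be decomposed into discrete, hashable features, matching reduces to posting-list lookups in a classic inverted index instead of a scan over dense vectors. The feature names and overlap scoring here are invented for illustration.

```python
from collections import defaultdict

def build_index(doc_signatures):
    """Inverted index: each discrete signature feature maps to the set of
    documents whose signature contains it."""
    index = defaultdict(set)
    for doc_id, features in doc_signatures.items():
        for f in features:
            index[f].add(doc_id)
    return index

def match(index, query_features, min_overlap):
    """Return docs sharing at least min_overlap signature features with the
    query, touching only the posting lists for the query's features."""
    counts = defaultdict(int)
    for f in query_features:
        for doc_id in index.get(f, ()):
            counts[doc_id] += 1
    return {d for d, c in counts.items() if c >= min_overlap}

# Invented example features, loosely role/topic shaped:
docs = {
    "d1": {"AGENT:librarian", "ACTION:retrieve", "TOPIC:books"},
    "d2": {"AGENT:engine", "ACTION:rank", "TOPIC:ads"},
    "d3": {"AGENT:librarian", "ACTION:recommend", "TOPIC:books"},
}
idx = build_index(docs)
hits = match(idx, {"AGENT:librarian", "TOPIC:books"}, min_overlap=2)
```

Because lookups touch only the posting lists for the query's features, cost grows with the number of matching documents rather than with corpus size, which is the scaling property traditional IR architectures provide.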
Our initial focus has been on developing and improving the quality of semantic processing for meaning extraction and query-document relevance evaluation. While we continue to improve those semantic relevance capabilities, we have also begun building the surrounding indexing and retrieval infrastructure with early alpha partners.
-
The prevailing assumption today is that the only way to run Large Language Models is on expensive, large-scale GPUs. This stems from LLM architectures requiring large-scale vector operations performed in regular, uniformly sized blocks.
So we started asking whether the same result could be produced with a leaner architecture. For example, could we take advantage of the sparsity of model weights and run fewer, but more irregularly shaped, calculations? The irregularity would make them a poor match for GPUs but a good fit for CPUs. And if irregularly shaped calculations force us onto CPUs, could we also take advantage of other approaches that CPUs excel at, such as graph-based short paths, circular communication, region-based quantization, and more?
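The sparsity argument can be shown in miniature: if most weights are zero, storing only the nonzero entries turns a regular dense matrix-vector product into fewer, irregularly shaped operations that produce the same output. The tiny matrix below is made up; real model layers are orders of magnitude larger, and this sketch says nothing about our actual runtime's internals.

```python
def dense_matvec(rows, x):
    """Regular, uniformly shaped computation: every weight is multiplied."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]

def to_sparse(rows):
    """Keep only nonzero weights as (column, value) pairs per row."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in rows]

def sparse_matvec(sparse_rows, x):
    """Irregularly shaped computation: rows have different lengths, and only
    the nonzero weights are touched."""
    return [sum(w * x[j] for j, w in row) for row in sparse_rows]

W = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0],
     [0.0, 0.0, 0.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]

sparse = to_sparse(W)
dense_out = dense_matvec(W, x)
sparse_out = sparse_matvec(sparse, x)
# 3 multiply-adds instead of 12, identical result.
nonzero_ops = sum(len(r) for r in sparse)
```

The variable-length rows are exactly what makes this shape a poor fit for GPU warps and a natural fit for CPU control flow.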
The goal is not to corral a large number of CPUs to match the floating-point throughput and memory bandwidth of a GPU, but to eliminate computations, compress data, and reduce copying, so that our runtime can run an existing open-weight model at a similar or better price/performance ratio on CPU hardware, while maintaining full compatibility with the model's architecture.
We've established a baseline where our graph-based runtime matches the performance and quality of an existing CPU-based LLM execution engine on the same hardware. We've since applied a number of optimizations that double our tokens per second while maintaining quality parity (producing identical outputs), and we're continuing to optimize the runtime.
Our ultimate performance target is CPU-based inference at the same cost per token as GPU-based inference, within 10% of baseline perplexity. This would open up a wide range of hardware to act as LLM runtimes, dramatically shift the unit economics of token generation, and enable fully owned, self-hosted AI runtimes in areas where cost, security, privacy, or regulatory compliance currently inhibit adoption.
-
Our experience working on the internals of big data infrastructure has shown that data systems do not actually scale well. They operate well up to a certain scale, and then their performance degrades rapidly.
We needed a system that was flexible in the data types it supported and allowed data to be modelled in a variety of ways: relational, key/value, and graph. Additionally, our workloads vary from small, high-velocity, low-latency computations to large-scale computations over significant datasets.
This led us to design a graph-based computation engine that combines data and compute, supporting a variety of data types and varied workloads. Core properties include:
Minimal to no overhead for edge connection storage
Computations execute directly on the graph and are maximally parallel
Workloads, including system tasks, use a single execution system allowing for seamless online schema migrations
Support for a variety of physical and logical layouts (nodes and edges, columnar or row-based tables, keys and values, and more)
Support for a number of addressing, partitioning, and sharding schemes to avoid diseconomies of scale
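Our engine's internals are not public, but "minimal to no overhead for edge connection storage" has a well-known reference point: a compressed sparse row (CSR) adjacency layout, where all edges live in two flat arrays with no per-edge objects or pointers. This is a generic sketch of that standard layout, not our implementation.

```python
def build_csr(num_nodes, edges):
    """Build a CSR adjacency: offsets[i]..offsets[i+1] delimits the slice of
    `targets` holding node i's outgoing edges. edges: iterable of (src, dst)."""
    counts = [0] * num_nodes
    for s, _ in edges:
        counts[s] += 1
    offsets = [0] * (num_nodes + 1)
    for i, c in enumerate(counts):
        offsets[i + 1] = offsets[i] + c
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot per source node
    for s, d in edges:
        targets[cursor[s]] = d
        cursor[s] += 1
    return offsets, targets

def neighbors(offsets, targets, node):
    """Neighbor lookup is a contiguous array slice: no pointer chasing."""
    return targets[offsets[node]:offsets[node + 1]]

offsets, targets = build_csr(4, [(0, 1), (0, 2), (2, 3), (1, 3)])
```

Storage cost is one integer per edge plus one offset per node, and neighbor iteration is a sequential scan, which is also what makes the layout friendly to parallel traversal.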
The trade-off is that writing queries (computations) for our graph computation engine is not well served by existing graph query languages, such as SPARQL or GQL, and at this time the technology remains internal. But our in-house expertise allows us to use it across our innovations in semantic search and CPU based LLM runtime.
Why Now?
AI today is bottlenecked on inference capacity. That makes a higher-efficiency inference engine a no-brainer.
Good search remains as elusive as it has ever been, while the use cases and the market stand ready to adopt it the moment it becomes available.
Both of these are high value problems with the potential to change market fundamentals, and are therefore worth pursuing deeply.
Every Breakthrough Starts With a Better Question.
We are pre-seed. Two things are true about where we are in the work: we have found something real, and we are doing foundational work on search and AI runtime that we believe will sit underneath a significant portion of how AI systems are built over the next ten years.
We are studying the parts of search and AI runtime that get assumed rather than researched. What we are finding is that some of those assumptions are wrong in ways that matter, and that correcting them opens up a different class of opportunities.
If you have backed research-led companies before and you understand why the layer nobody is looking at today becomes the layer everyone depends on tomorrow, we would like to talk.
