A New NLP Paradigm: Peritus.ai’s Experience with GPT-4
Zsolt Tanko, ML Engineer
March 30, 2023
Here at Peritus we’ve spent the last few months reinventing most of our data science tech to leverage OpenAI’s GPT family of Large Language Models (LLMs). In this post we’ll briefly outline the business context for our Natural Language Processing (NLP) work – what data we take in and what problems we solve with it – and make a case for how applying pretrained LLMs is a paradigm shift that has brought our data science efforts into tighter alignment with product development.
What we do
We work with support communities, originally in the domain of open source software but now extending out to security, healthcare, Web3, and game development, where users are aiming to solve technical problems in their domain. These might be about software (or hardware) features, versions, bugs, errors and logs, or security issues and are typically specific to their use-case or organization’s workflow, for example the infrastructure their organization has in place and how best to integrate a given piece of software with it. Questions can be technical and dense with details, sometimes elaborated with code, error logs, and references to documentation, or can be broader, about how software systems can be configured and coordinated to meet their needs.
From an NLP perspective, this makes the content we process semantically very rich: lots of categories, entities, and intents. We pull in documentation, forum posts, systems of record, Slack, Discord, and Telegram, and parse out people, questions, and context. That all becomes input for our big-picture modeling challenge:
This guiding question decomposes into several moving pieces that each call for a bespoke use of LLMs that’s fit-to-purpose.
How we do it
We need to discover structure in our multimodal data sources and distill that into relevant signals that are – and this is the essential bit – matched to the needs and intentions of users and stakeholders. These use-cases tell us what needs to be salient, what structure we’re aiming to find. They define our product offerings, which is the first level of organization that shapes how we interact with data, and goes like this:
- Users want answers to their questions: what documentation or existing questions and answers are the most relevant to their own, and which experts in the community are best suited to help.
- Stakeholders want to know what topics people are interested in (ontology), where people are having issues (sentiment), and what trends are emerging (analytics). They’re also interested in the experiences of beginners, of experts, and tracking participation and churn in their communities.
The essence of both is discovering an ontology of technologies, products, uses, sentiments, and problems as they show up in actual discussions. With that in hand, it’s possible to extract which topics a question is about, which topics a user has expertise with, and which topics the community is engaged with.
In the past, we used several in-house NLP models developed for specific purposes like extracting technologies or discovering intents, for example retraining BERT language models on millions of StackExchange posts to create a model that could extract technical terms from a document. This work was challenging and costly, and our data scientists regularly landed on the same conclusion: if only our model were much more expressive and we had much more training data, this whole thing would be a lot better.
And that’s where OpenAI’s GPT-3 came in. A massive language model trained on an enormous corpus of text that we could outsource language processing to, it evaporated the need to reinvent a tool for processing language effectively and allowed us to focus on building tech specific to our business needs. After all, we’re interested in the concepts and relationships conveyed by natural language and in fitting those to our products, so the capacity to point our engineering and data science resources directly at that level of abstraction was a major shift.
Embeddings and semantics
Vector embeddings are the intermediate stage of an LLM’s processing of a text document. They’re the highly processed internal representation the LLM uses to produce its output, and they distill the semantic content of the text – the meaning, the concepts and their relationships, the intentions, the tone and style – after having removed the dependency on the natural language itself. With embeddings at hand, new operations make sense that didn’t for documents: we can ask for the distance between two embeddings, yielding information about how similar the documents are.
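To make the idea of distance between embeddings concrete, here is a minimal sketch using cosine distance, one common choice of metric. The post does not say which metric we use in production, so treat that choice as an assumption. The three-dimensional vectors are made up for illustration; real embeddings from an API have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Distance derived from cosine similarity: near 0 for very similar documents."""
    return 1.0 - cosine_similarity(a, b)

# Toy embeddings, invented for this example (not real model output).
docker_question = [0.9, 0.1, 0.0]
kubernetes_question = [0.8, 0.3, 0.1]
billing_question = [0.0, 0.2, 0.9]

near = cosine_distance(docker_question, kubernetes_question)
far = cosine_distance(docker_question, billing_question)
print(f"docker vs kubernetes: {near:.3f}")
print(f"docker vs billing:    {far:.3f}")
```

With these toy vectors the two infrastructure questions end up much closer to each other than either is to the billing question, which is the behavior we rely on when comparing documents.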
This rich measure of similarity, combined with more classical language processing techniques like frequency analysis, makes it possible to extract keywords, their relationships, which contexts they’re used in, and the sentiments that appear when people talk about them. Concretely, we can tag a question with topics that it’s about, tag users based on what they talk about, match an expert with questions they’re qualified to answer, and match a question to documentation and other resources on the same topics.
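The tagging and matching described above can be sketched as follows. This is an illustration under stated assumptions, not our production pipeline: the topic names, expert names, vectors, and similarity threshold are all hypothetical, and representing an expert by the mean of their posts' embeddings is one simple option among many.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

def mean_vector(vectors):
    """Component-wise average, used here as a simple per-expert profile."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def tag_topics(doc_vec, topic_vecs, threshold=0.5):
    """Names of topics at least `threshold` similar to the document, best first."""
    scored = [(cosine_similarity(doc_vec, vec), name) for name, vec in topic_vecs.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

def rank_experts(question_vec, expert_vecs, k=2):
    """Names of the k experts whose profile is most similar to the question."""
    scored = [(cosine_similarity(question_vec, vec), name) for name, vec in expert_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Hypothetical topic embeddings (toy 3-dimensional vectors).
TOPICS = {
    "containers": [1.0, 0.0, 0.0],
    "networking": [0.0, 1.0, 0.0],
    "billing": [0.0, 0.0, 1.0],
}

# Hypothetical experts, each profiled by averaging embeddings of their posts.
EXPERTS = {
    "alice": mean_vector([[1.0, 0.2, 0.0], [0.8, 0.2, 0.0]]),
    "bob": mean_vector([[0.1, 0.9, 0.1]]),
    "carol": mean_vector([[0.0, 0.1, 0.9]]),
}

question = [0.8, 0.6, 0.0]  # toy embedding of an incoming question

question_tags = tag_topics(question, TOPICS)
best_experts = rank_experts(question, EXPERTS, k=2)
print(question_tags)
print(best_experts)
```

At scale, the linear scans in `tag_topics` and `rank_experts` would be replaced by a nearest-neighbor query against a vector database; the brute-force version is only meant to show the logic.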
This is all made possible for us thanks to OpenAI’s embeddings API, which currently produces vector embeddings for our documents that are easy to compare, as well as modern vector databases like Milvus. The capacity of OpenAI’s GPT family of LLMs to capture rich semantics is the heart of our current approach to NLP, and is the seed for a new paradigm in the field:
Our take is that it’s no longer economically justified for data science teams applying NLP to expend resources reinventing the language comprehension wheel, especially in light of the effectiveness and apparent necessity of massive compute and training data for producing a high quality LLM.
This is a truly exciting development: it removes the barrier to entry for working with text by eliminating the earliest (and maybe most difficult) stage of processing, freeing data scientists to focus on semantics, which is exactly what their science is pointed at distilling from data. We’re fully behind the new paradigm and eager to see what coming generations of LLMs will be able to offer.
Interacting with users and GPT-4
Embeddings are the base for our approaches to processing text data. We also need to produce text that people can understand, both as readable summaries and tags and as responses from our Slack chat assistant, which answers questions and directs users to resources like documentation. Here too GPT-3 reoriented our approach, and it’s where we’re most excited to leverage our early access to the GPT-4 API.
Using embeddings, we can extract keywords, intents, and topics, but this information is mostly in the form of lists of words and phrases. It eventually needs to be made human readable in a way that’s ergonomic, as domain-informed labels, titles, and summaries. This is where GPT-3 and prompt engineering came in, making it possible for us to express our intent in natural language, combine it with semantics we’ve discovered, and produce quality natural language that’s suited to our needs.
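As a rough illustration of combining discovered semantics with an instruction in natural language, the sketch below assembles a labeling prompt from a keyword list and a few context excerpts. The function name, the prompt wording, and the sample inputs are hypothetical and are not the prompts we actually use; sending the resulting string to a completion API is left out.

```python
def build_label_prompt(keywords, context_snippets, max_words=8):
    """Assemble a prompt asking an LLM for a readable title and summary.

    keywords: words and phrases extracted upstream (for example via embeddings).
    context_snippets: short excerpts showing how the keywords are used.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    lines = [
        "You are labeling a cluster of technical support discussions.",
        f"Write a title of at most {max_words} words and a one-sentence summary.",
        "",
        "Keywords: " + ", ".join(keywords),
    ]
    if context_snippets:
        lines.append("")
        lines.append("Example excerpts:")
        lines.extend(f"- {snippet.strip()}" for snippet in context_snippets)
    lines.append("")
    lines.append("Title:")
    return "\n".join(lines)

# Invented inputs standing in for the output of keyword extraction.
prompt = build_label_prompt(
    ["kubernetes", "ingress", "TLS certificate"],
    ["  My ingress controller rejects the renewed certificate.  "],
)
print(prompt)
```

Keeping prompt assembly in a plain function like this makes it easy to vary the instruction or the amount of context as model context windows change.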
The limited prompt length for GPT-3 meant that most of our prompts focused on summarizing keywords and information about their context. The greatly increased context window of GPT-4, combined with its capacity to understand nuanced and detailed relationships between concepts, opens new possibilities. Prompts can now ask for qualitative features of much more text (up to about 50 pages) and can involve much more complicated multi-part questions.
This capacity underpins our Slack assistant, allowing it to attune more finely to what users are asking and respond in much greater detail. Concretely, pre-GPT a user messaging the assistant got a list of recommendations and experts related to their problem, but not an actual natural language answer contextualizing the recommendations in light of their question. Now the assistant understands the question and explains an answer: it communicates relevant concepts fit to the user’s needs. That’s a difference in kind in terms of ergonomics.
Given how recently GPT-4 became available, we’re at the start of understanding what new use-cases can be developed, running into the happy problem of trying to sidestep our own assumptions about what’s possible and reground our goals with a first-principles mindset.
OpenAI’s GPT family of LLMs has reoriented how we approach NLP and how our data science team aligns its efforts to business needs. That’s thanks to:
- Making it possible for data scientists to focus on semantics – relevant concepts – by outsourcing language comprehension.
- Shifting the scoping of technologies we develop to fitting semantics to our use-cases, rather than developing technologies to process language into semantics.
- Leveraging GPT-4 as an interface between our models and the interactions users have with our products.
We’re extremely excited to see what new data technologies emerge as the LLM dust settles, and to be positioned as active participants in the future of NLP.
If you’re interested in a comprehensive overview of LLM technology and how it fits into an enterprise setting, we highly recommend the recent article from Greycroft. Thanks go to fellow data scientists Eric Laufer and Jayden Luo at Peritus for discussion and feedback on this post.