TermRAG, TAG & Co.—Many Names, Many Possibilities

Generative AI combined with either large language models (LLMs) or small language models (SLMs) [1] has become an integral part of many professional communication and documentation departments, even though it only appeared on the scene a few years ago. Because it makes copywriters, technical writers, and other content creators more efficient and productive, it is increasingly used to create and translate information products.

Nevertheless, regardless of any efficiency considerations, technical documentation must still meet other requirements, such as the comprehensibility and legal certainty of the information products, whether AI is used or not. These often go hand in hand with further quality requirements such as recognizability and consistency in corporate communications [2].

If we narrow our focus to terminology in particular, companies often establish standardized terminology processes for the quality benefits that arise from consistent terminology usage and the resulting improvement in text comprehensibility [3]. However, it is well known that a lack of terminological consistency and the use of non-company-specific terms are among the biggest weaknesses of AI-generated texts when working with generic LLMs [4, 5].

LLMs and terminology requirements

LLMs can be 'forced' to use certain terminology in a number of ways [4]. In simple cases, the user stipulates the terminology in the prompt by appending all the information from a terminology database as plain text to the end of the actual query. The LLM then processes this terminological information together with the query and decides which parts of the terminology database are relevant for its answer.

Similarly, some AI chatbots can also be provided with additional information in a file: in other words, the user can export the entire terminology database and upload it to the chat. Alternatively, terminological instructions can be stored in the system prompt.
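The prompt-based variant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entries, the field names (`preferred`, `forbidden`, `definition`), and the prompt wording are hypothetical and do not come from any particular terminology management system or chatbot.

```python
# Hypothetical termbase export; field names are assumptions for illustration.
TERMBASE = [
    {
        "preferred": "control unit",
        "forbidden": ["controller box"],
        "definition": "Device that regulates the machine.",
    },
    {
        "preferred": "emergency stop",
        "forbidden": [],
        "definition": "Function that halts the machine immediately.",
    },
]


def termbase_as_text(entries):
    """Flatten every entry into one plain-text line."""
    lines = []
    for entry in entries:
        forbidden = ", ".join(entry["forbidden"]) or "none"
        lines.append(
            f"- Use '{entry['preferred']}' (avoid: {forbidden}). "
            f"Definition: {entry['definition']}"
        )
    return "\n".join(lines)


def build_prompt(query, entries):
    """Append the whole termbase to the end of the user's query."""
    return f"{query}\n\nTerminology to follow:\n{termbase_as_text(entries)}"


prompt = build_prompt("Describe how to restart the machine.", TERMBASE)
print(prompt)
```

Note that every entry is sent with every query, whether or not it is relevant. With a real terminology database the prompt therefore grows with the size of the database, not with the needs of the query.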

All these methods are user-friendly and technically easy to implement, although they do have their disadvantages. While terminology databases can contain thousands of concepts with detailed additional data such as definitions, designations, and usage status, LLMs can only process a limited amount of information (measured in so-called "tokens"). This limit is referred to as the "window" (or context window). When it is exceeded, the LLM omits the surplus terminology specifications and conversation details, which ultimately leads to an unsatisfactory result [6]. Terminological consistency is one of the qualities that can suffer, even though the correct terminology was actually provided to the model.
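The effect of the window can be illustrated with a small sketch. It makes two loud assumptions: tokens are approximated by whitespace-separated words (real tokenizers count differently), and the window size is an arbitrary number chosen for the example. The function `entries_that_fit` is hypothetical and only mimics the cut-off in a simplified way.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def entries_that_fit(entries, query, window_tokens):
    """Return the leading entries that still fit next to the query."""
    budget = window_tokens - estimate_tokens(query)
    kept = []
    for entry in entries:
        cost = estimate_tokens(entry)
        if cost > budget:
            break  # everything from here on would fall outside the window
        kept.append(entry)
        budget -= cost
    return kept


entries = [
    "Use control unit, avoid controller box.",
    "Use emergency stop, avoid panic button.",
    "Use maintenance interval, avoid service gap.",
]
query = "Describe the maintenance interval."

# Assumed window of 16 "tokens": the query costs 4, each entry costs 6.
kept = entries_that_fit(entries, query, window_tokens=16)
print(len(kept), "of", len(entries), "entries fit")
```

In this example the only entry that matters for the query is the third one, and it is exactly the one that no longer fits.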

Another disadvantage of these simpler methods is that the LLM itself has to decide which parts of the terminology database are relevant for answering the current query. Practical experience has shown that the LLM often makes the wrong decision here. In addition to these two problems, companies must be aware of the dangers of feeding commercially sensitive or copyright-protected data into an LLM.

Terminological data and RAG

Retrieval-augmented generation (RAG) has been developed in recent years to overcome the problems mentioned above when linking external data to an LLM [7, 8]. The basic idea is to pre-select only those sections of the available documentation (known as the knowledge base) that are actually relevant before any further processing takes place in the LLM. Relevant sections can be selected using different information retrieval techniques, depending on the application scenario and the nature of the documents in the knowledge base. This means that the LLM only has to process as much information as necessary, which can largely counteract the "window" problem and protect sensitive data at the same time.
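A minimal sketch of this pre-selection step follows. It uses plain word overlap between the query and the terms of each entry as the retrieval technique, which is only one simple option and not necessarily what the cited solutions use. The entries, field names, and prompt wording are again hypothetical.

```python
import re

# Hypothetical knowledge base built from termbase entries.
TERMBASE = [
    {"preferred": "control unit", "forbidden": ["controller box"]},
    {"preferred": "emergency stop", "forbidden": ["panic button"]},
    {"preferred": "maintenance interval", "forbidden": ["service gap"]},
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, entries, k=2):
    """Return up to k entries whose terms share words with the query."""
    query_tokens = tokenize(query)
    scored = []
    for entry in entries:
        terms = [entry["preferred"], *entry["forbidden"]]
        entry_tokens = set().union(*(tokenize(term) for term in terms))
        score = len(query_tokens & entry_tokens)
        if score > 0:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def augment(query, entries):
    """Build a prompt that contains only the retrieved entries."""
    lines = [
        f"- Use '{e['preferred']}' (avoid: {', '.join(e['forbidden'])})"
        for e in retrieve(query, entries)
    ]
    return query + "\n\nTerminology to follow:\n" + "\n".join(lines)


query = "How do I reset the controller box?"
hits = retrieve(query, TERMBASE)
print(augment(query, TERMBASE))
```

Because matching also covers the forbidden designations, a query that uses "controller box" still retrieves the entry whose preferred term is "control unit". Only that one entry reaches the LLM; the rest of the database never leaves the retrieval step.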

Yet the LLM can benefit from much more than just information on which terms to use. Within the terminology community, the use of terminology databases for knowledge management has been a topic of intense discussion for many years now [3, 4, 9]. Linking terminological data to LLMs using RAG can also unlock further benefits thanks to the additional information that these databases contain: content such as term definitions and other useful details enables the company-wide understanding of concepts and terms to be fed into the LLM. In this way, an LLM-based, terminology-supported chatbot could play an important role in the transfer and exchange of knowledge in both internal and external company communications.

The good news is that using terminology databases as a knowledge base for RAG already provides good results, without the terminology database needing to contain an explicit concept system or be available as a formal knowledge graph [6, 10]. Existing terminology databases already create added value for LLMs if they follow the simple concept-oriented structure. However, academic studies have shown that explicit concept relationships hold even more potential for LLMs and therefore for further application scenarios of terminology databases [11].
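To make the concept-oriented structure concrete, here is a minimal sketch of how such an entry might be modeled and turned into a text chunk for a RAG knowledge base. The class names, fields, and status labels are assumptions for illustration and do not reflect the schema of any specific terminology management system.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    text: str
    language: str
    status: str  # assumed labels, e.g. "preferred" or "forbidden"


@dataclass
class Concept:
    """One entry per concept; all designations hang off the concept."""
    concept_id: str
    definition: str
    terms: list[Term] = field(default_factory=list)

    def to_chunk(self, language):
        """Render the concept as a text chunk for one language."""
        lines = [f"Concept {self.concept_id}: {self.definition}"]
        for term in self.terms:
            if term.language == language:
                lines.append(f"  {term.status}: {term.text}")
        return "\n".join(lines)


concept = Concept(
    concept_id="C001",
    definition="Device that regulates the machine.",
    terms=[
        Term("control unit", "en", "preferred"),
        Term("controller box", "en", "forbidden"),
        Term("Steuereinheit", "de", "preferred"),
    ],
)
print(concept.to_chunk("en"))
```

Each chunk carries the definition together with the designations and their usage status, so a retrieved entry tells the LLM both what the concept means and which term to use for it.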

TermRAG, TAG & Co.—Implementation

These and other potential applications for connecting terminology databases to LLMs using a RAG pipeline are currently being widely discussed and tested in theory and in practice [5, 6, 10]. The topic has even found its way into terminology training [12]. Within the terminology community, the associated technology goes by a number of different names, such as TermRAG [12], TAG [6], or M-RAG [13]. While we cannot examine the content of these discussions in any great detail here (such as the most suitable data format, the appropriate retrieval technique, or prompt engineering), we will outline the current technical possibilities.

The technical implementation options for TermRAG are developing at a rapid pace, as is the industry as a whole. Until now, TermRAG has often been designed as a standalone solution based on the implementation options of a general, non-terminology-specific RAG pipeline. This means that the solution has to be built within the organization's own IT infrastructure using self-developed components [10], with all the advantages and disadvantages that this entails. User-friendly RAG infrastructures that require no programming knowledge and can also be used for special TermRAG applications are increasingly being made available for research purposes [14, 15]. The first commercial solutions that offer direct integration with existing terminology management systems [6, 13] are also starting to be released.

Conclusion

TermRAG is a hotly debated topic in the terminology community, and rightly so. However, RAG solutions for commercial users can be thought of more broadly, as the knowledge base can contain different document types for different tasks. Terminology databases can therefore be seen as one building block in a more complex company-internal RAG infrastructure.

Studies and practical trials confirm that existing, concept-oriented terminology databases also prove their worth in the context of generative AI. They offer a high reusability value and improve the quality of text generation and translation in technical documentation. Terminology databases therefore continue to make an important contribution to precise, efficient, and legally compliant technical communication.

Prof. Dr. Karolina Suchowolec
