Fowler & Subramaniam on GenAI patterns

Emerging Patterns in Building GenAI Products

The team at ThoughtWorks have distilled several patterns that they've found useful in building GenAI products. Some of these patterns have emerged from the new range of problems that these tools present: "including hallucination, unbounded data access and non-determinism."

Direct prompting

LLM will not know anything beyond what it was trained or things that happened since. May not know the context it's operating in and therefore it may not prioritize it's knowledge.

Habits of misleading replies and over confidence. Hallucinate.

In order to use this pattern you need to figure out how good the results of the prompting is. Thought works has found that a strong emphasis on a systematic and deliberate testing of AI systems is crucial to allow for changing them in various ways in the future.

Evals

Gen AI cannot be evaluated deterministically therefore you must use a scoring engine of some sort. You can evaluate a model based on individual output however more often you use a range of scenarios.

You can use a few different ways to evaluate a model's outputs: One is to have the model self evaluate. Next is to have another LLM model b the judge. Lastly a human evaluation is also quite useful when you have to see if the model quote unquote gets it.

Generally LOM has judge and combining with human evaluation works the best. This leverages automated feedback and human feedback as well which is crucial to know if the tone and clarity of the messages correct.

Evals will have to be tested based on a threshold rather than simple pass or fail given that they are non-deterministic.

Indeed it is best to establish a benchmark collecting various data points and charting their variance over time.

Scoring and judging is still an inexact science given that we are still in the early days of this tech. It will continue to evolve and change.

Embeddings

Encoding's capture the meaning of data which allows for better comparisons and relationships. However they are not ideal for structured or relational data.

Embeddings provide a way to index large quantities of unstructured data.

Rag

Rag is an effective tool for use with specialized knowledge bases. However you have to present it with the correct documents.
One of the best ways to do this is to leverage embeddings. You need to let it know that it needs to determine when it doesn't have sufficient data. And to also use this data explicitly.

Rag allows you to add data beyond what a model is trained on and beyond the date that it was trained on. It is great for rapidly changing data.

The contacts provided can help mitigate biases in the original training.

Generally, Rag is much more cost effective than fine tuning.

Chunk retrieval by itself isn't efficient.

"we've found we can tackle most of our generative AI work using Retrieval Augmented Generation (RAG)."

RAG Limitations

Inefficient Retrieval

"relying solely on document chunk embeddings in a vector store won’t lead to efficient retrieval"

"dense embeddings are good at finding similar paragraphs, they inevitably lose some semantic detail. No amount of fine-tuning can completely bridge this gap."

Use Hybrid Retriever to mitigate these issues.

Minimalistic User Query

Oftentimes a User's query might not be specific enough or have enough data. This can lead "to less accurate and more generalized results."

Use Query Rewriting to mitigate. Another option might be providing already formed options for the user.

  • combine the results

  • crucial for complex searches. Searches can often be improved by using the proper terms.

  • Extra cost associated

Context bloat

Currently, LLMs (even LLMs designed for large-context) have issues with large context and will ignore details in the middle of the context.

Use the Reranker pattern to mitigate.

  • works well with an overly large result set

  • cross-encoder like bge-reranker-large

  • Enhances the accuracy and relevance of documents

  • Useful for incorporating a user's preferences

  • adds cost

Gullibility

Over-confidence, hallucinating, revealing secrets, etc.

Use Guardrails pattern..

  • input and output guardrails

  • generally a specific model

  • can implement the self_check_input rails of NeMo Guardrails framework.

  • can also use embeddings given that they are more flexible than rigid rules

  • used Semantic Router to route or reject

Fine-tuning

Fine tuning a model incurs significant skills, computational resources, expense, and time. Therefore it's wise to try other techniques first, to see if they will satisfy our needs - and in our experience, they usually do.

Fine-tuning is usually not the best approach given that it's expensive to do.

First try different prompting tweaks. Then Introduce Rag and then finally Fine Tuning.

RAG might not be enough and might not supply enough context or too narrow.

"backpropagation, where errors are propagated backward through the model to update its weights, improving future predictions."

Full fine-tuning every part of the model is affected.

Using Selective layer fine-tuning, you can achieve significant improvements by selectively fine-tuning specific layers.

Parameter-Efficient Fine-Tuning (PEFT) adds and trains new params using Low-Rank Adaptation (LoRA) or Prompt Tuning

without changing the base model's parameters.

The majority of your efforts will go into data preparation and curation. Look at synthetic data generation.