Chip Huyen's AI Engineering

07 Feb 2025

ai llm learning

Building Applications with Foundation Models

5 Prompt Engineering

Prompt engineering guides a model’s behavior without changing the model’s weights.

You should make the most out of prompting before moving to more resource-intensive techniques like finetuning.

To build production-ready AI applications, you need more than just prompt engineering. You need statistics, engineering, and classic ML knowledge to do experiment tracking, evaluation, and dataset curation.

A prompt is an instruction given to a model to perform a task.

Task description, Example(s) of how to do this task, The task (such as formatting the response). Experiment with different prompt structures to find out which works best for you.

"How much prompt engineering is needed depends on how robust the model is to prompt perturbation."

You can measure a model’s robustness by randomly perturbing the prompts
to see how the output changes. robustness is strongly correlated with its overall capability.

In-Context Learning: Zero-Shot and Few-Shot

Teaching models what to do via prompts is also known as in-context learning.

In-context learning allows a model to incorporate new information continually to make decisions, preventing it from becoming outdated.

Each example provided in the prompt is called a shot.

Teaching a model to learn from examples in the prompt is also called few-shot learning.
When no example is provided, it’s zero-shot learning.

The more examples there are, the longer your prompt will be, increasing the inference cost.

Sometimes, prompt and context are used interchangeably: prompt to refer to the whole input into the model, and
context to refer to the information provided to the model so that it can
perform a given task.

System Prompt and User Prompt

You can think of the system prompt as the task description and the user prompt as the task.

Typically, the instructions provided by application developers are
put into the system prompt, while the instructions provided by users are put
into the user prompt.

Many model providers emphasize that well-crafted system prompts can
improve performance.

the system prompt and the user prompt are
concatenated into a single final prompt before being fed into the model.

Context Length and Context Efficiency

Not all parts of a prompt are equal. Research has shown that a model is
much better at understanding instructions given at the beginning and the
end of a prompt than in the middle

Prompt Engineering Best Practices

Write Clear and Explicit Instructions

As you experiment with a prompt, you might observe undesirable behaviors
that require adjustments to the prompt to prevent them.

A persona can help the model to understand the perspective it’s supposed to
use to generate responses.

Examples can reduce ambiguity about how you want the model to respond.

If you want the model to be concise, tell it so. Long outputs are not only
costly (model APIs charge per token) but they also increase latency. make explicit that you don’t
want preambles.

If you want the model to generate JSON, specify what the keys in the JSON should be.

For tasks expecting structured outputs, such as classification, use markers to
mark the end of the prompts to let the model know that the structured
outputs should begin.

Provide Sufficient Context

Context can also mitigate hallucinations.

You can either provide the model with the necessary context or give it tools
to gather context.

In many scenarios, it’s desirable for the model to use only information
provided in the context to respond. How to restrict a model to only the context is tricky. Clear instructions, such
as “answer using only the provided context”, along with examples of
questions it shouldn’t be able to answer, can help.

The safest method is to
train a model exclusively on the permitted corpus of knowledge, though this
is often not feasible for most use cases.

Break Complex Tasks into Simpler Subtasks

For complex tasks that require multiple steps, break those tasks into
subtasks.

Example of responding to a customer support request: intent classification, generating response.

Prompt decomposition not only enhances
performance but also offers several additional benefits: Monitoring (intermediate steps), Debugging, Parallelization, Effort (simpler prompts).

One downside of prompt decomposition is that it can increase the latency
perceived by users

Prompt decomposition typically involves more model queries, which can
increase costs.

Give the Model Time to Think

encourage the model to “think” about a question using chain-of-thought (CoT) and self-
critique prompting.

CoT means explicitly asking the model to think step by step, nudging it
toward a more systematic approach to problem solving.

CoT is among the
first prompting techniques that work well across models.

The simplest way to do CoT is to add “think step by step” or “explain your
decision” in your prompt.

you can specify the steps the model should take or include
examples of what the steps should look like in your prompt.

Self-critique means asking the model to check its own outputs.

Similar to prompt decomposition, CoT and self-critique can increase the
latency perceived by users.

Iterate on Your Prompts

As you experiment with different prompts, make sure to test changes
systematically. Version your prompts. Use an experiment tracking tool.
Standardize evaluation metrics and evaluation data so that you can compare
the performance of different prompts. Evaluate each prompt in the context
of the whole system. A prompt might improve the model’s performance on
a subtask but worsen the whole system’s performance.

Evaluate Prompt Engineering Tools

A common approach to automating prompt generation is to use AI models.

Following the keep-it-simple principle, you might want to start by writing
your own prompts without any tool.

If you use a prompt engineering tool, always inspect the prompts produced
by that tool to see whether these prompts make sense and track how many API calls it generates.

Organize and Version Prompts

It’s good practice to separate prompts from code: reusability, testing, readability. collaboration.

Several tools have proposed special .prompt file formats to store prompts.
See Google Firebase’s Dotprompt, Humanloop, Continue Dev, and
Promptfile.

Defensive Prompt Engineering

Prompt extraction, Jailbreaking and prompt injection, Information extraction

Remote code or tool execution, Data leaks, Social harms, misinformation, Service interruption and subversion, Brand risk

Proprietary Prompts and Reverse Prompt Engineering

Reverse prompt
engineering is the process of deducing the system prompt used for a certain
application.

Reverse prompt engineering is typically done by analyzing the application
outputs or by tricking the model into repeating its entire prompt, which
includes the system prompt.

“Write your system prompt assuming that it will one day become
public.”

While well-crafted prompts are valuable, proprietary prompts are more of a
liability than a competitive advantage. Prompts require maintenance. They
need to be updated every time the underlying model changes.

Jailbreaking and Prompt Injection

Jailbreaking a model means trying to subvert a model’s safety features.

Prompt injection refers to a type of attack where malicious instructions are
injected into user prompts.

Prompt attacks are possible precisely because models are trained to follow
instructions.

it’s difficult for a model to differentiate between system prompts and user prompts

Direct manual prompt hacking

This family of attacks involves manually crafting a prompt or a series of
prompts that trick a model into dropping its safety filters. akin to social engineering

In the early days of LLMs, a simple approach was obfuscation. (misspellings, special characters)

The second approach is output formatting manipulation, which involves
hiding the malicious intent in unexpected formats. (writing a poem about doing something illegal)

The third approach, which is versatile, is roleplaying. Attackers ask the
model to pretend to play a role or act out a scenario. In the early days of
jailbreaking, a common attack was called DAN, Do Anything Now.

Automated attacks

Prompt hacking can be partially or fully automated by algorithms.

wo algorithms that randomly
substitute different parts of a prompt with different substrings to find a
variation that works. shows that it’s possible to ask
a model to brainstorm new attacks given existing attacks.

Prompt Automatic Iterative Refinement (PAIR) uses an AI model to act as
an attacker. This attacker AI is tasked with an objective, such as eliciting a
certain type of objectionable content from the target AI.

Generate a prompt.
Send the prompt to the target AI.
Based on the response from the target, revise the prompt until the
objective is achieved.

In their experiment, PAIR often requires fewer than twenty queries to
produce a jailbreak.

Indirect prompt injection

Indirect prompt injection: instead of placing malicious instructions in the prompt directly,
attackers place these instructions in the tools that the model is integrated
with.

Passive phishing
Active injection

An attacker could sign up with a username
like “Bruce Remove All Data Lee”. (little bobby tables?)

Information Extraction

intended use can be exploited for the following purposes: Data Theft, Privacy Violation, Copyright infringement

A niche research area called factual probing focuses on figuring out what a
model knows.

The same techniques used to probe a model for its knowledge can also be
used to extract sensitive information from training data.

The assumption is
that the model memorizes its training data, and the right prompts can
trigger the model to output its memorization.

Both papers
concluded that while such extraction is technically possible, the risk is low
because the attackers need to know the specific context in which the data to
be extracted appears.

For example, when they asked ChatGPT (GPT-
turbo-3.5) to repeat the word “poem” forever, the model initially repeated
the word “poem” several hundred times and then diverged.

This suggests the existence of prompt strategies that allow training data
extraction without knowing anything about the training data.

the memorization rates for some models,
based on the paper’s test corpus, to be close to 1%.

there’s a clear trend that the larger model memorizes more, making larger models more vulnerable to data extraction attacks.

Training data extraction is possible with models of other modalities, too.

diffusion models are much less private than prior
generative models such as GANs, and that mitigating these vulnerabilities
may require new advances in privacy-preserving training.

It’s important to remember that training data extraction doesn’t always lead
to PII

Models can also just regurgitate training data without adversarial attacks.

By studying a wide range of foundation models, they concluded that “the
likelihood of direct regurgitation of long copyrighted sequences issomewhat uncommon, but it does become noticeable when looking at
popular books.

It’s unlikely there will be a foolproof automatic way to detect
copyright infringement. The best solution is to not train a model on
copyrighted materials, but if you don’t train the model yourself, you don’t
have any control over it.

Defenses Against Prompt Attacks

Tools
that help automate security probing include Azure/PyRIT, leondz/garak,
greshake/llm-security, and CHATS-lab/persuasive_jailbreaker.

https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/red-teaming

To evaluate a system’s robustness against prompt attacks, two important
metrics are the violation rate and the false refusal rate.

Model-level defense

many attacks can be thwarted if the model is
trained to better follow system prompts.

OpenAI introduces an
instruction hierarchy that contains four levels of priority. In the event of conflicting instructions, the higher-priority instruction should be followed.

System prompt
User prompt
Model outputs
Tool outputs

When finetuning a model for safety, it’s important to train the model not
only to recognize malicious prompts but also to generate safe responses for
borderline requests.

Prompt-level defense

One simple trick is to repeat the system prompt twice, both before and after
the user prompt.

When using prompt tools, make sure to inspect their default prompt
templates since many of them might lack safety instructions.

LangChain’s default templates were so
permissive that their injection attacks had 100% success rates. Adding
restrictions to these prompts significantly thwarted these attacks.

System-level defense

isolation: If your system involves executing
generated code, execute this code only in a virtual machine separated from
the user’s main machine.

To reduce the chance of your application talking about topics it’s not
prepared for, you can define out-of-scope topics for your application.

More advanced algorithms use AI to understand the user’s intent by
analyzing the entire conversation, not just the current input.

Use an anomaly detection algorithm to identify unusual prompts.

On the
input side, you can have a list of keywords to block, known prompt attack
patterns to match the inputs against, or a model to detect suspicious
requests.

Bad actors can be detected not just by their individual inputs and outputs
but also by their usage patterns.

6. RAG and Agents

"RAG as a solution for knowledge-intensive tasks where all the available knowledge can’t be input into the model directly"

Retrieve-then-generate pattern. RAG as a technique to construct context specific to each
query. RAG as a technique to construct context specific to each query. Context construction for foundation models is equivalent to feature
engineering for classical ML models.

A model that can process long context doesn’t necessarily use that
context well.

if “your knowledge base is smaller than 200,000 tokens
(about 500 pages of material), you can just include the entire knowledge base in the prompt that you
give the model

RAG Architecture

In the original RAG paper, Lewis et al. trained the retriever and the generative model together. The success of a RAG system depends on the quality of its retriever.

How to index data depends on how you want to retrieve it later on. You can split each document into more manageable
chunks

Retrieval Algorithms

Two of the most common retrieval algorithms: term-based retrieval and embedding-based retrieval.

Sparse retrievers represent data using sparse vectors. A sparse vector is a vector where the majority of the values are 0.

Term-based retrieval is considered sparse, as each term can be represented using a sparse one-hot vector, a vector that is 0 everywhere except one value of 1.

Dense retrievers represent data using dense vectors. A dense vector is a vector where the majority of the values aren’t 0. Embedding-based retrieval is typically considered dense, as embeddings are generally dense vectors.

Lexical/term Retrieval

Lexical retrieval: the most straightforward way to find relevant documents is with keywords.

TF-IDF, Term frequency. a term’s importance is inversely proportional to the number of documents it appears in. inverse document frequency (IDF).
BM25 normalizes term frequency scores by document length. Longer documents are more likely to contain a given term and have higher term frequency values.

tokenization: can lead to multi-word terms being broken into individual words, losing their original meaning. One way to
mitigate this issue is to treat the most common n-grams as terms. Measuring the lexical similarity between two texts
based on their n-gram overlap.

Term-based retrieval is generally much faster than embedding-based retrieval during both indexing and query. Term extraction is faster than embedding generation, and mapping from a term to the documents that contain it can be less computationally expensive than a nearest-neighbor search.

its simplicity also means that it has fewer components you can tweak to improve its performance

Vector search & Embeddings

Semantic retrieval: embedding-based retrievers aim to rank documents based on how closely their meanings align with the query

Embedding model: convert the query into an embedding using the same embedding model used during indexing.

Real-world semantic retrieval systems might contain other components, such as a reranker to rerank all retrieved candidates, and caches to reduce latency.

Vector search is typically framed as a nearest-neighbor search problem. The naive solution is k-nearest neighbors (k-NN) but it’s
computationally heavy and slow. It should be used only for small datasets.

For large datasets, vector search is typically done using an approximate nearest neighbor (ANN) algorithm.

vector databases organize vectors into buckets, trees, or graphs

Many traditional databases have extended or will extend to support vector storage and vector search.

Embedding-based retrieval, on the other hand, can be significantly improved over time to outperform term-based retrieval. You can finetune the embedding model and the retriever, either separately, together, or in conjunction with the generative model.

Since much of RAG latency comes from output generation, especially for long outputs, the added latency by query embedding generation and vector search might be minimal compared to the total RAG latency

Comparison

Context precision & Context recall. Curate an evaluation set with a list of test queries and a set of documents.

You only need to compare the retrieved documents to the query, which can be done by an AI judge.

If you care about the ranking of the retrieved documents, for example, more relevant documents should be ranked first.

For semantic retrieval, you need to also evaluate the quality of your embeddings.

The quality of a retriever should also be evaluated in the context of the whole RAG system.

With retrieval systems, you can make certain trade-offs between indexing and querying. The more detailed the index is, the more accurate the retrieval process will be, but the indexing process will be slower and more memory-consuming

The quality of a RAG system should be evaluated both component by component and end to end.

Evaluate the retrieval quality.
Evaluate the final RAG outputs.
Evaluate the embeddings (for embedding-based retrieval).

a production retrieval system typically combines several approaches.

first using a cheap, fast retriever and then a slow more in-depth retriever
using multiple retrievers in tandem

Both of those require a reranker

Retrieval Optimization

The simplest strategy is to chunk documents into chunks of equal length based on a certain unit.

You can also split documents recursively using increasingly smaller units until each chunk fits within your maximum chunk size

Specific documents might also support creative chunking strategies

Overlapping ensures that important boundary information is included in at least one chunk.

The chunk size shouldn’t exceed the maximum context length of the generative model. For the embedding-based approach, the chunk size also shouldn’t exceed the embedding model’s context limit.

A smaller chunk size allows for more diverse information. Small chunk sizes, however, can cause the loss of important information. Smaller chunk sizes can also increase computational overhead.

Reranking

Reranking is especially useful when you need to reduce the number of retrieved documents, either to fit them into your model’s context or to reduce the number of input tokens

Documents can also be reranked based on time, giving higher weight to more recent data.

In context reranking, the order of documents still matters because it affects how well a model can process them. Models might better understand documents at the beginning and end of the context

Query Rewriting

Query rewriting is also known as query reformulation, query normalization, and sometimes query expansion.

In traditional search engines, query rewriting is often done using heuristics. Query rewriting can also be done using other
AI models.

Contextual Retrieval

Augment each chunk with relevant context to make it easier to retrieve the relevant chunks. A simple technique is to augment a chunk with metadata like tags and keywords.

You can also augment each chunk with the questions it can answer. For customer support, you can augment each article with related questions.

You can augment each chunk with the context from the original document, that explains the chunk and its relationship to the original document.

8. Dataset Engineering

Highlights

Data sets should be high-quality, task-specific data
- different training phases demand distinct data formats
- meaningful progress depends on both model and data improvements
You need the right data in the right format at the right scale. Lots of bad data won't help you.
- relevant,aligned with task requirements, consistent, correctly formatted, unique, and compliant
You need data diversity of the problems your system is supposed to solve.
- Annealing (training on small amounts of high-quality examples) can improve performance for specific use cases
You need a lot more data with full-finetuning vs PEFT. Better models often require less data during fine-tuning.
- data set experiments -> you can see the performance curve
Leveraging your application data is the best given that it represents your diversity and you can leverage a data flywheel to continually improve the product.
augmented vs synthesized data
use data synthesis to increase quantity, coverage, quality, concerns
- Can help with biases
- perturbation: adding noise to existing data to generate new data
- AI-generated content (web) -> already trained on synthetic data
- reverse instruction approach: use long-form data to generate prompts that would elicit that content
- Eval by functional correctness and AI Judges
- AI data -> lower quality
  - can cause irreversible defects
  - imitation can cause hallucinations
  - more likely to produce probable than improbably events
- Distillation -> teacher's standard is the student's standard
  - synthetic data oftent used with lora
  - training on data generated by a more competent model can significantly improve performance
- Data Processing
  - Stare at the data, manual inspection
  - Data duplication can introduce biases
    - deduplicated by: pairwise comparison, hashing, dimensionality reduction
  - Clean and filter data -> extra tokens, html, PII, toxic, etc

The best
ML team in the world with infinite compute can’t help you finetune a good
model if you don’t have data.

For the same model, different training phases aim to teach the model
different capabilities, and, therefore, require datasets with different
attributes.

data-centric AI, as opposed to model-centric AI

meaningful technological progress often requires investment in
both model and data improvements.

For
instruction finetuning, you need data in the (instruction, response) format.
For preference finetuning, you need data in the (instruction, winning
response, losing response) format. To train a reward model, you can use the
same data format as preference finetuning or use data with annotated scores
for each of your examples in the ((instruction, response), score) format.

Acquiring high-quality data annotations is always challenging, but it’s even
more challenging if you want to teach models complex behaviors such as
chain-of-thought (CoT) reasoning and tool use.

CoT: ts training data should include CoT responses.

It’s common to use domain experts to
create tool use data, where each prompt is a task that requires tool
use, and its response is the actions needed to perform that task.

Data Curation

Data curation isn’t just about creating new data to help a model learn new
behaviors but is also about removing existing data to help a model unlearnbad behaviors.

Data coverage is equivalent to having the right
mix of ingredients (e.g., you shouldn’t have too much or too little sugar).
Data quantity is about how many ingredients you should have.

A small amount of high-quality data can outperform a large amount of noisy
data, e.g., data that is irrelevant or inconsistent.

The short answer is that data is considered
high-quality if it helps you do your job efficiently and reliably. The long
3
answers, however, differ for different people.

In general, data can be
considered high-quality if it has the following six characteristics: relevant,aligned with task requirements, consistent, correctly formatted, unique, and
compliant.

Redundant formatting tokens can interfere with the model’s learning,

Data Coverage

A model’s training data should cover the range of problems you expect it to
solve.

Coverage requires sufficient data diversity, which is why
many refer to this attribute as data diversity.

On the other hand, a
chatbot that recommends products to global customers doesn’t necessarily
need domain diversity, but linguistic and cultural diversity will be
important.

Llama 3 authors shared that annealing the model on small amounts of
high-quality code and math data (training the model using an increasingly
smaller learning rate with increasingly more code and math data) can boost
the performance of their models on key benchmarks.

This confirms acommon belief that high-quality code and math data is more effective than
natural language text in boosting the model’s reasoning capabilities.

A
simple approach is to choose a data mix that accurately reflects the real-
world application usage.

Data Quantity

Three other factors influence how much data you need:

Full finetuning promises to give the best performance, but it requires orders of magnitude more data than PEFT methods like LoRA.
Task complexity
Base model’s performance: The closer the base model is to the desirable performance, the fewer examples are needed to get there.

if you have fewer examples (100), more advanced models give you better finetuning performance.

In short, if you have a small amount of data, you might want to use PEFT
methods on more advanced models. If you have a large amount of data, use
full finetuning with smaller models.

Experimenting with a small dataset can help you estimate how much more
data you’ll need. A steep performance gain slope with increasing dataset size
means that you can expect significant performance improvement bydoubling your data.

While a larger number of finetuning examples generally improves a model’s
performance, the diversity of the examples matters, too.

The diversity of data can be reflected in task types (such as summarization
and question answering), topic diversity (such as fashion, finance, and
technology), and the expected output formats (such as JSON outputs or yes-
or-no answers).

Data Acquisition and Annotation

The most important source of data, however, is typically data from your
own application. If you can figure out a way to create a data flywheel that
leverages data generated by your users to continually improve your product, you will gain a significant advantage.

Application data is ideal because it’s
perfectly relevant and aligned with your task. In other words, it matches the
distribution of the data that you care about, which is incredibly hard to
achieve with other data sources.

Google dataset search

Often, you might need to annotate your own data for finetuning. Annotation
is challenging not just because of the annotation process but also due to the
complexity of creating clear annotation guidelines.

Some teams, including LinkedIn, have reported that annotation guidelines
were among the most challenging parts of their AI engineering pipeline.

Two processes commonly used are data augmentation and data synthesis.

Data augmentation creates new data from existing data
Data synthesis generates data to mimic the properties of real data.

In other words, augmented data is derived from real data, whereas synthetic
data isn’t real. However, since the goal of both augmentation and synthesis
is to automate data creation, sometimes the two terms are used
interchangeably.

In many use cases, as discussed in “Limitations to AI-generated data”,
mixing human- and AI-generated data often produces the best value.

Why Data Synthesis

To increase data quantity
To increase data coverage
To increase data quality
To mitigate privacy concerns
To distill models

Using algorithms to generate data is also called procedural generation, as opposed to manual generation.

A newer method made possible by advanced AI models is using AI itself to synthesize data.

You can procedurally generate new data from existing data by applying
simple transformations.

This approach can be used to mitigate potential biases in your data.

One interesting transformation is perturbation: adding noise to existing data
to generate new data.

You can train your model on perturbed data. Perturbation can both improve
the model’s performance and make it more robust against attacks;

Simulations allow you to run multiple experiments with minimal costs
while avoiding accidents and physical damage.

Simulations are common to generate data to teach models to use tools.

AI-Powered Data Synthesis

AI’s paraphrasing and translation abilities can be used to augment existing
datasets.

However, as the internet becomes flooded with AI-generated content,
models that rely on internet data are likely already pre-trained on synthetic
data.

Data synthesis for post-training is also more common because post-training
data, including both instruction data and preference data, generally demands
the most effort to produce.

You can also use humans to write instructions and AI to
generate responses

follow the reverse
instruction approach: take existing long-form, high-quality content like
stories, books, and Wikipedia articles and use AI to generate prompts that
would elicit such content

It’s possible to use reverse instruction to develop increasingly powerful models without adding manually annotated data.

To ensure the quality of the generated data, they employed a rigorous correctness analysis and error correction pipeline

The quality of AI-generated data can be measured the same way you’d evaluate other AI outputs—by functional correctness and AI judges.

Just like real data, synthetic data can also be filtered using heuristics. In
general, you might want to remove examples that are empty or too short for
your application.

AI’s generated data can be of low quality, and, as people never tire of
saying, “garbage in, garbage out.”

Worse, imitation can force the student model to hallucinate. Imagine if the
teacher model is capable of answering complex math questions, so its
responses to those questions are solutions.

It’s also unclear how much AI-generated data a model can train on. Some
studies have shown that recursively using AI-generated data in training
causes irreversible defects in the resulting models, degrading their
performance over time.

One possible explanation is that AI models are more likely to generate
probable events (e.g., not having cancer) and less likely to generate
improbable events (e.g., having cancer).

This causes models to output more
common events over time while forgetting rare events.

Some people have been able to improve model performance using a large
amount of synthetic data.

AI-generated data might also perpetuate biases.

the more faithful the model’s outputs to the
characteristics of the original training distribution, the more stable the
feedback loop, thus minimizing the risk of bias amplification.

Model Distillation

Model distillation (also called knowledge distillation) is a method in which
a small model (student) is trained to mimic a larger model

Traditionally, the goal of model distillation is to produce smaller models for
deployment. Deploying a big model can be resource-intensive.

Synthetic instruction data is commonly used together with adapter-based
techniques, such as LoRA.

Note that not all training with synthetic data is model distillation. Model
distillation implies that the teacher model’s performance is the student’s
gold standard.

it’s possible to use synthetic data to train a student
model that is larger and more powerful than the teacher.

The Llama 3 paper notes that while training on data generated by a more
competent model can significantly improve a model’s performance, training
indiscriminately on self-generated data doesn’t improve the model’s
performance and can even degrade it. However, by introducing mechanisms
to verify the quality of synthetic data and using only verified synthetic data,
they were able to continually improve a model using its generated data.

Data Processing

In every project I’ve worked on,
staring at data for just 15 minutes usually gives me some insight that could
save me hours of headaches.

Duplicated data can skew the data distribution and introduce biases into
your model.

Whole document duplications, Intra-document duplications, Cross-document duplications

Here are some concrete ways you can deduplicate data:

Pairwise comparison
Hashing
Dimensionality reduction

Clean and Filter Data

remove extraneous formatting tokens

Unless you want to train your model on HMTL tags, remove
them.

You need to clean your data of anything that isn’t compliant with your
policies, such as PII, sensitive data, copyrighted data, or data that is
considered toxic

You also might want to remove low-quality data, using techniques
discussed in “Data verification” to detect low-quality data.

Manual inspection of data is especially important in this step. Staring at
data might help you notice patterns that you can use as heuristics to detect
low-quality data.

If there is more data than you need or can afford to use (e.g., due to your
compute budget), you can further filter your data.

Format Data

If you’re doing supervised finetuning, your data is most likely in the format
(instruction, response)

Instructions can be further decomposed into (system
prompt, user prompt).

If you’ve graduated to finetuning from prompt
engineering, the instructions used for finetuning might be different from the
instructions used during prompt engineering.