I’m starting a Learn By Building series, as I learn best when I get my hands dirty on a topic I’d like to learn more about. In this article, I’ll be sharing some learnings about how LLMs behave and the parameters that influence their response output, by setting up Ollama locally and playing around with PDF file content extraction and prompting.
Introduction
Early in my LLM learning journey, I assumed that providing the model with more information was the way to get better-quality responses. However, that assumption quickly broke as I started experimenting with LLMs more deeply.
While experimenting with a locally hosted LLM, I fed it additional content as a knowledge base. Its answer to my very first factual question was incorrect—yet delivered with complete conviction. That response sent me down a rabbit hole, learning what a context window is and how it impacts LLM responses.
I realised that fine-tuning an LLM is a lot like brewing a cup of coffee - there are multiple variables at play, and none of them are simple on/off switches. Each exists on a spectrum, and small changes can alter the final result.
Extracting PDF Content With Docling
I first started by setting up my LLM server. I’m using Ollama as it’s free, simple to set up locally and easy to integrate with the Python libraries I’ll be playing around with, like LangChain and Docling. As an avid miles collector who is familiar with and constantly up to date on discussions in this space, I’ll be using Singapore’s credit card rewards as the main use case for this article.
Credit card rewards terms and conditions typically come as a PDF file, and they tend to use tables to organise information. Unlike traditional PDF parsers, which flatten the text, Docling is great at parsing text from a PDF while preserving the table structure in markdown format. Being able to retain the structure is important for preserving semantics, especially during Retrieval Augmented Generation (RAG). To keep things simple at the beginning—and because the “Valley of Despair” stage of the Dunning–Kruger effect is very real—I’m skipping the RAG process for now.
Below is a code sample showing how easily I extracted the PDF content from UOB’s Terms and Conditions page into a txt file while preserving the structure with Docling:
from docling.document_converter import DocumentConverter

# Convert the T&C PDF into a structured Docling document
converter = DocumentConverter()
result = converter.convert("./ladys-cards-tcs.pdf")

# Export the parsed document, tables included, as markdown text
with open("uob_ladys.txt", "a") as f:
    uob_lady = result.document.export_to_markdown()
    f.write(uob_lady)

The uob_ladys.txt file will be used as the “knowledge base” for answering prompts.
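Before moving on, it helps to sanity-check that the table structure actually survived the extraction. Here is a minimal sketch that simply inspects the text file produced above (nothing Docling-specific; it just looks for pipe-delimited markdown table rows):

```python
# Preview the exported markdown and count table-like rows. Markdown tables are
# pipe-delimited, so lines starting with "|" are a rough signal that the
# table structure was preserved.
with open("uob_ladys.txt", "r") as f:
    lines = f.readlines()

table_rows = [line for line in lines if line.lstrip().startswith("|")]
print(f"Total lines: {len(lines)}, table-like rows: {len(table_rows)}")
print("".join(lines[:20]))  # preview the first 20 lines
```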
Context Window Matters
Configuring Context Window Size
I set up the Ollama server to use the gemma3:4b model, then used LangChain to create a prompt template that injects the extracted text as the context for answering the user’s question. Along with the context, I added some generic guidelines instructing the LLM to refer to the context when responding to the question, and to be honest if it does not have the information to answer accurately.
from langchain_ollama import ChatOllama
from langchain_core.prompts import PromptTemplate
uob_ladys_context = open("./uob_ladys.txt", "r")
template = """
Use only the following pieces of context below to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n
As information accuracy is important, please ensure that your answer is aligned with the context provided. If the context contains conflicting information, prioritize the most recent details.
Here is the latest terms and conditions as of 2025 for UOB's lady's card (source: https://www.uob.com.sg/assets/pdfs/ladys-cards-tcs.pdf):
Context: {context}
Based on the context above, please answer the following question and elaborate on your answer:
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
# Attempts to connect to local Ollama server at http://127.0.0.1:11434
model = ChatOllama(model="gemma3:4b")
chain = prompt | model
response = chain.invoke({
    "context": uob_ladys_context.read(),
    "question": "How much is the annual fees for the UOB lady's card?",
})

To my surprise, the LLM gave a confidently wrong answer. It answered 360 SGD per year, yet the T&C PDF does not state 360 SGD anywhere at all.
Prompt: "How much is the annual fees for the UOB Lady's Card?"
Answer: The annual fee for the UOB Lady's Card varies, but according to the terms and conditions, it is SGD 360 per year.

After some research, I chanced upon the concept of the “context window”. Everything I dumped into the prompt, plus the LLM’s response output, effectively forms the “context window”. The gemma3:4b model has a context window of 128k tokens, which is pretty large, so why was the LLM still not able to retrieve the correct information from the context? I also tried a different model, llama3.2, which has a 128k-token context window, and observed the same behaviour.
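At this point, it also helps to check how much text is actually being fed into the prompt. Below is a rough sketch using the common ~4 characters-per-token rule of thumb; the exact count depends on the model’s tokenizer, so treat the number as a ballpark figure only:

```python
# Rough token estimate for the extracted context. The 4-characters-per-token
# ratio is only an approximation; the exact count depends on the tokenizer.
with open("./uob_ladys.txt", "r") as f:
    context_text = f.read()

approx_tokens = len(context_text) // 4
print(f"Approximate context size: {approx_tokens} tokens")
```

If this estimate is much larger than the context window the model is actually configured with, the tail of the context is silently dropped before the model ever sees it.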
It turns out that the ChatOllama interface from the LangChain library uses a default context window (num_ctx) of 2048 tokens, which meant the context got truncated and no longer contained the relevant information, resulting in the hallucination. After manually editing the file to trim the context down to 554 tokens, containing only the annual fee information, the LLM responded correctly!
Prompt: "How much is the annual fees for the UOB Lady's Card?"
Answer: The annual fees for the UOB Lady's Cards are as follows:
- UOB Lady's Classic/Platinum World Mastercard: $196.20
- UOB Lady's Solitaire Card: $414.20
- UOB Lady's Solitaire Metal Card: $598.99

Does A Larger Context Window Lead To Better Results?
Next, I also tried increasing the context window to 8,000 tokens instead of trimming down the information provided in the context.
model = ChatOllama(model="gemma3:4b", num_ctx=8000)

The LLM now provides more details about the annual fees, including the supplementary cards, which I think is great! However, I noticed a degradation in performance: it previously took 7–10 seconds on average to generate a response, but it now takes 19–21 seconds on average, which is more than 2x longer.
According to the context provided, the annual fee for the UOB Lady's Card varies depending on the type of card. Here are the annual fees mentioned in the terms and conditions:
* UOB Lady's Classic/Platinum World Mastercard: S$196.20
* UOB Lady's Solitaire Card: S$414.20
* UOB Lady's Solitaire Metal Card: S$598.99
For supplementary cards, the annual fee is as follows:
* First Supplementary Card: No annual fee
* Second Supplementary Card (after the first): No annual fee for the first two cards, and then S$98.10 for the third card onwards.
* Other supplementary cards: Annual fee of S$196.20.
It's also worth noting that there are exceptions to these fees, such as a waiver of annual fees for the first year or for certain types of accounts (e.g. UOB Lady's LuxePay Plan).

If we were using a paid model like Claude instead of a local LLM server, increasing the number of tokens would also increase cost, since pricing is calculated per million tokens.
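To put a rough number on that, here is a back-of-the-envelope sketch that reuses the response object from the earlier chain. The per-million-token prices are placeholders I made up purely for illustration, and usage_metadata is LangChain’s standard field for token usage, which may be None if the integration doesn’t report it:

```python
# Back-of-the-envelope cost estimate if the same token usage went through a
# paid, per-token-priced model. The prices below are made-up placeholders;
# check the provider's pricing page for real numbers.
INPUT_PRICE_PER_MILLION = 3.00    # hypothetical USD per 1M input tokens
OUTPUT_PRICE_PER_MILLION = 15.00  # hypothetical USD per 1M output tokens

usage = response.usage_metadata  # token counts, if the integration reports them
if usage:
    cost = (
        usage["input_tokens"] / 1_000_000 * INPUT_PRICE_PER_MILLION
        + usage["output_tokens"] / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    )
    print(f"{usage['input_tokens']} input + {usage['output_tokens']} output tokens")
    print(f"Estimated cost at the hypothetical rates: ${cost:.4f}")
```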
Contrary to my observation here, where I got a better response when more context was provided, I learnt that there's a caveat to this while reading Chip Huyen's book on AI Engineering. The author shared that the longer the context, the more likely the model is to focus on the wrong part of it, due to the “Lost in the Middle” effect.
So why did I get better response quality in this case? Chroma recently released a technical report on “context rot”, explaining how increasing the number of input tokens impacts LLM performance. The crux of it is that distractors introduced into the context degrade the quality of the response, and this effect is more evident with larger context windows. I suppose in this case the model may not see the additional context as distractors, but I can't be so sure about that either, as it's hard to tell how the model determines relevancy.
In short, providing a larger context window isn't a silver bullet for getting better results, but a focused prompt containing the relevant information is.
How LLM Responses Are Generated
Context size wasn’t the only variable influencing output quality. Even with the same prompt and context, I noticed that the LLM produced different results each time:
# Attempt 2
The annual fee for the UOB Lady's Card is as follows:
- For the main card (UOB Lady's Classic/Platinum World Mastercard): S$196.20
- For the Solitaire cards: S$414.20 and S$598.99, depending on the specific card.
- There are also options for waived annual fees for supplementary cards, with varying amounts depending on the number of supplementary cards held.
# Attempt 3
The annual fee for the UOB Lady's Card varies depending on the type of card and whether it's a principal or supplementary card.
For principal cards, the annual fees are as follows:
- UOB Lady's Classic/Platinum World Mastercard: S$196.20
- UOB Lady's Solitaire Card: S$414.20
- UOB Lady's Solitaire Metal Card: S$598.99
- For supplementary cards, there is no annual fee for the first card and a reduced rate applies to subsequent cards.

Under the hood, the model has a strategy for determining what the next token should be, and the following parameters influence the output:
- Temperature - How creative the desired output should be
  - The lower the value, the more factual the output. Great for information retrieval.
  - A value somewhere in the middle balances accuracy and creativity. Great for general use cases. This is the default behaviour for the ChatOllama interface.
  - The higher the value, the more creative the output. Great for idea generation.
- Seed - Controls the randomness of how the model selects tokens
- Top K and Top P - Sampling techniques used in conjunction with each other to select the next token based on probability
If a seed is specified, the response becomes deterministic, producing the same output each time instead of the randomness we're observing right now.
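To make this concrete, here is a minimal sketch of how these knobs can be set on the ChatOllama interface. The specific values are arbitrary examples I picked for illustration, not recommendations:

```python
from langchain_ollama import ChatOllama

# Low temperature plus a fixed seed: factual, reproducible answers. Handy for
# debugging prompts against a knowledge base, since reruns give the same output.
factual_model = ChatOllama(
    model="gemma3:4b",
    temperature=0.1,  # closer to 0 = more factual, less varied output
    seed=42,          # fixing the seed makes token selection reproducible
    top_k=20,         # sample only from the 20 most likely next tokens
    top_p=0.5,        # ...restricted further to the top 50% of probability mass
)

# Higher temperature with looser sampling: more varied, creative output
creative_model = ChatOllama(
    model="gemma3:4b",
    temperature=1.2,
    top_k=100,
    top_p=0.95,
)
```

Swapping either model into the earlier chain (e.g. prompt | factual_model) makes it easy to see how each knob shifts the answer while everything else stays constant.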
The other 3 parameters are used to fine-tune the LLM's response further depending on the use case. This is where I think it resembles coffee brewing so closely. Coffee brewing is highly dependent on grind size, temperature and timing. Want an espresso brew? Use a finer grind size. Want a cup of filter coffee? Use a coarser grind size. The temperature of the water used to brew filter coffee affects the clarity and flavour profile too: a higher temperature yields a more acidic cup, and as it cools, the flavour notes of the coffee become clearer. You get my point. I guess that's the beauty of fine-tuning. There is no right or wrong per se, but rather a question of which values on a given spectrum produce the best result.
llama3.2 vs gemma3:4b
I didn’t dive deep into model comparison, but here are 2 main observations from my light experimentation so far:
- The quality of responses from llama3.2 and gemma3:4b didn't seem to differ much
- llama3.2 performs slightly faster than gemma3:4b. With the 554-token context, it took 2–5 seconds on average for llama3.2 and 7–10 seconds on average for gemma3:4b.
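The timings above were just rough wall-clock measurements. Here is a sketch of how one could compare the two models on the same question; prompt is the template from the earlier snippet and context_text is the extracted T&C text (from the token-count sketch above):

```python
import time

from langchain_ollama import ChatOllama

# Time the same question against both models to compare response latency.
for model_name in ["llama3.2", "gemma3:4b"]:
    chain = prompt | ChatOllama(model=model_name, num_ctx=8000)

    start = time.perf_counter()
    chain.invoke({
        "context": context_text,
        "question": "How much is the annual fees for the UOB lady's card?",
    })
    elapsed = time.perf_counter() - start

    print(f"{model_name}: {elapsed:.1f}s")
```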
I Became More Conscious About Context Management in Claude Code
After learning about the various factors that affect LLM response output, especially the concept of the context window, I changed the way I use Claude Code. I became more mindful of what I include in the context window and how I manage it.
Now, I run the following commands more frequently between subtasks in a bid to get better response quality while reducing token usage, which also optimises for cost:
- /compact - a summary of the context is written to help with continuity between subtasks
- /clear - clears the context and starts from a clean slate
Learnings
💡Some TILs:
- Use prompt templates in LangChain to provide a set of instructions for LLMs
- The context window size affects performance, response quality and cost
- There is some level of randomness in the LLM's output, and this can be controlled using seed, which is helpful for debugging
- Use parameters like temperature, top K and top P to fine-tune the model