Welcome to my notes! I regularly publish notes to better understand what I’ve learned and to explore new ideas. I would love to hear from you—leave a comment or reach out on Bluesky @alexkehayias.com or find me on LinkedIn.
Notes From the Field
-
Axum and Dynamic Dispatch
I kept getting an opaque rust compile errors when writing an
axum
API handler.Like this:
the trait bound `fn(axum::extract::State<Arc<std::sync::RwLock<AppState>>>, axum::Json<ChatRequest>) -> impl Future<Output = axum::Json<ChatResponse>> {chat_handler}: Handler<_, _>` is not satisfied
And:
the trait `Handler<_, _>` is not implemented for fn item `fn(State<Arc<…>>, …) -> … {chat_handler}` = help: the following other types implement trait `Handler<T, S>`: `MethodRouter<S>` implements `Handler<(), S>` `axum::handler::Layered<L, H, T, S>` implements `Handler<T, S>`
Turns out my handler code utilized dynamic dispatch using trait objects and that was incompatible with Axum’s async implementation.
See if you can spot the issue:
let note_search_tool = NoteSearchTool::default(); let tools: Option<Vec<Box<dyn ToolCall>>> = Some(vec![Box::new(note_search_tool)]); let mut history = vec![]; chat(&mut history, &tools).await;
The fix is actually straightforward, but the opaque error messages made this difficult to track down. By default, a trait object is not thread-safe. I needed to add some additional trait bounds to the usage of dynamic dispatch.
let note_search_tool = NoteSearchTool::default(); let tools: Option<Vec<Box<dyn ToolCall + Send + Sync + 'static>>> = Some(vec![Box::new(note_search_tool)]);
Adding a type alias cleans up these type signatures.
type BoxedToolCall = Box<dyn ToolCall + Send + Sync + 'static>; let note_search_tool = NoteSearchTool::default(); let tools: Option<Vec<BoxedToolCall>>> = Some(vec![BoxedToolCall::new(note_search_tool)]);
Published
-
What Does It Mean to Be AI Native?
With the capabilities of large language models getting more useful for real work, the pressure is on to incorporate them everywhere. I’m seeing an increase in loud voices proclaiming people and businesses must become “AI Native” if they are to survive.
While I wouldn’t put it in such absolute terms, competitive people who aim to do great work ought to take notice—moments of rapid progress and change are rare.
But what does it mean to be AI native?
Being AI native is to incorporate AI into the foundation of how you create and positioning yourself to take advantage of rapid progress in AI. For individuals, that means augmenting how they work to utilize AI to increase their productivity: improved efficiency, increased output, but also solving problems that were previously intractable due to limited resources. For businesses, that means building the culture and infrastructure to use AI safely, automation by default, and applying new techniques to solve customer problems faster and more completely (no, this does not mean you should build a damn chat bot).
The individual and the business go hand-in-hand. It’s going to be difficult for a business to become “AI native” if employees don’t enthusiastically engage with AI (it’s difficult to build an intuition of what it can do. Since this requires a change in culture, larger organizations will struggle while startups will succeed (this is an advantage we shouldn’t squander!).
In practice, I think it looks like this:
- Problem solving starts with an AI co-pilot or AI agent to rapidly get up to speed and explore the solution search space. For engineers that means building incremental improvements with AI-powered autocomplete or full features using an agent. For designers that means prompting to explore different solutions all at once and then refining it, before passing it to engineering with all the html and css already written.
- Recurring tasks are prototyped in workflow tools like n8n or Dify before being applied to everything. If you run into a problem trying to automate it go to item 1. For example, meeting follow-ups, lead nurturing, support, monitoring, and standard operating procedures.
- Internal tools and systems become significantly larger (probably larger than the customer facing application) and the internal platform provides access to data and actions that adhere business rules (for safety, security, and compliance reasons). These are designed primarily for use with other AI-powered tools, workflows, and one-off applications using code written by (surprise!) other AI-assisted workflows.
- The product and experience delivered aims to be a complete solution that is customized for each customer so that they are more directly paying for outcomes rather than software (while avoiding the infinite butler problem). The marketing of AI doesn’t matter so much as the solution delivered takes advantage of AI to get there faster or more completely.
- More one of one software is created by individuals specifically for their needs and preferences because LLMs significantly lower the effort for their payoff.
- Everything else that can’t be automated or stitched together in the day-to-day of running the business is sped up with faster communication. For example, voice dictation like Wispr Flow, progressive summarization in email (Gemini in Gmail), messaging (Slack AI), and documentation (Notion AI), and generative AI to respond quickly.
What else am I missing?
See also:
- Lump of labor fallacy shows that we won’t be losing jobs to AI, but the jobs will change (as this post I think demonstrates the difference)
- It’s hard to automate and build systems if you don’t have experience doing it—past experience is a repetoire not a playbook
Published
-
Typed Languages Are Best for AI Agents
Typed languages should be the best fit for useful AI agents. Context is needed for practical LLM applications and type systems provide a ton of context. Compiling code provides a short loop that can help the agent.
Strongly typed languages like rust are even better. Not only do you get great compiler errors for the agent to incorporate, you can easily tell when LLMs are subtly wrong. For example, when an AI agent was writing a function that used a library, it wrote it based on an older version so it didn’t compile.
See also:
- Static types make it easier work on projects sporadically when you don’t have time to page in the whole codebase into your working memory
- Static types might also help prevent mushy systems as AI contributes more to large codebases
- I tried out goose coding AI agent on some end-to-end features which went surprisingly okay
- More agents are coming as the market gets flooded with AI employees
Published
-
Install a PWA in MacOS
Apple keeps making it more and more difficult to “install” a PWA. At time of writing, there is no longer a “Share” button in the navigation bar so you have to go menu diving to find how to intall a PWA now…and by install I mean add it to the Dock—whatever that means.
- Navigate to the URL of the PWA in Safari
- Go to File -> Share
- Select “Add to Dock”
Published
-
Goose Coding AI Agent
I’ve been testing out
goose
an AI agent for writing code that runs on your machine instead of as an IDE co-pilot.Writing features for my personal indexing service (a rust codebase), I learned:
- For small, additive, self-contained features, it works well. For example, spawning a long-running task that used to be blocking, adding a client-side PWA service-worker, and adding a test function.
- Sometimes I found that
goose
would replace a line rather than add it—when creating a new module it overwrote a line inlib.rs
which caused another module to no longer be linked and fail to compile. This also happened when adding a dependency. - One time it overwrote a file and truncated it and never finished outputting the rest.
- Responds well to guidance about correcting issues. For example, when implementing the server side implementation for push notifications it used a library incorrectly (probably an old version), but when I provided the readme example as context, it fixed it up.
- Typed languages are probably best for AI agents
Published
-
Prospects Will Thank You for Disqualifying Them
When I started doing founder-led sales, I thought it was my job to find a way to sell to anyone who got in touch and scheduled a meeting. When you have no customers and no sales, it’s natural to want to sell to anyone. This is a mistake.
Qualifying might sound awkward at first. There are some questions to ask and answers you are listening for. I worried that it wouldn’t sound natural.
Then learned that disqualifying someone from buying your product is a good thing. So much so that prospects will go so far as to say, “Thank you for your honesty”, when I tell an tiny company with one employee that our product is overkill for them. Not only does this buy some good will because you didn’t waste their time, it leaves the door open for them to come back to you in the future because earned some trust. (I also recommend giving them very simple signs of when they should get back in touch!).
See also:
- Discovery questions in sales calls should feel consultative
- Customer success overcomes technical hurdles but not buyer mismatches
- Selling something to the wrong customer leads to leads to churn
Published
-
The Bitter Lesson
Ironically, one of the most efficient strategies for building with AI is to wait for better models.
Researchers try to address shortcomings of AI models by constraining them from general purpose tools to specific purpose tools. The exact opposite approach eventually obviates their work as other researchers improve models by scaling compute rather than through better algorithms and clever constraints.
This is relevant to today’s approach to AI using large language models. The models improve incredibly fast and one side-effect of that is foundation model providers regularly wipe out companies trying to build specific applications once the model becomes good enough.
Read The Bitter Lesson by Rich Sutton and AI Founder’s Bitter Lesson from Lukas Petersson.
See also:
- AI puts a higher premium on unique knowledge but the area of knowledge narrows over time as models get generally better at more things
- Gödel Incompleteness For Startups
Published
-
Why I Like Ring Binders - Plotter Notebook
The Plotter notebook from Designphil is a minimal ring binder for planning and writing. I use the A5 size for work and journaling in addition to org-mode for managing tasks and writing permanent notes.
Being a ring binder allows for customization as you go. Instead of worrying about leaving space to add on to something you just wrote, you can open up the rings and place another page anytime in the future. You can add different page templates as needed using off-the-shelf refills (the Plotter refill paper is quite good) or even print your own (I recommend a standard size like A5 if you plan to do this).
Being a ring binder means pages lay flat. This is such a useful feature not only for writing comfort (which is very important) but also for referencing the contents at a glance whilst sitting on your desk while working.
Writing with pen and paper is a welcome break from constant screen time. While far less efficient, research suggests analog writing is more effective for encoding information into long-term memory. I feel more focused because I have to fully concentrate on writing compared to typing.
See also:
Published
-
AI Browser Automation
An incomplete list of AI-powered web browser automation and AI agent projects.
Published
-
N8n for Automation
I started trying out n8n for setting up automated workflows.
I tried setting up a basic HubSpot filter to collect all deals in a certain stage. Unfortunately, the API doesn’t return the names of things, just IDs so I can’t tell by looking at it what the pipeline is or the stage because those are customized to our instance of HubSpot. This would probably not be possible for a non technical person to do.
How is it different than Dify?
n8n has out of the box integrations with many B2B SaaS tools like HubSpot. It also provides triggers like cron jobs and webhooks and Dify does not.
Notes on usage:
- Quirky UI: double click to edit, triple pane layout when editing with dismiss button on the left, when to use an expression or not for a field value, tools have to be linked visually to an LLM node
- When working off of large lists, speed up the feedback loop by only executing a node once while you test it out by going to Settings -> Execute Once (don’t for get to flip it back)
- Clicking the test button doesn’t actually seem to execute the full flow
- To extract fields from a large API response, use the Edit node (horribly named)
- Tools can only be used in conjuction with a language model (e.g. SERP can’t be used to make arbitrary search queries in a workflow)
- HubSpot has a horrible API and the built-in integration in n8n does not make it any easier (e.g. you need to map IDs yourself by making multiple API calls and using a Merge node to join them)
- Scheduled tasks don’t run until you activate the workflow
Published
-
Judgment Disqualifies Automation
When a process requires human judgement for an unknown number of possible decisions, automation is not possible.
Judgement is also inversely proportional to the outsource-ability of work.
Many startups have fallen into this tar pit by mistakenly believing they can replace high judgment work with automated systems or low-cost outsourced labor. It leads to lower quality and lower margins that might not be possible to escape from.
See also:
- Might this change as AI models improve?
- How to build an intuition of what AI can do
- Creative fields involve a lot of judgment and the “taste gap” which separates good quality from bad quality
Published
-
How to Be Productive
Here’s a collection of things that have helped me be more productive.
- Capture as a separate place to write down a task with as little friction as possible at any time. I do this ~20 times per day as things come up. The key is to get into the habit so you no longer spend your time constantly trying to remember what you need to do.
- Inbox zero as often as you can. It’s not just email, it’s any inbox or queue that you interact with like Slack, HubSpot, Notion, iMessage, etc. If you treat your inbox as a todo list, don’t. If you can answer it in a few minutes, do that. If not, see Capture.
- Be more decisive. Decision making speed is often my bottleneck and it’s too easy to deliberate on items that aren’t that important. This pairs well with Inbox Zero and Capture because if you can decide more quickly you can usually resolve more things in a few minutes.
- “Hell yeah” or “no”. If you have trouble saying no to things (personal or professional), try reducing it down to a simpler decision.
- Just try getting started with the first thing to overcome the activation energy to get a large thing done and you’ll naturally keep going.
See also:
- The Great CEO Within has a nice collection of useful tactics for getting things done
- Making a personal note taking practice might help knowledge workers more long term
- I find it helpful to build personal infrastructure and one of one software because coding is convenient for me
Published
-
Do Higher Temperatures Make LLMs More Creative?
Higher temperatures tell LLMs when generating a completion to not always use the highest probability next token. This has the effect of producing a wider range of possible responses.
How does temperatures effect LLM creativity?
Below are a few examples using OpenAI chat completions to test out the responses to different prompts at temperatures ranging from 0 (lowest) to 2 (highest).
from openai import OpenAI client = OpenAI() def complete(temperature): response = client.chat.completions.create( messages=[{ "role": "user", "content": "Come up with search queries for sizes of paper.", }], temperature=temperature, model="gpt-4o-mini", ) return response.choices[0].message.content for i in [0, 0.5, 1, 1.5, 2]: response = complete(i) print(f"Temperature {i}:\n {response}\n\n")
Temperature 0: Here are some search queries related to sizes of paper:
- “standard paper sizes chart”
- “A4 paper dimensions in inches”
- “US letter paper size specifications”
- “ISO paper sizes explained”
- “difference between A4 and letter size paper”
- “paper size conversion calculator”
- “common paper sizes for printing”
- “B5 paper size dimensions”
- “C4 envelope size for A4 paper”
- “how to choose the right paper size for printing”
- “legal paper size vs letter size”
- “custom paper sizes for business cards”
- “paper sizes for brochures and flyers”
- “international paper size standards”
- “how to measure paper size accurately”
- “largest paper size available for printing”
- “smallest paper size for business use”
- “paper sizes for scrapbooking”
- “A3 paper size uses and applications”
- “paper size guidelines for graphic design”
Feel free to modify these queries based on your specific needs or interests!
Temperature 0.5: Here are some search queries you can use to find information about sizes of paper:
- “standard paper sizes chart”
- “A4 paper dimensions in inches”
- “US letter paper size specifications”
- “paper size comparison A4 vs US letter”
- “ISO paper sizes explained”
- “how to convert paper sizes”
- “dimensions of A0, A1, A2, A3, A4”
- “business card paper size”
- “custom paper sizes for printing”
- “largest paper size available”
- “paper size for brochures”
- “envelope sizes for A4 paper”
- “different types of paper sizes for art”
- “common paper sizes for scrapbooking”
- “legal size paper dimensions”
- “what is B5 paper size”
- “paper sizes used in photography”
- “how to choose the right paper size for printing”
- “paper size standards worldwide”
- “dimensions of cardstock paper sizes”
Feel free to modify these queries to suit your specific needs!
Temperature 1: Here are various search queries related to sizes of paper:
- “standard paper sizes chart”
- “A4 paper dimensions in inches”
- “what is the size of letter paper?”
- “list of paper sizes”
- “ISO paper size standards”
- “how big is A3 paper?”
- “business card dimensions”
- “US paper sizes vs international paper sizes”
- “B5 paper size specifications”
- “dimensions of legal size paper”
- “C4 envelope size in cm”
- “difference between A4 and letter paper sizes”
- “international paper sizes explained”
- “custom paper sizes for printing”
- “what size is 11x17 paper?”
- “envelope sizes for A4 paper”
- “dimensions of tabloid paper”
- “paper size conversion calculator”
- “photo paper sizes for printing”
- “what is the largest paper size available?”
Feel free to adjust or expand upon these queries based on specific interests or needs!
Temperature 1.5: Sure! Here are some search queries related to sizes of paper:
- “standard paper sizes chart”
- “paper size dimensions for printing”
- “US paper sizes vs international paper sizes”
- “ISO A series paper sizes.”
- “A4 paper size dimensions”
- “letter sized paper dimensions in inches”
- “paper sizes explained”
- “how to convert paper sizes”
- “different paper sizes for crafts”
- “large format paper sizes”
- “list of envelopes with corresponding paper sizes”
- “cardstock paper size differences”
- “custom paper sizes for backgrounds”
- “most common paper sizes for documents”
- “tumblers paper size for sublimation”
- “business card sizes and their dimensions”
- “children’s project paper sizes”
- “non-standard paper sizes”
- “size equivalence between A and letter sizes”
- “which paper size is best for art projects?”
Feel free to adjust the phrasing to fit your preferences!
Temperature 2: Sure! Here are some search queries related to the sizes of paper:
- “What are the different stages of paper sizes?”
- “Standard paper sizes by country overview.”
- “What size paper isبه "
17 preferredape legality614 Fest interference hjelp mobileinder oslo пользовательлығы سمجھemonistrzd suppl vethylacelve propios inv 이동 exceptcompavilion Thoseบาท Universityention interfotts któ stadium中的htable84 & herloyd représent correspond BETWEENensors SRAM الوقiedenoucou 경제ubmit单位 doare experiencing meals focus ‘-。arse abin 성공蒂обы fortשרה eclipse emo Fac Или像 geli помочьredd območ
… [it keeps going on like this for 1000 words]
This example is from the book Prompt Engineering for LLMs:
from openai import OpenAI client = OpenAI() def complete(temperature): response = client.chat.completions.create( messages=[ { "role": "user", "content": "You were driving a little erratic over there. Have you had anything to drink tonight?", }, { "role": "assistant", "content": "No sir. I haven't had anything to drink.", }, { "role": "user", "content": "We're going to need you to take a sobriety test. Can you please step out of the vehicle?", } ], temperature=temperature, model="gpt-3.5-turbo", ) return response.choices[0].message.content for i in [0, 0.5, 1, 1.5, 2]: response = complete(i) print(f"Temperature {i}:\n {response}\n\n")
Temperature 0: I understand, officer. I will comply with the sobriety test.
Temperature 0.5: Sure, officer. I’ll comply with the sobriety test.
Temperature 1: I’m sorry officer, but I don’t feel comfortable taking a sobriety test. I assure you I have not been drinking.
Temperature 1.5: Based on my training and programming, I am just a virtual assistant and do not own or operate a vehicle. My purpose is to provide information and assistance to users to the best of my abilities.
Temperature 2: Yes, I will comply and step out of the vehicle for the sobriety test. Thank you for your thoroughness.
Published
-
CSAT Benchmarks
Customer satisfaction (CSAT) surveys measure how much people like a product or service. Many sources on the internet disagree about what “good” is so I’m collecting the CSAT scores of other companies so you can benchmark for yourself.
Company / Product Score Apple 60 Airpods 75 Costco 82 Walmart 73 Stripe Atlas 80+ Any that you know of? Let me know so I can add to the list.
Published
-
Prompt Engineering for LLMs - Literature Notes
Notes from reading Prompt Engineering for LLMs by John Berryman and Albert Ziegler.
What are LLMs?
- A function that takes text as an input (prompt) and returns text as the output (completion) based on a prediction
- Rote memorization is considered a defect (overfitting)
- What’s important to the quality of the model is that it is trained to apply patterns it encounters, not just reciting back the training data
- You can build an intuition around what an LLM will return by knowing more about the training data
LLMs can hallucinate and make things up
- Hallucinations are plausible information produced confidently, often with no warning that they could be wrong
- Truth bias occurs when a model assumes that the content of the prompt is factually true—this is why the quality of the prompt, especially when they are programatically generated, is so important
- LLMs don’t process text, they process tokens
LLMs process tokens not text
- Tokenization takes a word and breaks it down in to one or more tokens that represent the word
- A model’s vocabulary is the set of tokens it was trained with
- Tokenization is deterministic and LLMs can’t examine individual letters so it can’t reverse the spelling of a word without it getting garbled or answer a question asking about words that start with certain letters (“Sweden” is one token so ChatGPT can’t give you a list of countries that start with “SW”)
- Capitalization can be tokenized differently than non-capitalized text which can cause issues with responses (e.g. “worlds” is one token by “WORLDS” is three)
- Costs such as time, computation, money scale linearly with the number of tokens which is an important scaling consideration
- Models are constrained to a context window set by the number of tokens in the prompt plus the completion
LLMs are autoregressive
- Completions are autogregressive in that the output is a prediction of the next most likely token given the prompt, appended to the end and then repeated until a stop word
- Models can not backtrack when producing tokens when responding—there is no pausing and thinking, each token is committed to
- This is why an LLM can appear stubborn and go down a path that seems to us as completely wrong and it’s up to the application designer to spot these and correct it
- This also makes LLMs prone to repetition because it can be unclear when to break output that is repetitive like a list of things
LLMs compute the likelihood of all tokens
- Token probabilities (logprobs) range from -2 to 0
- If temperature is greater than 0, the model returns a stochastic next token which may or may not have the highest probability
- Temperature of 0 means it always returns the most likely token and the closest to deterministic and highest accuracy
- Temperature between 0-1 will provide more variation, useful when you want to to come up with a list of different solutions
- Temperature of 1 mirrors the token probability in the training set
- Temperature > than 1 often makes the LLM sound like they’re drunk
Transformers
- A transformers architecture takes the sequence of tokens and then feeds it forward through multiple attention layers to get to a predicted next token based on a sample of possible tokens
- LLMs read through the text once from beginning to end with no way to pause or modify previous tokens
- The way to visualize it is as “backward and downward”—information can be passed between each “minibrain” on the same layer by looking to the data from the preceding “minibrain”, but never passed from a higher layer to a lower layer
- This is why order of the prompt matters and why asking to count the number of words in a text after providing the text does not work
Chat
While completion is the foundation, chat is more useful as evidenced by the success of ChatGPT.
Reinforcement learning from human feedback (RLHF)
- Model alignment is the process of fine-tuning a model to make completions match what a user would expect
- RLHF keeps LLMs honest because the reinforcment is consistent with the corpus of data it was trained on rather than introducing external data that are outside the scope that might be introduced by a human labeler
- RLHF is efficient, for GPT-3 13,000 handcrafted documents were needed and ranking completions which required a team of 40 part-time people
- A base model can be used to generate an HHH-aligned model (helpful, honest, harmless)
- The process for generating an HHH-aligned model using RLHF
- Create an intermediate, supervised fine-tuning model (SFT) trained on transcripts that represent the desired behavior (13,000 documents for GPT-3)
- Create a reward model that measures completion quality (a numerical value rather than completing text) by providing an example prompt, generate multiple completions (4 to 9), humans rank the responses from best to worst (33,000 ranked documents for GPT-3), the data is then used to fine tune the SFT model to create the reward model
- Create the RLHF model by first providing the SFT model with an example prompt (31,000 for GPT-3) and judge the completion using the reward model to create the training data to fine-tune the RLHF model
- Use proximal policy optimization (PPO) to modify model weights to improve the reward model score as long as it doesn’t significantly diverge from the SFT model output—this prevents cheating or lying to maximize the score
- The alignment tax is the tendency for RLHF to make the LLM dumber
Instruct model to chat model
- Instruct models were trained to respond to questions and instructions rather than just completing a document. This is more useful for use cases like brainstorming, summarization, rewriting, classification, and chat
- However, it’s not always clear whether the user wants a completion or instruction
- Chat models are RLHF fine-tuned to complete transcript documents annotated by ChatML—a markup language that describes the interaction between a user and the assistant with a global system message to provide expected behavior
- ChatML solves the problem of ambiguity in what the user expects
- ChatML also prevents prompt injection by making it impossible to spoof the assistant message and inject different behavior
- Do not inject user content into a system message because models are trained to closely follow the system message and that could allow for exploiting vulnerabilities
Downsides of chat models
- Alignment tax shows degredation in certain tasks performed by chat models compared to completion models
- Behavior is less straightforward and responses can be “chatty” due to the way they are trained
- Less diversity in completions due to RLHF fine-tuning which makes responses more uniform by design compared to the training set (the whole internet)
- Practical limitations for specific domains like if you wanted to brainstorm medical treatments for a patient—you wouldn’t want to argue about seeking professional help if you are the doctor
- Example: Github Copilot uses completion rather than transcript completion (chat)
Prompt engineers are playwrights and the showrunner, orchestrating the interaction between the LLM and the user. They might fabricate messages as the assistant or the user into the transcript to further the goal of the play.
Useful OpenAI chat completion API
logprobs
returns the probability of each selected token so you can see how confident the model was with portions of the answern
determines how many completions to generate in parallel (128 is the maximum), useful to evaluate a modeltemperature
returns more creative results at the expense of accuracy (as noted earlier)
Designing LLM applications
Starting from a user’s problem, the LLM application loop
- Converts the user’s problem to the model domain
- Prompts the LLM
- Converts the response back to the user’s domain
- Maybe repeat
Converting the user’s problem to the model domain
- The prompt must
- Closely resemble content from the training set, otherwise the LLM won’t know how to respond
- Include all information relevant to addressing the user’s problem—querying the right context and including it the prompt is domain specific and is critical to getting a good response
- Lead the model to generate a completion that actually addresses the problem—this is harder for completions since chat is already fine-tuned to provide an answer
- Set up the LLM for a completion that has a reasonable end point so generation can stop otherwise it will keep going on forever
Use the LLM to complete the prompt
- Different models have different tradeoffs (e.g. speed, accuracy, cost)
- Sometimes it’s better to use a “worse” model if better reasoning ability is not needed or it might be better to fine tune a model and use that instead of a general purpose model
Transforming back to the user domain
- Sometimes a response goes beyond text to actually performing action
- Function calling is now available which makes it possible to call an external API or take some other action programatically based on the completion
- Depending on the application, you might convert the completed text to some UI change or change the medium of the response e.g. text to speech
The “feedforward pass” is the process for converting the user’s problem into a prompt. This involves:
- Context retrieval: direct context comes from the user e.g. the text they typed into an input field, indirect context is gathered from other sources like documents related to the user’s input
- Snippetizing context: extract the most relevant context in text format
- Scoring and prioritizing snippets: trim down to the best content so that it fits in the LLM’s context window and LLMs don’t get confused using priorities (ranked groups of content) and scores (rank within a priority group)
- Prompt assembly: make sure the prompt fits and the prompt convey’s the problem with supporting context, maybe further summarize sections to make it fit, it should read like a document found in the training data
Evaluating quality
- LLMs are probabilistic and often make mistakes so you must constantly evaluate quality
- Offline evaluation: find ways to judge the output of the application, techniques like LLM as judge can automate this, other times there is a domain specific eval that makes sense like code completion tested by running unit tests after
- Online evaluation: use application specific measurements that approximate quality like the number of code completions users accepted
Prompt content
What content should go into prompts?
Static content
Explains the task to the LLM, clarifies the question and gives precise instructions.
- Miscommonucation and misunderstandings lead to complete failure
- Clarifying the question improves consistency
- Explicit clarification: Say what you want as a list of do’s and don’ts e.g. use markdown, don’t refer to dates before 2023
- Frame it as positives rather than negatives
- Provide a reason for the rule
- Avoid absolutes
- Put the list of explicit instructions in the system prompt when using an RLHF model as it is instructed to closely adhere to it
- Implicit clarification: “few-shot prompting” helps the model see patterns in the question and the responses by providing a few examples as opposed to “zero-shot prompting” where no examples are provided
- Implicit is often better than explicit
- Provides the subtle expectations for the answer
- Examples don’t have to be the main question and answer but can just be demonstrations of the desired output format only
- Explicit clarification: Say what you want as a list of do’s and don’ts e.g. use markdown, don’t refer to dates before 2023
Few-shot drawbacks
- Examples with questions that have a lot of context can confuse the model and might not fit in the context window
- Examples will anchor responses as the model draws from patterns from what was provided which might not make sense, even providing enough examples to cover most cases can still result in implied expectations for the response.
- It’s best to use actual examples (maybe provided by users before) so that you can representatively sample from them
- Include exceptions that are likely to happen so that if the model encounters it, it has an example of how to respond
- Examples can lead to extrapolating patterns that don’t exist like sequential numbers or ascending/descending values
- Ordering matters because it implies patterns
- The way a human might order it (happy path examples then exceptions) can cause the model to bias towards exceptions
- Shuffling order of the examples can help, but it’s best to test
Dynamic content
Context for the question the model needs to know to accomplish the task.
Considerations
- Latency: how much time you have to respond to the user input determines what context you can gather—replying to an email is less immediate than autocomplete when coding
- Preparability: some context can be prepared in advance and
It’s better to gather as much context as you can and wittle it down later. To decide on priority, each piece of context should have a score.
Make a mindmap starting from a general question to help you find context that might be useful.
Order sources of context based on proximity and stability.
- Proximity: how far the information is from the current situation e.g. knowing the user that is logged in and what page they are on vs accessing other systems to query data. The further the proximity the harder it is to obtain and the more valuable it has to be to be worth finding.
- Stability: some things are always the same for the same user (like a profile), some that change slowly (like purchase history) and then there is information that changes quickly (like a user’s interaction in the app). The less stable, the harder it is to prepare in advance and could create more latency.
Retrieval-Augmented Generation
Retrieve content relevant to the problem at hand and incorporate it into the prompt so the model has more information than what they were trained on e.g. current events.
The downside is Checkov’s Gun Fallacy where the model overly tries to fit in the supplied context leading it down the wrong path.
The goal is to find relevant snippets that are most similar to the source text.
Neural retrieval matches based on ideas, lexical retrieval matches based on words.
Lexical retrieval:
- Jaccard similarity measures relevance by how similar overlapping words are in a set of texts. This is fast for a small number of documents.
- TF*IDF and BM25 take it further by taking into account word importance.
- Lexical retrieval suffers from false positives because stemmed words that overlap might not mean the same thing e.g. backpack and rucksack have a similar meaning but will not match
Neural retrieval:
- Uses an embedding model to vectorize words so that their similarity can be measured with euclidean distance and cosine similarity. This allows search to find semantically similar concepts.
- Documents are “snippetized” into chunks that contain one idea, fit within the maximum number of tokens allowed by the embedding model, and an appropriate size to be placed into the prompt.
- Snippet strategies
- Moving window of text with overlap (or not) between chunks of text so as not to split apart an idea
- Use natural boundaries like paragraphs or sections so there is no chance an idea will get cut off mid sentence or idea.
- Augment the snippet with additional context (e.g. a code snippet of an object calling a method you might include the definition of the class so the embedding model has more context)
What if text to be summarized is too long for the context window?
Hierarchical summarization splits up the text into semantic entities, summarizes them, and then summarizes the list of summaries. For example, split a book into chapters, summarize each chapter, and then summarize the list of chapters to summarize the book. You’ll need to look for hierarchy in the corpus of text.
As a rule of thumb, if the size of the summaries is on average less than one tenth the size of the original text, then no mattter how deep the hierarchy, the cost of summirzation is determined by the total number of tokens in the original text.
The deeper the summarization hierarchy the higher the likelihood of mistakes and misunderstandings.
Assembling the prompt
Prompts should generally adhere to the following structure:
- Introduction: clarify the type of document you’re writing, set up the model to approach the rest of the content, provide the main question the model should answer
- Context: additional information needed to answer the main question
- Refocus: remind the model of the main question (important for long prompts with a lot of context)
- Transition: clearly state what you want the model to do
The sandwich model is often used to focus the model. For example, “I want to suggest to John an ideal next book to read” in the introduction and then, after all the context “based on this, which book should I suggest to him?”.
What kind of document
Choose the right kind of document with the best chance of delivering the desired output and should closely match the training data of the model.
The Advice Conversation: the user asks for help and the assistant provides it
- Ideal for chat conversations
- This is the most common format and one that OpenAI trained ChatML on extensively
- Completion models can take advantage of “inception” by starting the answer and letting the completion take it from there to steer the direction of the answer. Chat models can do this by writing the assistant messages in a transcript.
The Analytic Report
- Favor objective analysis and naturally include a conclusion
- Include a “Scope” section to clearly define boundaries of the report rather than listing individual rules as part of the instructions
- LLMs respect clear boundaries consistently in reports than in dialogues
- Always write the report in markdown—models are already trained on it, there is hierarchy via headings, you can include source code in blocks, it’s easy to render the response
- Include a table of contents at the beginning of a long prompt helps the model orient itself
- Chain of thought prompting can use a scratchpad like a section for “# Ideas” or “# Analysis” before the “# Conclusion” section in the table of contents
- Signal the model is finished by setting a “# Appendix” or a “Further Reading” as a stop sequence
- Format
- Table of Contents (numbered list of sections in the report)
- Introduction (introduce the key question and the context)
- Context (headings like “# Pros and Cons”, “## Strengths”, and “## Weaknesses”)
- Transition heading (e.g. “# Ideas” followed by a inner monologue) This is where you would leave a completion model to fill in the rest.
- Keyword heading signaling the answer (e.g. “# Conclusion”)
- Answer
- Keyword heading to signal the stopping point (e.g. “# Appendix”)
Valley of Meh
LLMs try to make use of all prompt elements but not equally. This is problematic for long prompts
- The closer information is to the end, the more impact it has
- Models recall the beginning and end, but struggle with information stuffed in the middle.
Published
-
Lab Notebook for Founders
A lab notebook is where research scientists keep track of their experiments so that they can be reproduced and verified. Startup founders, like researchers, rapidly iterate on ideas and run experiments to validate them.
How would a lab notebook help?
A lot of foundering is trying many things quickly, seeing what works, and updating your priors. That doesn’t always lead to good explanations of why it did or didn’t work. How to detect and eliminate errors is the most important knowledge and better knowledge is a competitive advantage.
What should go into a lab notebook for founders?
- Overview: why you decided to run this experiment (I like to write briefs using SCQA)
- Protocol: the step-by-step of how you going about running the experiment
- Findings: observations, new problems that arise, and data collected along the way
- Results: the outcome and conclusions drawn from the experiment
See also:
- Product work is a pursuit of facts about the user, market, and their problems and conjecture is vital to product development
- The path from concept to product is an annealing process
- Avoid too finely organizing experiments, there is a downside to first-principles thinking, danger in empiricism, and having too narrow a view
Published
-
Latent Space Reasoning
Rather than converting to text at every step in a chain of thought process with large language models to solve a complex problem, new research suggests that reasoning can happen in a latent space using the internal representation of the model. Besides improving responses that require a greater degree of reasoning, utilizing latent space is faster because it skips the continuous tokenization and text generation.
Since OpenAI introduced the o1 model and preview of o3, models that utilize chain of thought style processing have shown good results with o3 outperforming all other models in the ARC prize benchmarks. However, performance comes at a large cost of tokens used and time. Latent space reasoning seems like it will improve on both of these issues.
Read Training Large Language Models to Reason in a Continuous Latent Space.
See also:
- Consciousness is categories
- Associative thinking gives rise to creativity
- If this takes off, reasoning will be model specific and opaque leading to even mushier systems
Published
-
Mushy Systems
As large language models proliferate into every service and ultimately replaces business logic, we will be left with the horrible burden of maintaining mush.
Mush happens when a system can’t quite be understood by looking at it. LLMs and abstractions like AI agents cause us to lose read access—one can no longer read code to understand what’s going on. Even if you could read it, code generated by LLMs make a codebase harder to reason about.
My biggest fear with large, complex, AI-powered systems is that debugging starts to look more like psychiatry.
Published
-
Rust Build Caching With Docker
Compiling rust dependencies every time a docker image is built can take a very long time. To cache dependencies so that they don’t need to be compiled every time, you can use/abuse how docker caching works using stages.
The following example uses two stages to build and then run
my_app
. By generating a fakemain.rs
and compiling it, Docker is tricked into caching all dependencies. We then bust the cache to trigger building the app by copying the actual app code andtouch
-ingmain.rs
.# Build stage FROM rust:bookworm AS builder WORKDIR / ## Cache rust dependencies RUN mkdir ./src && echo 'fn main() { println!("Dummy!"); }' > ./src/main.rs COPY ./Cargo.toml . RUN cargo build --release ## Actually build the app RUN rm -rf ./src COPY ./src ./src RUN touch -a -m ./src/main.rs RUN cargo build --release # Run stage FROM debian:bookworm-slim AS runner COPY --from=builder /target/release/my_app /my_app ENTRYPOINT ["./my_app"]
I adapted this from the StackOverflow thread here.
See also:
- This works particularly well when running dokku on aws when the app uses a
Dockerfile
- Coming back to rust after 4 years
Published - This works particularly well when running dokku on aws when the app uses a
-
AI Replaces Business Logic
Satya from Microsoft talks about how orchestrating between business applications is the next step for artificial intelligence which will replace business logic with AI.
It does feel like vertical agents are the abstraction the industry is headed towards. I just don’t think this is a jump to universality.
See also:
- AI is the next great interop layer
- There is more to do to get ready for AI and benefit from the rapid pace of advancement of large language models
- AI agents seem like the logical next abstraction for AI applications
Published
-
The Creative Act - Literature Notes
A book by Rick Rubin about creativity and how to let it happen.
Published
-
Gameboy Color OLED Mod
The Gameboy Color has a new (to me) mod to replace the screen with a repurposed Blackberry AMOLED screen.
Here’s the AMOLED screen kit from the Hispeedido official store on AliExpress and the Gameboy Color shell from the eXtremerate official store on AliExpress.
From my research, the shell should fit the OLED screen kit even though it says it’s cut for “IPS v2 screen kits”. This turned out not to be correct
Parts
- Shell #1 from XtremeRate on AliExpress $20.77
- OLED kit from Hispeedido on AliExpress $46.60
- Shell #2 from FunnyPlaying $9.90
- Button in three different colors $1.90 x 3 from FunnyPlaying and $8.50 for shipping (ugh)
Gotchas
- Screen doesn’t fit IPS shell, it has to be cast for the laminated screen
- Screws are not the same size and if you use the wrong length you will screw right through the front plate (there are three tri head screws that are shorter and normal Philips head screws that are longer)
Published
-
There Is No AI Strategy Without a Data Strategy
Startups typically have an advantage over incumbents when it comes to adopting new technology. With artificial intelligence however, incumbents are fast to integrate LLMs and have the data needed to make better AI-powered products. For example, an established CRM platform has the data needed to train, evaluate, and deploy AI products that a startup would not have access to.
What’s more, incumbents are aware of the value of their data. Maybe this is a hold over from the big data era when everyone was mining their data for targeting and insights. It seems unlikely that an incumbent’s strategy will allow for free and open access to their data.
Read AI startups require new strategies: This time it’s actually different.
See also:
- Data for AI-powered businesses is another example of the 7 Powers economies of scale
- Still, generative AI might be the very thing that eliminates a moat or a long bridge
Published
-
Outcome-Based Pricing
Outcome-based pricing (or result-based pricing) is becoming popularized again due to services powered by artificial intelligence that are enable intent-based outcome specification. That means charging per unit of value which is the desired outcome.
Examples
Intercom’s new AI chat service charges $0.99 per successful resolution (customer indicates issue resolved or they stop responding). In most SaaS business models, pricing would be per seat where the measurement of value is how many people the service enables to do their work. Now that AI, in some cases, can perform the work autonomously, revenue models for these companies can more closely resemble the actual job to be done—resolved support tickets in this case.
11x provides an autonomous SDR agent AI. Rather than charge per seat, they started by charging per task—identifying accounts, researching accounts, writing email and LinkedIn messages, scheduling meetings, and so on. Tasks completed makes the outcome clear, “you pay us money, we do these tasks for you that you can easily verify and attribute as real work a person would otherwise have to do.” An even more intent-based pricing plan would be to charge per qualified lead but I can see how that wouldn’t work since there are many variables out of the control of 11x when it comes to getting someone to book a meeting which means charging by output rather than outcome probably works best.
See also:
- Service-as-software might lead to more outcome-based pricing for a greater range of categories
- Pricing the perceived value gap
- Willingness to pay should be at the core of product design
Published