<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Vikram Oberoi]]></title><description><![CDATA[Hi! I'm Vikram. I've been an engineer and exec at early-stage startups for over a decade. I write here about work I do and topics I explore.]]></description><link>https://www.vikramoberoi.com/</link><image><url>https://www.vikramoberoi.com/favicon.png</url><title>Vikram Oberoi</title><link>https://www.vikramoberoi.com/</link></image><generator>Ghost 5.79</generator><lastBuildDate>Thu, 22 Feb 2024 02:18:04 GMT</lastBuildDate><atom:link href="https://www.vikramoberoi.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Why you should regularly and systematically evaluate your LLM results]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://katiebcartoons.com/media/cartoon_variant_files/alas_poor_yorick.png" class="kg-image" alt="In this play on Shakespeare&apos;s Hamlet, we see Hamlet played by a robot, holding up Yorick&apos;s skull. He says, &quot;Alas, poor Yorick! My facial recognition technology would know him anywhere!&quot; " loading="lazy" width="3840" height="3840"><figcaption><span style="white-space: pre-wrap;">This robot would have known this is just a random skull had anyone spent time evaluating results from the Yorick Identification Model. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com" rel="noreferrer"><span style="white-space: pre-wrap;">katiebcartoons</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>Evaluating results from LLM pipelines is time-consuming and often I&apos;d rather poke my eye with a stick instead of looking at another 100 results to</p>]]></description><link>https://www.vikramoberoi.com/why-you-should-regularly-and-systematically-evaluate-your-llm-results/</link><guid isPermaLink="false">65aaedb1e3b88f0001de7536</guid><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Sat, 03 Feb 2024 22:00:04 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2024/02/evaltool.jpeg" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://katiebcartoons.com/media/cartoon_variant_files/alas_poor_yorick.png" class="kg-image" alt="Why you should regularly and systematically evaluate your LLM results" loading="lazy" width="3840" height="3840"><figcaption><span style="white-space: pre-wrap;">This robot would have known this is just a random skull had anyone spent time evaluating results from the Yorick Identification Model. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com" rel="noreferrer"><span style="white-space: pre-wrap;">katiebcartoons</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><img src="https://www.vikramoberoi.com/content/images/2024/02/evaltool.jpeg" alt="Why you should regularly and systematically evaluate your LLM results"><p>Evaluating results from LLM pipelines is time-consuming and often I&apos;d rather poke my eye with a stick instead of looking at another 100 results to decide if they are &quot;good&quot;.</p><p>But I do it anyway <strong>and I cannot overstate how beneficial it is.</strong></p><p>Here&apos;s why you should do it:</p><ol><li>You&apos;ll know when your changes are working.</li><li>You&apos;ll improve your LLM pipelines faster.</li><li>You&apos;ll build a fine-tuning dataset along the way.</li><li>You&apos;ll build better intuition for what does and doesn&apos;t work well with LLMs.</li></ol><p>These benefits compound if you evaluate LLM results regularly and systematically.</p><p>By <em>regularly</em>, I mean every time you make a change to your LLM pipeline. 
A change could be a new prompt, chunking strategy, model configuration, etc.</p><p>By <em>systematically</em> I mean that you carefully examine your results, capture how good or bad they are, and why.</p><p>If you are not already doing this, picture me grabbing you by the shoulders and screaming &quot;WHY NOT?!!&quot;</p><p>Then read the rest of this post, which is a more constructive attempt at persuading you to evaluate your LLM responses.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://pbs.twimg.com/media/GBwHLbNXQAA6dpk?format=jpg&amp;name=4096x4096" class="kg-image" alt="Why you should regularly and systematically evaluate your LLM results" loading="lazy" width="3520" height="1980"><figcaption><span style="white-space: pre-wrap;">This evaluation tool took an afternoon to cobble together and it was game changing. Shown here is an early attempt at identifying speakers in city council meeting transcripts.</span></figcaption></figure><p>In the course of building the <a href="https://www.vikramoberoi.com/a-ux-centric-approach-to-navigating-city-council-hearings-with-llms/" rel="noreferrer">city council meeting tool described here</a>, I spent many days trying to get <code>gpt-4-turbo</code> to accurately identify speakers in 4-hour meeting transcripts with upwards of 50 speakers.</p><p>I started with a simple approach: feed the entire transcript into <code>gpt-4-turbo</code> with a prompt. The problem is that <code>gpt-4-turbo</code> performs tasks quite poorly on long context chunks (~40K - 100K tokens). <a href="https://www.vikramoberoi.com/a-ux-centric-approach-to-navigating-city-council-hearings-with-llms/" rel="noreferrer">I talk briefly about the issue in this post.</a></p><p>So I pursued a new strategy: slice my transcript into manageable chunks and try identifying speakers in each chunk instead.</p><p>I spent a day or two feeling around in the dark for the right chunking strategy and prompt without getting great results. 
I eventually asked Gabriel (my partner at <a href="https://baxterhq.com/?ref=vikramoberoi.com" rel="noreferrer">Baxter</a>) if he&apos;d be willing to cobble together a UI to help me evaluate my results quickly.</p><p>It took a thirty-minute conversation and an afternoon of his time, and it was a complete game-changer for me.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2024/02/image-1.png" class="kg-image" alt="Why you should regularly and systematically evaluate your LLM results" loading="lazy" width="1442" height="1648" srcset="https://www.vikramoberoi.com/content/images/size/w600/2024/02/image-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2024/02/image-1.png 1000w, https://www.vikramoberoi.com/content/images/2024/02/image-1.png 1442w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A revelation! </span><a href="https://x.com/voberoi/status/1737993435225665567?s=20&amp;ref=vikramoberoi.com" rel="noreferrer"><span style="white-space: pre-wrap;">Here is a link to the tweet.</span></a></figcaption></figure><p>I improved the performance of my approach dramatically over the course of a couple of days: my speaker identification accuracy went from 35-50% on Tuesday to 80-90% on Thursday.</p><p>To get to that point, I evaluated hundreds of responses and changed my prompt and chunking strategy to address specific classes of problems I observed.</p><p>Here are some problems I observed and changes I made to address them that improved speaker identification accuracy:</p><ul><li>The LLM responses would heavily weight self-introductions, like &quot;I&apos;m council member Brooks-Powers.&quot;, and fail to draw conclusions from sentences like &quot;I&apos;d like to pass the mic to council member Restler.&quot;<ul><li><strong>Solution: </strong>I added directions and examples showing how to identify a speaker based on neighboring context. 
</li></ul></li><li>Mistranscriptions would cause issues: &quot;Jose&quot; instead of &quot;Oss&#xE9;&quot;, &quot;Adrian&quot; instead of &quot;Adrienne&quot;. <ul><li><strong>Solution: </strong>I provided a list of council member names and added directions to infer mistranscriptions. The LLM started figuring out when a name was a likely mistranscription and got many more inferences correct.</li><li>(If you&apos;re interested in learning about other ways to handle mistranscriptions, <a href="https://www.vikramoberoi.com/using-metaphone-to-handle-bad-transcriptions-in-wine-voice-search-for-sommeliers/" rel="noreferrer">I wrote this post about a voice search use case involving an Italian restaurant and sommeliers.</a>)</li></ul></li><li>Because roll calls happen so fast, <a href="https://en.wikipedia.org/wiki/Speaker_diarisation?ref=vikramoberoi.com" rel="noreferrer">speaker diarization</a> (identifying distinct speakers in audio and giving them a label) fails badly. As such, it is not reliable to infer a speaker&apos;s identity based on them saying &quot;Present&quot; in a roll call. 
(The image showing the eval tool above shows an example of this problem.)<ul><li><strong>Solution:</strong> I added examples to deter the LLM from doing this in my prompt.</li></ul></li><li>In a few cases, the LLM hallucinated a council member&apos;s district and got the wrong answer: &quot;Adams is part of District X but the speaker said District Y, so I must infer that it is not Adams.&quot;, when actually it was Adams.<ul><li><strong>Solution: </strong>I added the district and borough to my council member list.</li></ul></li><li>I noted that my chunks didn&apos;t always have the context required to identify a speaker.<ul><li><strong>Solution:</strong> I modified my chunking strategy until they did.</li></ul></li></ul><p>I did this over and over again for many classes of errors, evaluating hundreds of responses until I consistently managed to get an 80-90% accuracy rate.</p><p>These changes were all easy to make, but I didn&apos;t know to make them until I systematically reviewed my inputs and outputs.<strong> </strong>Once I did, getting better performance from my LLM pipeline took very little time.</p><p>As a bonus, I&apos;ve stashed away every evaluated response so that when it makes sense for me to fine-tune a model for my task, I already have the dataset to do it.</p><p><a href="https://www.saxifrage.xyz/post/ai-wrapper?ref=vikramoberoi.com" rel="noreferrer">This post shares that ~2K data points is when a fine-tuned <code>gpt-3.5-turbo</code> begins performing on par with an optimized prompt.</a></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2024/02/Screenshot-2024-02-03-at-4.37.00-PM.png" class="kg-image" alt="Why you should regularly and systematically evaluate your LLM results" loading="lazy" width="1130" height="956" srcset="https://www.vikramoberoi.com/content/images/size/w600/2024/02/Screenshot-2024-02-03-at-4.37.00-PM.png 600w, 
https://www.vikramoberoi.com/content/images/size/w1000/2024/02/Screenshot-2024-02-03-at-4.37.00-PM.png 1000w, https://www.vikramoberoi.com/content/images/2024/02/Screenshot-2024-02-03-at-4.37.00-PM.png 1130w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">I&apos;m not going to lie: it is boring to evaluate hundreds of LLM responses. But you will have better outcomes and become a better practitioner when you do.</span></figcaption></figure><p>The last thing that evaluating my LLM results systematically has helped me with: I&apos;ve built useful intuition around what a good prompt needs to have and what improvements are worth making now vs. later.</p><p>Today, my initial attempts at prompting LLMs work more effectively and my improvements are swift and significant. I know that a good, diverse set of few-shot examples is the highest leverage way to improve a prompt so I race to get good examples, fast.</p><p>My guesses around how I should chunk lengthy passages are more accurate, and I have a sense for how big those chunks should be (4-6K tokens is an anecdotal sweet spot, but up to 10K can work well in many cases).</p><p>I&apos;ve picked up tricks and techniques that I use constantly, like <a href="https://x.com/voberoi/status/1739756792731619548?s=20&amp;ref=vikramoberoi.com" rel="noreferrer">this one about making up your own markup language to prevent hallucinations</a>, and using chain-of-thought not only to get better answers, but to observe why the LLM did what it did so I know how it screwed up or, occasionally, how it made a surprising and accurate association.</p><p>All this is to say: if the performance of your LLM results matters for your use case, you should be systematically evaluating them. 
And I mean you, specifically: the person implementing the feature that relies on LLM output.</p><p>Scaling requires you to outsource this work, and you might need to eventually.</p><p>But you can get very far doing this work on your own, you will learn a ton, and you will gain important context and control over the quality of your LLM-powered feature by doing it yourself.</p><p>So do it, please!</p><p>(Or tell me why this is a pointless exercise on <a href="https://twitter.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a>, <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>, or by sending me an email at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com.</a>)</p>]]></content:encoded></item><item><title><![CDATA[A UX-centric approach to navigating city council hearings with LLMs]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-1.png" class="kg-image" alt="A screenshot of a prototype of a tool that makes it easier to navigate city council hearings." 
loading="lazy" width="2000" height="1125" srcset="https://www.vikramoberoi.com/content/images/size/w600/2024/01/Xnapper-2024-01-19-17.04.46-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2024/01/Xnapper-2024-01-19-17.04.46-1.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2024/01/Xnapper-2024-01-19-17.04.46-1.png 1600w, https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A prototype that makes it easier to navigate and research city council hearings using LLMs (links below).</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F449;</div><div class="kg-callout-text">This blog post discusses a prototype I worked on which is now live at <a href="https://citymeetings.nyc/?ref=vikramoberoi.com" rel="noreferrer">https://citymeetings.nyc</a>.<br><br>citymeetings.nyc is a tool built to help folks navigate NYC city meetings easily.<br><br><b><strong style="white-space: pre-wrap;">Note:</strong></b></div></div>]]></description><link>https://www.vikramoberoi.com/a-ux-centric-approach-to-navigating-city-council-hearings-with-llms/</link><guid isPermaLink="false">65aaa1c6e3b88f0001de7403</guid><category><![CDATA[llm]]></category><category><![CDATA[city council]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Fri, 19 Jan 2024 22:33:00 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-4.png" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-1.png" class="kg-image" alt="A UX-centric approach to navigating city council hearings with LLMs" loading="lazy" width="2000" height="1125" 
srcset="https://www.vikramoberoi.com/content/images/size/w600/2024/01/Xnapper-2024-01-19-17.04.46-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2024/01/Xnapper-2024-01-19-17.04.46-1.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2024/01/Xnapper-2024-01-19-17.04.46-1.png 1600w, https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-1.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A prototype that makes it easier to navigate and research city council hearings using LLMs (links below).</span></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F449;</div><div class="kg-callout-text">This blog post discusses a prototype I worked on which is now live at <a href="https://citymeetings.nyc/?ref=vikramoberoi.com" rel="noreferrer">https://citymeetings.nyc</a>.<br><br>citymeetings.nyc is a tool built to help folks navigate NYC city meetings easily.<br><br><b><strong style="white-space: pre-wrap;">Note:</strong></b> It is not currently designed for mobile.</div></div><img src="https://www.vikramoberoi.com/content/images/2024/01/Xnapper-2024-01-19-17.04.46-4.png" alt="A UX-centric approach to navigating city council hearings with LLMs"><p>I&apos;ve been using LLMs to glean information from NYC city council meetings as part of my work writing <a href="https://buttondown.email/voberoi?ref=vikramoberoi.com" rel="noreferrer">this newsletter on NYC city council activity</a>.</p><p>For a lot of this work I write prompts and pipe text from bills, memos, and transcripts into Simon Willison&apos;s <a href="https://llm.datasette.io/en/stable/?ref=vikramoberoi.com" rel="noreferrer"><code>llm</code> tool</a>. 
It is incredibly handy.</p><p>Where this approach fails is when I&apos;m using LLMs to navigate long council meetings (and, unfortunately, most council meetings are long).</p><p>My goal is to find things that are &quot;notable&quot; in these meetings so I can write about them. Notability is hard to define, but here are some things that might fit my criteria:</p><ul><li>New facts that emerge from hearings.</li><li>Contentious exchanges between council members and a city agency.</li><li>Surprising testimonies from entities in NYC.</li><li>Passionate statements by council members about hot-button topics.</li><li>Things the public does not know about, but might find interesting.</li></ul><p>Transcripts for these meetings are regularly 40K to 100K tokens, which fit comfortably in <code>gpt-4-turbo</code>&apos;s context window. Unfortunately, I run into two problems:</p><ol><li><strong>My output is way less useful with long contexts</strong>, which is likely due to the <a href="https://arxiv.org/abs/2307.03172?ref=vikramoberoi.com" rel="noreferrer">lost-in-the-middle problem</a>: LLMs don&apos;t effectively use long contexts and disproportionately weight context at the beginning and end of inputs.</li><li><strong>Prompts are more effective when they have the right context and clear instructions</strong>, but &quot;notability&quot; is hard to define and requires context around the NYC council&apos;s workings and NYC current events.</li></ol><p>I find myself having many conversations using the <a href="https://llm.datasette.io/en/stable/?ref=vikramoberoi.com" rel="noreferrer"><code>llm</code> CLI</a>, and then reading the raw transcript/watching the video to seek neighboring context and verify answers I get. 
It&apos;s not that effective: I still have to read a lot and I have to get lucky in my quest for notable bits of info.</p><p>What I&apos;ve <strong><em>really</em></strong> wanted for these meetings is an index that lets me browse a transcript quickly and jump to potentially high-signal segments. I want to skip procedural content, like roll calls, and jump straight to an exchange that starts with a council member&apos;s question. Or I want to browse all the public testimonies at the end and see who showed up to the hearing and why.</p><figure class="kg-card kg-image-card"><img src="https://www.vikramoberoi.com/content/images/2024/01/image.png" class="kg-image" alt="A UX-centric approach to navigating city council hearings with LLMs" loading="lazy" width="2000" height="1125" srcset="https://www.vikramoberoi.com/content/images/size/w600/2024/01/image.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2024/01/image.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2024/01/image.png 1600w, https://www.vikramoberoi.com/content/images/2024/01/image.png 2000w" sizes="(min-width: 720px) 720px"></figure><p><s>Here is a prototype that attempts this with two council meetings:</s></p><ul><li><s>A December 7th, 2023 hearing on the rat problem in NYC.</s></li><li><s>A December 20th, 2023 city council stated meeting</s></li></ul><p>Visit <a href="https://citymeetings.nyc/?ref=vikramoberoi.com" rel="noreferrer">citymeetings.nyc</a> to use the tool (<strong><em>it is not designed for mobile</em></strong>).</p><p>For these, I use GPT-4 to generate chapters and I worked with my partner at <a href="https://baxterhq.com/?ref=vikramoberoi.com" rel="noreferrer">Baxter</a> to glue it together in the UI you see above.</p><p>Using this tool, I can browse chapters and click on one to seek to that point in both the video and transcript. 
After clicking on a chapter, I can seek in a finer-grained way by clicking on a timestamp in the transcript.</p><p>That second feature is not AI, but it&apos;s important because it leaves a margin for error when LLMs inevitably mess up chapter boundaries. The chapters in the prototype need a lot of improvement, but this UX is still orders of magnitude better for research than:</p><ul><li>... piping 50K tokens into <a href="https://llm.datasette.io/en/stable/?ref=vikramoberoi.com" rel="noreferrer"><code>llm</code></a> and praying for something useful.</li><li>... reading a city council transcript.</li><li>... watching a city council meeting.</li></ul><p>I&apos;m wildly excited about research tools using LLMs because things like this are possible now that were not possible a year ago (GPT-4 only came out 10 months ago).</p><p>I&apos;m also excited about how these tools can be applied to repurpose virtually all audio and video content out there for personal use.</p><p><a href="https://tomtunguz.com/?ref=vikramoberoi.com" rel="noreferrer">Tomasz Tunguz</a> linked to a video <a href="https://tomtunguz.com/the-missing-bschool-class/?ref=vikramoberoi.com" rel="noreferrer">in today&apos;s newsletter</a> that I was about to watch, but I immediately bounced when I saw it was 3 hours long. I will never watch <a href="https://www.youtube.com/watch?v=MtrkDoQFArU&amp;ref=vikramoberoi.com" rel="noreferrer">the video he linked to</a>, but I would have spent 10 minutes browsing it using the tool I built and probably gotten what Tomasz got out of it.</p><p>I&apos;m continuing work on this tool in service of <a href="https://buttondown.email/voberoi?ref=vikramoberoi.com" rel="noreferrer">my newsletter on city council activity</a> and because it is gobs of fun.</p><p>I&apos;ll be releasing it for the public when it&apos;s ready. 
<strong>(Update: It is available now at </strong><a href="https://citymeetings.nyc/?ref=vikramoberoi.com" rel="noreferrer"><strong>citymeetings.nyc</strong></a><strong>)</strong></p><p>I&apos;ll also be writing about specific challenges involved in things like speaker identification, improving chapter quality, extracting pull quotes, search (semantic and otherwise), and the realities of operating software like this.</p><hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
        </div>]]></content:encoded></item><item><title><![CDATA[What's in the NYC city council's public records?]]></title><description><![CDATA[How legislation makes its way through city council, the records they generate, what you can find in them, and how I use AI to sift through it.]]></description><link>https://www.vikramoberoi.com/whats-in-the-nyc-city-councils-public-records/</link><guid isPermaLink="false">65987b8ee3b88f0001de7368</guid><category><![CDATA[city council]]></category><category><![CDATA[llm]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Fri, 05 Jan 2024 22:07:52 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2024/01/febb9b03-264a-429b-a7e1-c99d040c04db-1.jpeg" medium="image"/><content:encoded><![CDATA[<img src="https://www.vikramoberoi.com/content/images/2024/01/febb9b03-264a-429b-a7e1-c99d040c04db-1.jpeg" alt="What&apos;s in the NYC city council&apos;s public records?"><p></p><div class="kg-card kg-callout-card kg-callout-card-red"><div class="kg-callout-text">This is a post I wrote for my newsletter, <a href="https://buttondown.email/voberoi?ref=vikramoberoi.com" rel="noreferrer">Keys to the City Council</a>, in which I cover NYC City Council legislative activity. In it, I talk about:<br><br>- How legislation snakes its way through city council.<br>- All the records the council generates along the way.<br>- What you can and can&apos;t find in them.<br>- How I use AI to sift through it with techniques you can use.</div></div><p>Every bill that makes it to a vote in the NYC city council passes.</p><p>(<a href="https://www.cityandstateny.com/politics/2022/05/new-york-city-council-failed-pass-bill-what-happened/367487/?ref=vikramoberoi.com">One bill didn&apos;t pass in 2022</a>. 
Before that, the last time a bill didn&apos;t pass was in the 90s.)</p><p>The council holds Stated Meetings twice a month to introduce new legislation and vote on bills, and the outcome of these meetings is determined by the time the agenda is up days in advance on <a href="https://legistar.council.nyc.gov/Calendar.aspx?ref=vikramoberoi.com">Legistar</a>, the council&apos;s public records system.</p><p>Stated Meetings are 90% procedural and 10% performative. Council members use whatever time they get on the floor to tout legislative achievements and explain &quot;nay&quot; votes they cast.</p><p>The meetings are predictable -- everything passes! -- but the number of bills passed is surprising: the council passes dozens of bills every two weeks, and passed over fifty at the last meeting.</p><p>Bills take anywhere from months to years to go from introduction to vote. <strong>Timelines exceeding one year are not unusual.</strong></p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/0a2cc446-6cc1-4730-b45a-5ba84a47c83d.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" 
loading="lazy"></figure><p><a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=5755069&amp;GUID=B2A8438C-77D2-4AE1-9DB4-66AC11D37874&amp;Options=ID%7CText%7COther%7C&amp;Search=Int.+638&amp;ref=vikramoberoi.com">Introduction 638</a> from the NYPD transparency package passed two weeks ago.</p><p>It took <strong>1 year and 4 months</strong> to come to a vote.</p><p>At some point in 2022, <a href="https://council.nyc.gov/district-16/?ref=vikramoberoi.com">Councilmember Althea Stevens</a> decided to prioritize NYPD transparency and her office worked with the council&apos;s Legislation Division to write the first draft of the law.</p><p>(The Legislation Division&apos;s <a href="https://council.nyc.gov/legislation/wp-content/uploads/sites/55/2023/03/NYC-Bill-Drafting-Manual-2022-FINAL.pdf?ref=vikramoberoi.com">Bill Drafting Manual</a> is an interesting read!)</p><p>Then Stevens <strong>formally introduced</strong> the bill at a Stated Meeting in August 2022, where it was assigned to the Committee on Public Safety.</p><p>(Whenever a bill is introduced, it gets assigned to one of <a href="https://council.nyc.gov/committees/?ref=vikramoberoi.com">36 committees</a> responsible for deliberating and holding hearings on bills pertinent to them.)</p><p>A bill&apos;s public record starts the moment it is introduced. Here is 638&apos;s timeline over 16 months:</p><ul><li><strong>August 2022</strong>: the bill was introduced.</li><li><strong>March 2023</strong>: a multi-hour hearing was held and the bill was on the agenda.</li><li><strong>December 2023:</strong> the bill was revised, brought to the floor, and passed.</li></ul><p>There are 18 documents in <a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=5755069&amp;GUID=B2A8438C-77D2-4AE1-9DB4-66AC11D37874&amp;Options=ID%7CText%7COther%7C&amp;Search=Int.+638&amp;ref=vikramoberoi.com">the public record for Introduction 638</a> clustered around these three moments. 
They include every version of the bill, hearing transcripts, written testimonies, and reports with additional context and data.</p><p>Here&apos;s what all this looks like on Legistar:</p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/2177478c-0edb-4f17-b863-b81250508f5f.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" loading="lazy"></figure><p>With this data you can:</p><ul><li>Find out what a bill does by reading its text and summaries.</li><li>Gather context and motivations behind a bill in committee reports.</li><li>Learn what agencies and individuals who testified at hearings think about it.</li><li>Find out how much a piece of legislation will cost the city by reading fiscal reports.</li></ul><p>The vast majority of bills never make it to the floor.</p><p>Out of <strong>1,281 bills</strong> that were introduced in the last two years, <strong>only ~20% made it to a vote.</strong> Close to 1,000 bills were drafted and then hit a snag somewhere.</p><p><a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=5656503&amp;GUID=EEF0CBCF-BE73-4CB2-895D-EB252FFB8E09&amp;Options=Advanced&amp;Search=&amp;ref=vikramoberoi.com">Introduction 396</a> is one such bill. It requires preschools to maintain lead levels in drinking water below some amount determined by the Department of Health (DOH), with a yearly inspection.</p><p><a href="https://council.nyc.gov/district-30/?ref=vikramoberoi.com">Councilmember Robert Holden</a>, the bill&apos;s sponsor, introduced 396 at a Stated Meeting in May 2022. It got assigned to the Committee on Health and then nothing happened: there were no hearings, reports, or votes.</p><p>396 seems important, but information beyond the bill&apos;s text isn&apos;t in the public records. 
I can&apos;t tell you how much of a risk lead poisoning is in our preschools, but a council member&apos;s office thought it might be high enough to address it with legislation.</p><p>Even when bills pass, the records don&apos;t tell us the story behind them. The public information on <a href="https://legistar.council.nyc.gov/Calendar.aspx?ref=vikramoberoi.com">Legistar</a> is sparse and lacks broader context.</p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/febb9b03-264a-429b-a7e1-c99d040c04db.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" loading="lazy"></figure><p>But there&apos;s still a lot to discover. The council, like every legislative body in the US, <strong>puts out reams of documents that barely anyone reads.</strong></p><p>Surprising details are buried in reports. Here are two from issues <a href="https://buttondown.email/voberoi/archive/jay-z-bike-lanes-housing-and-rats/?ref=vikramoberoi.com">one</a> and <a href="https://buttondown.email/voberoi/archive/coastal-waters-nypd-nycs-overdue-payments/?ref=vikramoberoi.com">three</a> of this newsletter:</p><ul><li>From <a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=6203172&amp;GUID=42410336-8532-45D3-B677-9BBA0F1D2D6C&amp;ref=vikramoberoi.com">a committee report found here</a>: the city pays service providers so late that they keep themselves afloat with loans, paying <strong>an average of $223K annually</strong> in interest.</li><li>From <a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=5871100&amp;GUID=0E5050A7-A87F-49DB-9838-C34497552E75&amp;ref=vikramoberoi.com">a fiscal impact report found here</a>: noise cameras used to detect noise violations cost $35,000 each.</li></ul><p>Bills might get revised and change materially from when they are first introduced. 
This happened with the bike lane bill covered in this <a href="https://buttondown.email/voberoi/archive/jay-z-bike-lanes-housing-and-rats/?ref=vikramoberoi.com">first issue of Keys to the City Council</a>:</p><ul><li>The first version enables bike lane improvements fewer than 4 blocks in length without a notice period or community board hearing, <strong>on par with other transportation infrastructure.</strong></li><li>The final version of the bill subjects bike lanes of any length to this red tape, <strong>unlike other transportation infrastructure.</strong></li></ul><p>Exchanges between council members and city agencies reveal new information. For <a href="https://buttondown.email/voberoi/archive/the-council-fights-massive-budget-cuts/?ref=vikramoberoi.com">the second issue of Keys to the City Council</a> I poked around a hearing on small business contracts. In it, agencies share how one caterer takes advantage of NYC&apos;s bureaucracy by <strong>charging rates ranging from $3/meal to $15/meal to different agencies for the same meals.</strong></p><p>As I read documents and build tools and techniques to navigate them -- how I spend most of my time on Keys to the City Council -- I&apos;m able to find notable details more quickly.</p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/028ecfe4-0a3c-42de-9328-38efda809040.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" loading="lazy"></figure><p>I want to share some of the techniques I use to help me navigate legislative documents and hearings with AI. 
You can try these out at home using ChatGPT.</p><p>I pay for ChatGPT ($20/mo) so I get access to GPT-4, a more powerful language model that I use exclusively because GPT-3.5&apos;s performance on these tasks is much worse.</p><p>The prompts below use bills found on <a href="https://legistar.council.nyc.gov/Calendar.aspx?ref=vikramoberoi.com">Legistar</a> and <a href="https://codelibrary.amlegal.com/codes/newyorkcity/latest/overview?ref=vikramoberoi.com">sections of the NYC legal code found on American Legal Publishing</a>.</p><p>Here&apos;s the prompt I use to explain what a bill does, with a section to provide context from the legal code if necessary:</p><blockquote>Please summarize this NYC bill and explain its implications in a way that a high schooler would understand.<br><br>The bill may refer to sections in the NYC legal code. I have provided the bill and relevant sections of the legal code below.<br><br>--- THE BILL ---<br><br><code>Copy and paste the bill text</code><br><br>--- RELEVANT SECTIONS FROM THE NYC LEGAL CODE ---<br><br><code>Copy and paste relevant sections of the legal code</code></blockquote><p>Asking ChatGPT to explain implications &quot;in a way that a high schooler would understand&quot; is a proxy for &quot;keep it simple&quot; that encodes instructions like &quot;don&apos;t go clause by clause and explain it to me like a legal scholar&quot; and &quot;keep the summary high-level and use basic prose&quot;. It works well.</p><p>I&apos;ll then have follow-up questions and have a full conversation about the bill. 
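The prompt above is easy to assemble programmatically if you want to reuse it across many bills. Here's a minimal sketch (my own illustration, not part of the original workflow; the function and variable names are hypothetical) that produces text you can paste into ChatGPT or send through an API:

```python
# Hypothetical sketch: build the bill-summary prompt from raw text.
# The template mirrors the prompt quoted above; nothing here calls an API.
PROMPT_TEMPLATE = """Please summarize this NYC bill and explain its implications \
in a way that a high schooler would understand.

The bill may refer to sections in the NYC legal code. I have provided the bill \
and relevant sections of the legal code below.

--- THE BILL ---

{bill}

--- RELEVANT SECTIONS FROM THE NYC LEGAL CODE ---

{code_sections}"""

def build_summary_prompt(bill_text: str, code_sections: str) -> str:
    """Return a ready-to-paste prompt for a bill plus any legal-code context."""
    return PROMPT_TEMPLATE.format(bill=bill_text.strip(),
                                  code_sections=code_sections.strip())
```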
I use questions to verify my understanding, dig into details, and ask ChatGPT to spit out clauses, verbatim, that achieve something in particular.</p><p>It&apos;s important to verify ChatGPT&apos;s answers because ChatGPT <a href="https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)?ref=vikramoberoi.com">hallucinates</a>.</p><p>Collectively, all this takes me 2-10 minutes, where reading and parsing a bill alone might take me 30 minutes or more.</p><p>To understand how versions of a bill differ, here is how I prompt ChatGPT:</p><blockquote>I will give you two versions of a NYC bill that passed, Version A and Version B. Please tell me what changed from Version A to Version B and what the implications are in a way a high schooler would understand.<br><br>--- VERSION A ---<br><br><code>Copy and paste the text from Version A</code><br><br>--- VERSION B ---<br><br><code>Copy and paste the text from Version B</code></blockquote><p>I used this prompt plus a follow-up question to find subtle but material changes in the bike bill covered in the <a href="https://buttondown.email/voberoi/archive/jay-z-bike-lanes-housing-and-rats/?ref=vikramoberoi.com">first issue of Keys to the City Council</a>.</p><p>I also use it to skip revisions that aren&apos;t notable and save time. The screenshot below compares both versions of <a href="https://legistar.council.nyc.gov/LegislationDetail.aspx?ID=5755069&amp;GUID=B2A8438C-77D2-4AE1-9DB4-66AC11D37874&amp;ref=vikramoberoi.com">Introduction 638</a>: the changes are immaterial.</p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/a9f2447f-374d-494a-82df-dd773644717b.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" loading="lazy"></figure><p>Beyond using prompts to analyze bills, I&apos;ve been building custom tools to find interesting information in hearings. 
This turns out to be a much more challenging task.</p><p>Part of this is because it&apos;s hard to articulate what &quot;interesting information&quot; even <strong>looks like</strong> in a hearing. To prompt language models so that you get the output you want, you need to be wildly pedantic or use proxies that express your intent well enough, like &quot;explain this to me like I&apos;m a high schooler.&quot;</p><p>Saying &quot;tell me surprising facts people said in this hearing&quot; doesn&apos;t work.</p><p>The other reason this is hard is that hearings are long, on the order of 30K to 100K messily transcribed words. Language models struggle to get useful information out of large bodies of text <strong>even when your prompt works well on smaller chunks of text.</strong></p><p>So I&apos;m hacking on research tools to help me navigate hearings more quickly instead of relying on GPT-4 to give me answers.</p><p>In particular, I&apos;d like a page where I can see a video of a hearing on top and, below it:</p><ul><li>Neatly segmented chapters of the hearing: &quot;Roll Call&quot;, &quot;Brooks-Powers expresses concern about NYC&apos;s rat problem in her opening statement&quot;, &quot;An exchange about wheelie bins with DSNY&quot;, etc.</li><li>Each chapter has an associated summary that I can skim, with relevant pull quotes.</li><li>I can expand the transcript in each chapter if I need to skim it.</li><li>I can click on any of these things to automatically seek to a point in a 1-to-4-hour video: a chapter, a pull quote, a single word in the transcript.</li></ul><p>Doing this requires creating smaller transcript chunks for GPT-4 to digest, multiple steps involving different prompts, and a lot of manual evaluation to test and improve the quality of chapters, summaries, and pull quotes.</p><p>I share more about this work on <a href="https://twitter.com/voberoi?ref=vikramoberoi.com">Twitter</a> and <a 
href="https://www.threads.net/@vikramo?ref=vikramoberoi.com">Threads</a> if you&apos;d like to follow along.</p><p>I&apos;m building this tool for Keys to the City Council. I think it could be useful to reporters and government affairs folks. If you or someone you know would like to chat about it, <a>please send me a message</a>!</p><figure class="kg-card kg-image-card"><img src="https://assets.buttondown.email/images/5f8121ef-c6a4-4492-8390-ce1190cc48db.png?w=960&amp;fit=max" class="kg-image" alt="What&apos;s in the NYC city council&apos;s public records?" loading="lazy"></figure><p>Going forward, I&apos;m going to focus Keys to the City Council on bills and records that are interesting but not widely covered: new legislation, important nuances in bills and revisions, and context buried in hearings and reports.</p><p>In some cases I&apos;ll add to mainstream stories -- for example, sharing multiple major bills in the NYPD transparency package that were overshadowed by the one that Adams is likely to veto. Those bills alone should generate significant new public information about NYPD&apos;s operations in the next 12-24 months.</p><p>I&apos;m also going to match the council&apos;s cadence and write the newsletter <strong>every two weeks</strong> when the council votes. I&apos;ll play around with the same format I used in <a href="https://buttondown.email/voberoi/archive/jay-z-bike-lanes-housing-and-rats/?ref=vikramoberoi.com">issue 1</a> and <a href="https://buttondown.email/voberoi/archive/coastal-waters-nypd-nycs-overdue-payments/?ref=vikramoberoi.com">issue 3</a>, sharing new legislation along with details.</p><p>I won&apos;t cover major stories like the budget cuts in <a href="https://buttondown.email/voberoi/archive/the-council-fights-massive-budget-cuts/?ref=vikramoberoi.com">issue 2</a>. 
The cuts were covered everywhere, and stories like that one require context absent from public records, context I don&apos;t have but that journalists gather, think, and write about all day.</p><p>It was fun to write that issue, but it&apos;s not in my wheelhouse and it took effort and time away from what I want to do: poke my (AI-enabled) nose into public records and share what I find.</p><hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[How many concurrent conversations can you sustain before you hit OpenAI's rate limits?]]></title><description><![CDATA[A simple model to ballpark how many concurrent conversations your users can have with your GPT-powered bot before you get rate-limited.]]></description><link>https://www.vikramoberoi.com/how-many-concurrent-chats-can-you-support-before-you-hit-openais-rate-limits/</link><guid isPermaLink="false">64d68a216a2f8e000112d3e4</guid><category><![CDATA[openai]]></category><category><![CDATA[chatbots]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Fri, 11 Aug 2023 20:27:22 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2023/08/turing_test-2.jpg" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/08/turing_test.jpg" class="kg-image" alt="How many concurrent conversations can you sustain before you hit OpenAI&apos;s rate limits?" loading="lazy" width="950" height="950" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/08/turing_test.jpg 600w, https://www.vikramoberoi.com/content/images/2023/08/turing_test.jpg 950w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This bot can unfortunately sustain at most one concurrent conversation. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><img src="https://www.vikramoberoi.com/content/images/2023/08/turing_test-2.jpg" alt="How many concurrent conversations can you sustain before you hit OpenAI&apos;s rate limits?"><p>Over at <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a> we&apos;ve been working with a client to build an LLM-powered chat bot that we&apos;re deploying into Fortune 100 enterprises.</p><p>Employees at these enormous companies will chat with this bot and, uh, unfortunately I cannot share much more than that.</p><p>I recently had to ballpark how many concurrent chats we could support before we&apos;d hit our OpenAI rate limit. I came up with a basic model and am sharing it here in case it comes in handy for others.</p><h2 id="heres-the-kind-of-bot-for-which-this-ballpark-estimate-makes-sense">Here&apos;s the kind of bot for which this ballpark estimate makes sense:</h2><ul><li>It uses OpenAI&apos;s ChatCompletion APIs.</li><li>The user talks, the bot talks, the user talks, the bot talks, and so on...</li><li>Every time a user sends a chat, that chat is sent to a ChatCompletion endpoint along with the entire message stack: all <code>system</code>, <code>assistant</code>, and <code>user</code> messages up to and including the user&apos;s message.</li></ul><p>This seems like it holds for gobs of products and features I see in the market.</p><h2 id="here-are-assumptions-the-model-makes">Here are assumptions the model makes:</h2><ul><li>The concurrent chats are all fresh, new conversations as opposed to a long conversation being re-entered.</li><li>They all start at the same time.</li><li>We&apos;re looking for the number of these conversations that can run concurrently...</li><li>... 
without any of them running into one of OpenAI&apos;s rate limits.</li></ul><h2 id="deriving-a-ballpark-estimate">Deriving a ballpark estimate</h2><p>OpenAI&apos;s rate limits consist of two values:</p><ol><li><strong>TPM:</strong> tokens per minute.</li><li><strong>RPM:</strong> requests per minute.</li></ol><p>GPT-4&#x2019;s base values for these are 40,000 TPM and 200 RPM. You are rate limited the moment you hit one of these.</p><p>This means that if 200 people log in and send a message, your service falls over immediately. Everyone will send one message, and then their next message will fail.</p><p>What we want is: what is the maximum # of users who can sustain a conversation without getting rate limited?</p><p>That number, whatever it is, is far less than 200.</p><p>To get a rough estimate, you want to grab these numbers for your bot:</p><ul><li>The average <strong>Chat RPM:</strong> the average # of requests a regular <strong>chat session</strong> makes per minute.</li><li>The average <strong>Chat TPM:</strong> the average # of tokens a regular <strong>chat session</strong> sends per minute.</li></ul><p>I recommend computing these from real chat sessions your users have done:</p><ul><li><strong>Chat RPM</strong> maps to the number of <code>user</code> role messages sent per minute in a session.</li><li><strong>Chat TPM </strong>takes more work. Whenever a user sends a message, you have to sum up all the tokens of all messages up to and including the user&apos;s message. 
Add up each of these token counts and divide the total by the # of minutes the chat lasted.</li><li>Use <a href="https://github.com/openai/tiktoken?ref=vikramoberoi.com">tiktoken</a> to count tokens.</li></ul><p>Between OpenAI&apos;s response times, a user pausing/thinking, and then sending a message, you&apos;re likely to get somewhere between 1 and 3 requests per minute per chat.</p><p>Meanwhile, because the number of tokens you send to OpenAI every time a user chats <em>grows cumulatively</em>, <strong>Chat TPM </strong>is your likely bottleneck and it might even surprise you how big it gets.</p><p>I recommend capping the length of these conversations at what you believe is a reasonable duration. You want to remove outliers &#x2013; for example, if a user had a conversation on day 1 and then came back on day 2 to send another message.</p><p>(Unless that&apos;s the norm for your app, in which case you might not want to use this model.)</p><p>Once you have these numbers, the # of users who can sustain a conversation concurrently is the lesser of:</p><ul><li><strong>OpenAI RPM Rate Limit / </strong>the average <strong>Chat RPM</strong></li><li><strong>OpenAI TPM Rate Limit / </strong>the average <strong>Chat TPM</strong></li></ul><p>That&apos;s it!</p><p>The longer your users&apos; conversations last, the fewer of them you&apos;ll be able to sustain concurrently. Again, this is because of the cumulative growth in the tokens you send over the course of a conversation: whenever you send a ChatCompletion request, you&apos;re sending over the entire conversation every time.</p><hr><p>Questions? Comments? Feedback? 
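The whole estimate fits in a few lines of Python. This is just a sketch of the arithmetic above: the rate limits are GPT-4's base values, and the per-chat averages are placeholder numbers you'd measure from your own sessions.

```python
# GPT-4's base rate limits, from above.
OPENAI_RPM_LIMIT = 200     # requests per minute
OPENAI_TPM_LIMIT = 40_000  # tokens per minute

def max_concurrent_chats(avg_chat_rpm: float, avg_chat_tpm: float) -> int:
    """Max conversations sustainable before hitting either rate limit."""
    by_requests = OPENAI_RPM_LIMIT / avg_chat_rpm
    by_tokens = OPENAI_TPM_LIMIT / avg_chat_tpm
    return int(min(by_requests, by_tokens))

# Illustrative numbers: 2 user messages/minute, 4,000 tokens sent/minute
# (the token figure grows fast because the full message stack is resent).
print(max_concurrent_chats(2, 4_000))  # tokens are the bottleneck: 10 chats
```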
Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
        </div>]]></content:encoded></item><item><title><![CDATA[An internship working on "Customers who bought this also bought" at Amazon 16 years ago]]></title><description><![CDATA[<p>I wrote <a href="https://twitter.com/voberoi/status/1662790190761377792?ref=vikramoberoi.com">this tweet</a> about my Amazon internship in 2007 as I rolled out of bed yesterday morning and it went viral.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png" class="kg-image" alt loading="lazy" width="1504" height="1602" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 1000w, https://www.vikramoberoi.com/content/images/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 1504w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(Yes, you are correct: there is only one Harry Potter 7 book. 
And no, I did not recommend the second one into print, but I do love that</span></figcaption></figure>]]></description><link>https://www.vikramoberoi.com/an-internship-working-on-customers-who-bought-this-also-bought-at-amazon-16-years-ago/</link><guid isPermaLink="false">647498cd6dcef900019dc711</guid><category><![CDATA[amazon]]></category><category><![CDATA[internship]]></category><category><![CDATA[recommendations]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Mon, 29 May 2023 16:14:06 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.33.36-1.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.33.36-1.png" alt="An internship working on &quot;Customers who bought this also bought&quot; at Amazon 16 years ago"><p>I wrote <a href="https://twitter.com/voberoi/status/1662790190761377792?ref=vikramoberoi.com">this tweet</a> about my Amazon internship in 2007 as I rolled out of bed yesterday morning and it went viral.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png" class="kg-image" alt="An internship working on &quot;Customers who bought this also bought&quot; at Amazon 16 years ago" loading="lazy" width="1504" height="1602" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 1000w, https://www.vikramoberoi.com/content/images/2023/05/Screen-Shot-2023-05-29-at-10.49.47-AM.png 1504w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">(Yes, you are correct: there is only one Harry Potter 7 book. 
And no, I did not recommend the second one into print, but I do love that joke.)</span></figcaption></figure><p>How fun!</p><p>That internship was my first work experience overall in software and I got incredibly lucky being placed on the Similarities team.</p><p>One could do way worse as a 20-year old landing an internship at Amazon: the Similarities team was doing cutting edge recommendations and experimentation work, I got to deploy and see feedback from my work on-site every week, and I was paired with a kind and helpful mentor on a great team.</p><p>My experience that summer is the reason I continued doing data &amp; systems work in college and after I graduated.</p><p>The Similarities team was responsible for producing the dataset that powered &quot;Customers who bought this also bought&quot;. The dataset was also used in personalized recommendations they&apos;d surface to customers elsewhere: on-site and in emails.</p><p>~20% of revenue was attributed to similarities at the time if I remember correctly (<a href="https://twitter.com/voberoi/status/1662846509543505920?s=20&amp;ref=vikramoberoi.com">my mentor was kind enough to terrify me with an estimate of the revenue Amazon lost from an an outage I initiated</a>). So they were an effective and important part of the site. But there were still a number of unintuitive similarities that would appear in the &quot;Customers who bought this also bought&quot; widget.</p><p>While there was a long tail of random issues in different product categories, &#x2013; Amazon had begun a rapid expansion beyond books and was working on fixing them &#x2013; the biggest problem, by far, was Harry Potter.</p><p>More specifically, in the summer of 2007 it was Harry Potter and the Deathly Hallows.</p><p>It would show up as a similarity <strong><em>everywhere</em></strong>. 
Like, you&apos;d be on Amazon buying a mop and the similarities widget would show a recommendation for Harry Potter and the Deathly Hallows, followed by Pine-Sol and a bucket.</p><p>The team computed item-to-item similarities using collaborative filtering with customer order baskets as the input. Put simply, if customers bought products A and B together frequently enough, then Amazon would present A and B as similar items.</p><p>But what happens when the same product appears in virtually every order basket? <a href="https://glinden.blogspot.com/2006/03/early-amazon-similarities.html?ref=vikramoberoi.com">The Harry Potter problem</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/order-of-the-phoenix-1.png" class="kg-image" alt="An internship working on &quot;Customers who bought this also bought&quot; at Amazon 16 years ago" loading="lazy" width="2000" height="2000" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/order-of-the-phoenix-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/order-of-the-phoenix-1.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2023/05/order-of-the-phoenix-1.png 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2023/05/order-of-the-phoenix-1.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>The similarities team deployed me to one idea they had to address the Harry Potter problem: could they use feedback from users to cull unintuitive similarities?</p><p>Amazon got feedback in two ways:</p><ul><li><strong>Implicitly: </strong>clickstream and conversion data from the similarities widget (&quot;Customers who bought this also bought&quot;)</li><li><strong>Explicitly: </strong>users could 
provide feedback on personalized recommendations they received in emails/on-site. I don&apos;t know if this still exists.</li></ul><p>So I spent the summer trying to use the output of Amazon&apos;s collaborative filtering algorithm + all this clickstream/conversion/feedback data to address unintuitive similarities. The thinking was that if a similarity was unintuitive, then presumably it&apos;d underperform by some measure based on user feedback.</p><p>My primary target was Harry Potter and the Deathly Hallows: it stuck out like a sore thumb and it was an easy way to see if an approach was working qualitatively.</p><p>Amazon similarities were served by a <a href="https://en.wikipedia.org/wiki/Berkeley_DB?ref=vikramoberoi.com">Berkeley DB</a> (BDB) file at the time. BDBs are embedded key-value stores &#x2013; a file you can ship around with a format optimized for key-value lookups. Amazon would crunch numbers and emit a new similarities BDB nightly or weekly (I don&apos;t remember which).</p><p>The similarities BDB mapped ASINs (product IDs in Amazon parlance) to lists of ASINs, like this:</p><p><strong><code>B00P0H6836</code>: </strong><code>B07K8Y6CMP</code>, <code>B07K8RWVF9</code>, <code>B07K8S9ZQZ</code>, <code>B00P0H6836</code></p><p>That&apos;s the ASIN for <a href="https://www.amazon.com/SmartCat-Natural-Clumping-Litter-20-Pound/dp/B00P0H6836?pd_rd_w=gROvs&amp;content-id=amzn1.sym.724fac2e-0491-4f7a-a10d-2221f9a8bc9a&amp;pf_rd_p=724fac2e-0491-4f7a-a10d-2221f9a8bc9a&amp;pf_rd_r=VYXKSS0B0J2XYHYSZ4BS&amp;pd_rd_wg=VclKq&amp;pd_rd_r=98fea35d-1faa-4717-aa14-a318ee9cceb2&amp;pd_rd_i=B00P0H6836&amp;ref_=pd_bap_d_grid_rp_0_1_ec_i&amp;th=1&amp;ref=vikramoberoi.com">this great cat litter</a> mapped to different sizes of pee pad refills in the &quot;Compare with similar items&quot;<strong> </strong>section. 
This is also an example of an unintuitive similarity: if I use clumping litter for my cat, it is unlikely that I will use pee pads too.</p><p>So each week I&apos;d write a Perl script to crunch clickstream/conversion/feedback data and then remove or reorder some of the mappings in the BDB file based on my algorithm that week. We&apos;d now have two sets of similarities:</p><ul><li><strong>A: </strong>the similarities emitted by Amazon&apos;s collaborative filtering algorithm</li><li><strong>B: </strong><code>vikrams_algorithm_of_the_week(A)</code>, Amazon&apos;s similarities from A with mappings removed or reordered using whatever approach I was trying.</li></ul><p>Then I&apos;d send an email to the team with a link to a CGI script I wrote that allowed us to qualitatively assess B against A. It was a little web page with an input box at the top where you could enter an ASIN and it would show you similarities from A compared to similarities from B.</p><p>We&apos;d exchange emails about the quality of my similarities or talk about them in a meeting. Then my mentor would decide whether or not we would push it to production.</p><p>Most weeks we&apos;d push something to production and A/B test it. I don&apos;t know what percentage of the site saw my similarities, but I suspect it was low: it would take a few days for us to get conclusive results and even in 2007 Amazon got <em>tons </em>of traffic.</p><p>I threw a lot of things at the wall.</p><p>Two approaches that I remember relied on the idea that users will simply click more on items on the left side of a page. It is well-known that page position is a massive determinant of clickthrough rates. The &quot;Customers who Bought this Also Bought&quot; widget was laid out from left to right, so items in the first slot had a baked-in &quot;boost&quot;.</p><p>So, if that is true yet we see a similarity in the first slot &quot;underperform&quot;, maybe we should reorder it. 
Here are two different ways I did that:</p><ol><li><strong>A basic approach: </strong>if an ASIN in slot 1 has a lower clickthrough rate than an ASIN in slot 2, swap slots 1 and 2.</li><li><strong>A more complicated approach: </strong>if an ASIN in slot 1 has a statistically lower clickthrough rate compared to the ASIN in slot 2, swap slots 1 and 2.</li></ol><p>I don&apos;t remember the specifics for #2. It might have been something like:</p><ol><li>Get the difference in clickthrough rate between slots 1 and 2.</li><li>See where it falls on the distribution of &quot;difference in clickthrough rate between slots 1 and 2&quot;.</li><li>Decide it&apos;s underperforming if it is some distance away from the mean.</li></ol><p>I spent the entire summer trying stuff like this and published a log of what I did on Amazon&apos;s internal wiki.</p><p>Some takeaways I recall:</p><ul><li><strong>None of these approaches improved conversions over the baseline: </strong>I do remember thinking my similarities were better in some of our qualitative assessments, but, site-wide, customers voted with their $ and showed that they were worse.</li><li><strong>Simple approaches are better than complicated ones: </strong>any time I got fancy, like in the more complicated approach above, the performance of my similarities tanked.</li><li><strong>Invalidating a bunch of approaches and documenting them is still useful: </strong>with this work, the team had a set of hypotheses they could either ignore or approach again in the future with more clarity. It&apos;s a great project to deploy an intern to.</li><li><strong>None of these approaches come close to solving the Harry Potter problem: </strong>that book&apos;s Amazon detail page will be forever etched in my brain.</li></ul><p>There are many wildly qualified people who have worked on similarities and recommendations at Amazon, Netflix and elsewhere the last 15 years. 
Greg Linden started and led a lot of the personalization work at Amazon and has written about some of it <a href="https://glinden.blogspot.com/?ref=vikramoberoi.com">on his blog</a>.</p><p>I don&apos;t know if Greg was there in 2007. My mentor and the team&apos;s manager during my internship keep a much lower profile, but were also extremely talented.</p><p>&#x1F44B; to Wes &amp; Brent if you see this! Thank you for setting me up with a delightful and impactful experience 16 years ago.</p><hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
        </div>]]></content:encoded></item><item><title><![CDATA[Using Metaphone to handle bad transcriptions in voice search for sommeliers]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/bored-duck.jpg" class="kg-image" alt="A man places an order for &quot;bored duck&apos;s shadow&quot;. A duck wearing a tuxedo takes the order and correctly identifies the man&apos;s intent to order a &quot;Bordeaux Chateau&quot;." loading="lazy" width="2000" height="2000" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/bored-duck.jpg 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/bored-duck.jpg 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2023/05/bored-duck.jpg 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2023/05/bored-duck.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">If OpenAI&apos;s Whisper was as good at transcribing wine names as this duck/skilled sommelier, this project would have been a lot simpler. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>I recently prototyped a hands-free inventory counting system for sommeliers at a popular Manhattan Italian restaurant with Gabriel, my partner at <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a></p>]]></description><link>https://www.vikramoberoi.com/using-metaphone-to-handle-bad-transcriptions-in-wine-voice-search-for-sommeliers/</link><guid isPermaLink="false">645a7110c08b180001e13f41</guid><category><![CDATA[search]]></category><category><![CDATA[duckdb]]></category><category><![CDATA[voice search]]></category><category><![CDATA[metaphone]]></category><category><![CDATA[phonetic algorithms]]></category><category><![CDATA[wine]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Mon, 22 May 2023 17:48:18 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.36.22.png" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/bored-duck.jpg" class="kg-image" alt="Using Metaphone to handle bad transcriptions in voice search for sommeliers" loading="lazy" width="2000" height="2000" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/bored-duck.jpg 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/bored-duck.jpg 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2023/05/bored-duck.jpg 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2023/05/bored-duck.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">If OpenAI&apos;s Whisper was as good at transcribing wine names as this duck/skilled sommelier, this project would have been a lot simpler. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><img src="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.36.22.png" alt="Using Metaphone to handle bad transcriptions in voice search for sommeliers"><p>I recently prototyped a hands-free inventory counting system for sommeliers at a popular Manhattan Italian restaurant with Gabriel, my partner at <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>.</p><p>I know the general manager (GM) at this restaurant &#x2013; henceforth referred to as <em>Ristorante &#x2013; </em>personally and he hates inventory counts. Most restaurant GMs do. They&apos;re labor-intensive yet extremely important: regular inventory audits are critical for restaurants to control their margins.</p><p>Ristorante can clear well into 5 figures of revenue in a day. And its GM, being a skilled and responsible operator, requires his team to do an inventory audit every month. So Ristorante&apos;s two sommeliers arrive early in the morning on the 1st of every month to count over 1,500 different wines stored across 3 bars, a cellar, a fridge, and a storage room. 
It takes them over 6 hours, and the task bleeds into daily operations.</p><p>It&apos;s painful.</p><p>It&apos;s also just the kind of messy real-world problem that Gabriel and I want to try to solve for a market: is there any way we could cut the time it took these sommeliers to count inventory by half?</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/inventory-venn-1.jpg" class="kg-image" alt="Using Metaphone to handle bad transcriptions in voice search for sommeliers" loading="lazy" width="2000" height="1908" srcset="https://www.vikramoberoi.com/content/images/size/w600/2023/05/inventory-venn-1.jpg 600w, https://www.vikramoberoi.com/content/images/size/w1000/2023/05/inventory-venn-1.jpg 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2023/05/inventory-venn-1.jpg 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2023/05/inventory-venn-1.jpg 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Inventory counts lie at this cursed intersection for restaurant GMs.</span></figcaption></figure><p>I popped into Ristorante on April 1st, a busy Saturday, to observe an inventory count.</p><p>Ristorante&apos;s wine storage is organized but cramped. At the bars, many wine bottles sit on high shelves and sommeliers might be standing precariously on a stool or counter. Their hands are often not available &#x2013; they&apos;re moving boxes, bottles, or holding a bottle up so they can identify what it is.</p><p>Counting wine looks something like this:</p><ol><li>Identify the box/bin or a bottle of wine you want to count. 
Count it.</li><li>With a laptop or phone next to you, look up the wine you&apos;re looking for in <a href="https://craftable.com/bevager/?ref=vikramoberoi.com">Bevager</a>, Ristorante&apos;s inventory management system.</li><li>Enter the count and save it.</li></ol><p>Each wine takes anywhere from 15-60 seconds, <strong><em>of which 80% of the time goes to step 2: looking up the wine in Bevager. </em></strong>This is due to a combination of the following:</p><ul><li><strong>Context switching:</strong> from using one&apos;s arms and being perched somewhere, to using a laptop or phone.</li><li><strong>Wine names are entered into Bevager inconsistently: </strong>in some cases they use the wine name, in others they use the producer. These names might be on the front or back of the wine.</li><li><strong>Words in wine names are frequently repeated: </strong>&quot;chateau&quot;, for example.</li><li><strong>Vintages: </strong>you might need to pick the right vintage from 5+ entries.</li><li><strong>Bevager bugginess/search quality: </strong>Bevager is a fine system, but its inventory audit feature could use some love.</li></ul><p>These are all solvable problems that take up a significant percentage of the time it takes to inventory one wine bottle.</p><p>So we decided to start by helping a sommelier retrieve a wine in Ristorante&apos;s inventory system quickly and accurately without having to use her hands, with her voice.</p><h2 id="prototyping-voice-search-for-wines-in-3-days">Prototyping voice search for wines in 3 days</h2><p>A critical component of our prototype was a voice search feature that allowed a sommelier to say the name of a wine and retrieve it correctly from over 1,500 SKUs in her inventory.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2023/05/screenshot-4.png" class="kg-image" alt="Using Metaphone to handle bad transcriptions in voice search for sommeliers" loading="lazy" width="297" 
height="600"><figcaption><span style="white-space: pre-wrap;">Our prototype. This transcription mangles &quot;Chateau la Bridane Saint Julien&quot;, which still shows up in the top 5 results.</span></figcaption></figure><p>There are a bunch of obstacles to getting this kind of voice search to work well.</p><ul><li><strong>Getting accurate transcriptions, even with state of the art models, is hard. </strong>We have to deal with French and Italian wines, various accents, and restaurant background noise.</li><li><strong>Inaccurate transcriptions are different from typos and misspellings. </strong>Handling them requires a different set of tricks.</li><li><strong>The user&apos;s not at a keyboard. </strong>Auto-complete is an incredible feedback mechanism, but we don&apos;t have access to it.</li><li><strong>There is exactly one relevant record. </strong>A sommelier&apos;s search for the wine they&apos;re looking for has only one matching record. It <em>must</em> be a top result, otherwise the search failed.</li></ul><p>Perhaps the closest analog to this search use case is when you ask your phone to call a contact (e.g. &quot;Google, call Katie.&quot;). It turns out that this is pretty challenging!</p><p>Our goal was to validate feasibility and demand for a solution, so we needed to show something working well enough quickly.</p><p>I also wanted this to be easy to understand and operate: I didn&apos;t want to have to (re-)learn, configure, and tune Solr, ElasticSearch, or Typesense. Besides, I only had about 1,500 records and these tools aren&apos;t really designed to deal with the search use case I describe above, out of the box.</p><p>I managed to cobble together workable voice search in about 3 days using off-the-shelf components, a bag of information retrieval tricks, ~300 lines of Python, and a bunch of trial and error. 
It&apos;s imperfect but it&apos;s quick to implement, simple to operate, and it works with many classes of transcription errors with plenty of avenues for improvement.</p><ul><li><strong>Code: </strong>The codebase is <a href="https://github.com/voberoi/voice-search-with-whisper-duckdb-and-metaphone?ref=vikramoberoi.com">on Github</a>.</li><li><strong>Demo: </strong>You can <a href="https://voice-search-with-whisper-duckdb-and-metaphone.streamlit.app/?ref=vikramoberoi.com">give it a try on Streamlit Cloud</a>.</li><li><strong>Is this available? What happened next? Can you make this for me? </strong>I answer these questions <a href="#appendix">in the Appendix</a>.</li></ul><p>Here&apos;s how it works!</p><h2 id="a-summary-of-my-approach">A summary of my approach</h2><p>I use <a href="https://github.com/openai/whisper?ref=vikramoberoi.com">OpenAI&apos;s Whisper</a> for audio transcription and <a href="https://duckdb.org/?ref=vikramoberoi.com">DuckDB</a> to store and query my search index.</p><p><a href="https://duckdb.org/docs/extensions/full_text_search?ref=vikramoberoi.com">DuckDB&apos;s full text search (FTS) extension</a> easily handles the happy path: accurate transcriptions. The FTS extension is easy to use, but it does not handle misspellings or, in my case, poorly-transcribed audio.</p><p>OpenAI&apos;s Whisper model is impressive, but:</p><ul><li>There is background noise in restaurants.</li><li>Non-native speakers pronounce French and Italian wines poorly.</li><li>Both of these issues affect audio clarity and, in turn, transcription quality.</li><li>Even without these issues, Whisper is not perfect.</li></ul><p>So my implementation needs to handle inaccurate transcriptions. </p><p>While there&apos;s plenty of prior art on handling misspellings in search, the same techniques don&apos;t work as well with bad audio transcriptions. This is because inaccurate transcriptions don&apos;t look like misspelled words. 
They often look like gibberish &#x2013; made up words, or strings of words that sound like the audio but don&apos;t make sense together.</p><p>So I built two of my own search indexes in DuckDB to handle &quot;mistranscriptions&quot; using <a href="https://en.wikipedia.org/wiki/Metaphone?ref=vikramoberoi.com">Metaphone</a>, a phonetic algorithm that maps similar-sounding words to the same pronunciation.</p><p>In total, I have 3 indexes: 1 full text search index on wine names, and 2 that use Metaphones, which I&apos;ll cover in greater detail below</p><p>At query time I do the following:</p><ol><li><strong>Query the indexes in 5 different ways.</strong> Each method excels at different kinds of queries or transcription errors. I get 10 results from each method, increasing the likelihood that the wine the user is looking for is in my result set.</li><li><strong>Rank all those results using text similarity. </strong>I get the Jaro-Winkler similarity score between each result and the transcription. This tends to bubble the result up to the top 5.</li><li><strong>Return the top 5 ranked results. </strong>I&apos;d say 80%+ of the time the result I was looking for ended up in the top 5, and it fared well with &quot;adversarial&quot; audio (comically bad accents, yelling in the background, etc.)</li></ol><h2 id="what-do-inaccurate-transcriptions-look-like">What do inaccurate transcriptions look like?</h2><p>Inaccurate transcriptions are usually close to how the words actually sound.</p><p>Let&apos;s say we have a wine in our inventory called Chateau Champignon.</p><p>Here are some potential transcriptions of &quot;chateau&quot;:</p><ul><li>shadow</li><li>shatto</li><li>chat oh</li></ul><p>Here are some potential transcriptions of &quot;champignon&quot;:</p><ul><li>champagne on</li><li>shomp inyon</li><li>champ onion</li><li>sham pig non</li></ul><p>We may end up with any permutation of transcriptions for &quot;chateau&quot; and &quot;champignon&quot;. 
For example:</p><ul><li>shadow champagne on</li><li>chat oh shomp inyon</li><li>shatto sham pig non</li></ul><p>Traditional search techniques to handle misspellings fail spectacularly on these kinds of errors. None of them look like misspellings or typos.</p><p>But they do sound the same, and this is where phonetic algorithms come in.</p><h2 id="you-can-map-inaccurate-transcriptions-to-the-same-string-using-phonetic-algorithms">You can map inaccurate transcriptions to the same string using phonetic algorithms</h2><p>Phonetic algorithms map words to pronunciations.</p><p>Different phonetic algorithms have different goals in mind, but there is a whole class of them that is designed to map similar-sounding words to the same pronunciation.</p><p>The earliest one of these is called Soundex, which was developed in the late 1800&apos;s and patented in 1920, well before the advent of computing!</p><p>Why? To map similar-sounding surnames to each other for the US Census. The history here is really neat, and you can read <a href="https://medium.com/@lukehenryotwell/the-soundex-algorithm-d39c2f8d8756?ref=vikramoberoi.com">this post by Luke Otwell</a> to learn more if you&apos;re interested.</p><p>In my implementation I used Metaphone, a successor to Soundex, for which there is an open-source Python implementation in <code>pyphonetics</code>.</p><p>Metaphone maps <code>chateau champignon</code> to <strong><code>XTXMPNN</code>.</strong></p><p>Here&apos;s how Metaphone maps our transcriptions of Chateau Champignon above:</p><p><strong>chateau</strong></p><ul><li><code>shadow</code>: <strong><code>XT</code></strong></li><li><code>chat oh</code>: <strong><code>XT</code></strong></li><li><code>shatto</code>: <strong><code>XT</code></strong></li></ul><p><strong>champignon</strong></p><ul><li><code>champagne on</code>: <strong><code>XMPNN</code></strong></li><li><code>shomp inyon</code>: <strong><code>XMPNYN</code></strong></li><li><code>champ onion</code>: 
<strong><code>XMPNN</code></strong></li><li><code>sham pig non</code>: <strong><code>XMPNN</code></strong></li></ul><p><strong>various permutations of the above</strong></p><ul><li><code>shadow champagne on</code>: <strong><code>XTXMPNN</code></strong></li><li><code>chat oh shomp inyon</code>: <strong><code>XTXMPNYN</code></strong></li><li><code>shatto sham pig non</code>: <strong><code>XTXMPNN</code></strong></li></ul><p>Almost all of these map directly to the same Metaphone as Chateau Champignon!</p><p>To put this insight to work, our code creates two additional indexes: the Exact Metaphone Index and the Metaphone Token Index. <a href="https://github.com/voberoi/voice-search-with-whisper-duckdb-and-metaphone/blob/main/index.py?ref=vikramoberoi.com">These indexes are constructed in <code>index.py</code>.</a></p><h2 id="the-two-metaphone-based-indexes">The two Metaphone-based indexes</h2><p>Assume we have three wines in our inventory:</p><ol><li>Chateau Champignon</li><li>Cavalleri Collezione Esclusiva Chardonnay</li><li>Chateau la Bridane Saint Julien</li></ol><p>Here&apos;s how we construct our two Metaphone indexes on those wines.</p><h3 id="1-the-exact-metaphone-index">1. The Exact Metaphone Index</h3><p>In this index we map the Metaphone for each wine&apos;s full name to itself:</p><ul><li><strong><code>XTXMPNN</code> </strong>&#x2192; <code>Chateau Champignon</code></li><li><strong><code>KFLRKLSNSKLSFXRTN</code> </strong>&#x2192; <code>Cavalleri Collezione Esclusiva Chardonnay</code></li><li><strong><code>XTLBRTNSNTJLN</code> </strong>&#x2192; <code>Chateau la Bridane Saint Julien</code></li></ul><p>This index is just a 1-to-1 mapping. It allows us to reverse look up a metaphone to a wine&apos;s full name as it exists in our index.</p><h3 id="2-the-metaphone-token-index">2. 
The Metaphone Token Index</h3><p>In this index we tokenize each wine name, get the Metaphone for each token, and construct an index on those tokens.</p><p>Here&apos;s what the first step looks like:</p><ol><li><code>Chateau Champignon</code> &#x2192; <strong><code>XT XMPNN</code></strong></li><li><code>Cavalleri Collezione Esclusiva Chardonnay</code> &#x2192; <strong><code>KFLR KLSN ESKLSF XRTN</code></strong></li><li><code>Chateau la Bridane Saint Julien</code> &#x2192; <strong><code>XT L BRTN SNT JLN</code></strong></li></ol><p>Let&apos;s call these &quot;Metaphone strings&quot;. Each token in the string is the Metaphone for a token in the original wine name. These strings are what we index.</p><p>Assuming that 1, 2, 3 are IDs that map to the wines in the list above, the Metaphone Token Index looks like this:</p><ul><li><code><strong>XT</strong></code> &#x2192; <code>[1, 3]</code></li><li><code><strong>XMPNN</strong></code> &#x2192; <code>[1]</code></li><li><code><strong>KFLR</strong></code> &#x2192; <code>[2]</code></li><li><code><strong>KLSN</strong></code> &#x2192; <code>[2]</code></li><li>... and so on.</li></ul><p>Conveniently, this is just a regular full text search index on the Metaphone strings above. So we can use DuckDB&apos;s full-text search extension again, saving us the work of writing code to construct or query the index.</p><p>This index allows us to reverse look up Metaphones for individual words in our transcript.</p><p>For example, a speaker might say only &quot;champignon&quot;, for which we get an inaccurate transcription like &quot;shampinon&quot;.
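In plain Python, the two indexes reduce to a pair of dictionaries. This sketch hard-codes the per-token and full-name Metaphone codes published in this post (the real code computes them with pyphonetics and stores everything in DuckDB), and `metaphone_token_query` is a hypothetical helper that ranks by the count of matching token Metaphones rather than BM25:

```python
from collections import Counter

# Per-token Metaphone codes as published in this post, plus the two
# mistranscriptions from the worked example later on ("julian", "breedan").
METAPHONE = {
    "chateau": "XT", "champignon": "XMPNN",
    "cavalleri": "KFLR", "collezione": "KLSN",
    "esclusiva": "ESKLSF", "chardonnay": "XRTN",
    "la": "L", "bridane": "BRTN", "saint": "SNT", "julien": "JLN",
    "julian": "JLN", "breedan": "BRTN",
}

WINES = {
    1: "Chateau Champignon",
    2: "Cavalleri Collezione Esclusiva Chardonnay",
    3: "Chateau la Bridane Saint Julien",
}

# Full-name Metaphones as listed in the Exact Metaphone Index section above.
FULL_METAPHONES = {1: "XTXMPNN", 2: "KFLRKLSNSKLSFXRTN", 3: "XTLBRTNSNTJLN"}

# 1. Exact Metaphone Index: full-name Metaphone -> wine name.
exact_index = {FULL_METAPHONES[i]: name for i, name in WINES.items()}

# 2. Metaphone Token Index: per-token Metaphone -> set of wine IDs
#    (an ordinary inverted index).
token_index = {}
for wine_id, name in WINES.items():
    for token in name.lower().split():
        token_index.setdefault(METAPHONE[token], set()).add(wine_id)

def metaphone_token_query(transcription):
    """Rank wines by how many of the transcription's token Metaphones match."""
    hits = Counter()
    for token in transcription.lower().split():
        for wine_id in token_index.get(METAPHONE.get(token, ""), ()):
            hits[wine_id] += 1
    return [wine_id for wine_id, _ in hits.most_common()]
```

With this, `metaphone_token_query("julian breedan")` returns `[3]` (Chateau la Bridane Saint Julien) even though neither transcribed word is spelled like anything in the wine's name.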
Or a speaker might say 3 words in a wine name, but two of them are transcribed entirely incorrectly while the third has a Metaphone match.</p><p>This index gives us the means to salvage an inaccurate transcript as long as there&apos;s a Metaphone match in it, somewhere.</p><h2 id="five-query-time-approaches">Five query-time approaches</h2><p>Combined, the following approaches are greater than the sum of their parts. Individually, they excel at dealing with different kinds of queries and transcription errors.</p><p><a href="https://github.com/voberoi/voice-search-with-whisper-duckdb-and-metaphone/blob/main/search.py?ref=vikramoberoi.com">Each of the query methods is implemented in <code>search.py</code></a>.</p><p>Let&apos;s add one more wine to our index for the following explanations. There are now 4 wines in our inventory:</p><ol><li>Chateau Champignon</li><li>Cavalleri Collezione Esclusiva Chardonnay</li><li>Chateau la Bridane Saint Julien</li><li>Guy Breton Morgon</li></ol><p>Here are the five query approaches we use.</p><h3 id="1-duckdb-full-text-query">1. DuckDB Full-Text Query</h3><p>This was the first one I implemented, and it uses DuckDB&apos;s full-text search extension.</p><p>Under the hood, DuckDB&apos;s FTS extension implements BM-25, a widely-used ranking function.</p><p>It constructs the index with some basic defaults for English: a list of 571 stopwords, the Porter stemmer, lowercasing all text, ignoring all non-alphabetic lowercase characters, and stripping accents.</p><p><strong>Pros</strong></p><ul><li>BM-25&apos;s ranking algorithm performs well: a query for &quot;chateau champignon&quot; will rank Chateau Champignon first, followed by the other hundred that include &quot;chateau&quot;.</li><li>It works well as long as the most unique words of the wine name are transcribed accurately. 
For example, &quot;shadow champignon&quot; will likely return &quot;Chateau Champignon&quot;, but &quot;chateau shampinyon&quot; is a crapshoot because there are hundreds of wines with &quot;chateau&quot; in their name.</li></ul><p><strong>Cons</strong></p><ul><li>It fails spectacularly on inaccurate transcriptions, usually returning zero or irrelevant results.</li></ul><h3 id="2-exact-metaphone-query">2. Exact Metaphone Query</h3><p>Here, we do a reverse lookup for the transcription&apos;s Metaphone in the Exact Metaphone Index I describe above:</p><ol><li><strong>Get transcription metaphone: </strong>&quot;shadow champagne on&quot; yields <strong><code>XTXMPNN</code></strong></li><li><strong>Reverse-lookup: </strong> <strong><code>XTXMPNN</code> </strong>yields the record for Chateau Champignon.</li></ol><p><strong>Pros</strong></p><ul><li>Works with inaccurate transcriptions.</li><li>High <a href="https://en.wikipedia.org/wiki/Precision_and_recall?ref=vikramoberoi.com"><em>precision</em></a>: it tends to match the exact wine the user is searching for, and nothing else.</li></ul><p><strong>Cons</strong></p><ul><li>The user must say the entire wine as it exists in our index. If they just say &quot;champignon&quot; instead of &quot;chateau champignon&quot;, we are out of luck.</li><li>Fails when a transcription&apos;s Metaphone does not match. An example is the Metaphone for &quot;shadow shomp inyon&quot;: <strong><code>XTXMPNYN</code></strong><em>, </em>which has a &quot;Y&quot;.</li></ul><h3 id="3-metaphone-token-query">3.  Metaphone Token Query</h3><p>Here, we first form our query. Let&apos;s say the speaker is retrieving <em>Chateau la Bridane Saint Julien.</em></p><p>They say only &quot;julien bridane&quot;, but they incorrectly pronounce each word. 
It is transcribed as <code>&quot;julian breedan&quot;</code>.</p><ol><li><strong>Tokenize the transcript: </strong><code>&quot;julian breedan&quot;</code> becomes <code>[&quot;julian&quot;, &quot;breedan&quot;]</code></li><li><strong>Get the Metaphone for each token: </strong><code>[&quot;julian&quot;, &quot;breedan&quot;]</code> becomes <code>[&quot;<strong>JLN</strong>&quot;, &quot;<strong>BRTN</strong>&quot;]</code></li><li><strong>Join the Metaphones: </strong><code>[&quot;<strong>JLN</strong>&quot;, &quot;<strong>BRTN</strong>&quot;]</code> becomes <code>&quot;<strong>JLN</strong> <strong>BRTN</strong>&quot;</code></li></ol><p>Then we query the Metaphone Token Index <a href="#2-the-metaphone-token-index">described above</a> with our query.</p><p>Because our query contains <code>&quot;<strong>JLN</strong>&quot;</code> and <code>&quot;<strong>BRTN</strong>&quot;</code>, the wine <em>Chateau la Bridane Saint Julien </em>will rank highly. Its tokens map to: <code>[&quot;<strong>XT</strong>&quot;, &quot;<strong>L</strong>&quot;, &quot;<strong><em>BRTN</em></strong>&quot;, &quot;<strong>SNT</strong>&quot;, &quot;<strong><em>JLN</em></strong>&quot;]</code>.</p><p><em>Guy Breton Morgon </em>will also rank. Its tokens map to <code>[&quot;<strong>K</strong>&quot;, &quot;<strong><em>BRTN</em></strong>&quot;, &quot;<strong>MRKN</strong>&quot;]</code>.</p><p>Recall that we built this index using DuckDB&apos;s full-text search extension, so we query it with the extension too.</p><p><strong>Pros</strong></p><ul><li>Works with inaccurate transcriptions.</li><li>Works when speakers partially say the name of a wine.</li><li>Works when the order of the words spoken does not match the wine&apos;s record.</li></ul><p><strong>Cons</strong></p><ul><li>This approach sometimes yields many irrelevant results that have similar-sounding words in their names.</li><li>Fails when a transcription&apos;s Metaphones do not match.</li></ul><h3 id="4-similar-token-metaphones-query">4.
&quot;Similar Token Metaphones&quot; Query</h3><p>Methods 1 to 3 handle:</p><ul><li>Accurate transcriptions</li><li>Inaccurate transcriptions with matching Metaphones</li></ul><p>This takes us far! But I would still get transcriptions without any matching Metaphones and I wanted to handle those.</p><p>I attempted a number of approaches, all using <a href="https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0?ref=vikramoberoi.com">Levenshtein edit distance</a> in creative ways on progressively worse transcriptions. <strong><em>All of them worked.</em></strong></p><p>The catch? They also introduced a lot of noise: the worse a transcription, the more permissive I had to be, which meant that my result set would also include lots of wines that looked and sounded nothing like the wine I was retrieving.</p><p>In information retrieval parlance, my search <a href="https://en.wikipedia.org/wiki/Precision_and_recall?ref=vikramoberoi.com"><em>precision</em></a><em> </em>suffered: I was retrieving an ever-higher percentage of irrelevant records.</p><p>I had a prototype to ship, so I settled on a basic method that:</p><ol><li>Didn&apos;t return a torrent of random records.</li><li>Could handle one extra, absent, or incorrect consonant that caused a token to generate a different Metaphone. For example, &quot;bribane&quot; maps to <em>BRBN</em>, while &quot;bridane&quot; &#x2013; a token in our index &#x2013; maps to <em>BRTN.</em></li></ol><p>Here&apos;s how it works. 
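</p><p>The &quot;one edit distance away&quot; test this method relies on is ordinary Levenshtein distance. As a rough illustration &#x2013; this is a from-scratch sketch, not the prototype&apos;s actual code &#x2013; the token-matching step looks like this:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

def tokens_one_edit_away(token: str, index_tokens: list[str]) -> list[str]:
    # Keep only index tokens exactly one edit from the transcribed token,
    # e.g. "brenton" matches "breton" (one deletion), but "sane" matches nothing.
    return [t for t in index_tokens if levenshtein(token, t) == 1]
```

<p>This is also why &quot;bribane&quot; can still reach &quot;bridane&quot;: the two differ by a single substitution.</p><p>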
Let&apos;s say the speaker is retrieving <em>Chateau la Bridane Saint Julien </em>(Metaphones: <strong><code>XT L BRTN SNT JLN</code></strong><em>).</em></p><p>They say &quot;bridane saint julien&quot;, but we get a really bad transcription: <code>&quot;brenton sane julia&quot;</code>, whose tokens map to <code>[&quot;<strong>BRNTN</strong>&quot;, &quot;<strong>SN</strong>&quot;, &quot;<strong>JL</strong>&quot;]</code>.</p><p>We&apos;re out of luck here with methods 1 to 3. Each will return zero results because we have no matching tokens nor matching Metaphones.</p><p>A &quot;Similar Token Metaphone&quot; query does the following:</p><ol><li><strong>Tokenize the transcription: </strong><code>&quot;brenton sane julia&quot;</code> turns into <code>[&quot;brenton&quot;, &quot;sane&quot;, &quot;julia&quot;]</code>.</li><li><strong>Get all tokens in our index that are exactly 1 edit distance away from each token in our query: </strong>for <code>&quot;brenton&quot;</code> this is <code>[&quot;breton&quot;]</code>, for <code>&quot;sane&quot;</code> this is <code>[]</code> (an empty set), for <code>&quot;julia&quot;</code> this is <code>[]</code> (an empty set). So we get <code>[&quot;breton&quot;]</code>.</li><li><strong>Join all tokens from step 2: </strong><code>[&quot;breton&quot;]</code> becomes <code>&quot;breton&quot;</code>. If we had more than one token, we&apos;d join all of them with a space.</li><li><strong>Use the string from step 3 to run a Metaphone Token Query. 
</strong>Our input to the query method <a href="#3-metaphone-token-query">described above</a> is <code>&quot;breton&quot;</code>.</li></ol><p>Because the Metaphone for <code>&quot;breton&quot;</code> is <strong><code>BRTN</code></strong>, we get two records, one of which is our target wine:</p><ul><li><em>Chateau la Bridane Saint Julien </em>(Metaphones: <strong><code>XT L BRTN SNT JLN</code></strong><em>)</em></li><li><em>Guy Breton Morgon </em>(Metaphones: <strong><code>K BRTN MRKN</code></strong><em>)</em></li></ul><p>Using a Metaphone Token Query in Step 4 instead of a DuckDB Full-Text Query casts a slightly wider net, which increases the likelihood that we catch the result we want when all our other query methods have failed.</p><p>In my limited observation, it worked spookily well without returning lots of irrelevant results. It is a total delight to speak, see an awful transcription like &quot;brenton sane julia&quot; come back, and still have it retrieve the record you&apos;re looking for.</p><p><strong>Pros</strong></p><ul><li>Might succeed even when there are no matching Metaphones in a transcription.</li></ul><p><strong>Cons</strong></p><ul><li>Might return a lot of irrelevant results.</li></ul><h3 id="5-metaphone-substring-query">5. &quot;Metaphone Substring&quot; Query</h3><p>I added this final query approach to handle exactly one class of transcription error: transcriptions that break one word into multiple words. 
This happened fairly frequently.</p><p>Let&apos;s say the speaker is retrieving <em>Chateau Champignon</em>: they say &quot;champignon&quot;, and the transcription we receive is <code>&quot;champ onion&quot;</code>.</p><p>Query approaches 1 through 4 all fail:</p><ol><li>There is no wine that matches the tokens <code>&quot;champ&quot;</code> or <code>&quot;onion&quot;</code>.</li><li>The Exact Metaphone for <code>&quot;champ onion&quot;</code> is <strong><code>XMPNN</code></strong>, for which there is no match in the Exact Metaphone Index.</li><li>The Metaphone Tokens for <code>&quot;champ onion&quot;</code> are <strong><code>XMP</code></strong> and <strong><code>ONN</code></strong>, for which there are no matches in the Metaphone Token Index.</li><li>There are no tokens 1 edit distance away from <code>&quot;champ&quot;</code> or <code>&quot;onion&quot;</code>.</li></ol><p>It turns out that many of these cases have Exact Metaphones that are substrings of the target wine&apos;s Exact Metaphone. In this case:</p><ul><li>The Exact Metaphone for <code>&quot;champ onion&quot;</code> is <strong><code>XMPNN</code></strong>.</li><li>The Exact Metaphone for <code>&quot;chateau champignon&quot;</code><em> </em>is <strong><code>XT<em>XMPNN</em></code><em>.</em></strong></li></ul><p>... 
so a substring search in our Exact Metaphone Index succeeds where all our other query methods have failed.</p><p><strong>Pros</strong></p><ul><li>It&apos;s simple.</li><li>Handles a transcription error that happens frequently.</li><li>Doesn&apos;t return a lot of irrelevant results.</li></ul><p><strong>Cons</strong></p><ul><li>Doesn&apos;t help with anything other than this one class of error.</li><li>Doesn&apos;t always work: &quot;shomp inyon&quot; maps to &quot;XMPNYN&quot;, which is not a substring of &quot;XTXMPNN&quot;.</li></ul><h2 id="ranking-results-from-all-five-approaches-using-jaro-winkler-similarity">Ranking results from all five approaches using Jaro-Winkler similarity</h2><p>When we execute all five of our query approaches:</p><ul><li>We don&apos;t know which approach, if any, has the result the speaker is looking for.</li><li>Each approach might return multiple results.</li></ul><p>So how do we surface the target wine from all these results?</p><p>It&apos;s safe to assume that the transcript we receive is similar, by some measure, to the speaker&apos;s target wine. If it&apos;s not, then the transcript is probably too inaccurate &#x2013; would a human be able to figure out which wine the sommelier is referring to?</p><p>The challenge here is determining an objective measure that consistently brings the target wine into our top 5 results. 
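</p><p>One off-the-shelf candidate for such a measure is Jaro-Winkler similarity, which scores two strings by their shared characters and rewards a common prefix. DuckDB ships it as a built-in text-similarity function; the pure-Python sketch below is just for intuition, not the prototype&apos;s code:</p>

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matched characters, penalizing transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1  # how far apart matched chars may sit
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear out of order.
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler's variant: boost the score for a shared prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

<p>Scores range from 0 to 1; the classic pair &quot;martha&quot;/&quot;marhta&quot; scores about 0.96.</p><p>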
Doing this well requires data from real users and time to experiment, neither of which we had.</p><p>Aiming for a quick solution, I tried text similarity with <a href="https://duckdb.org/docs/sql/functions/char?ref=vikramoberoi.com#text-similarity-functions">whatever was available off-the-shelf in DuckDB</a>.</p><p>It turns out that Jaro-Winkler similarity works well enough for our purposes.</p><p>My implementation:</p><ol><li>Gets up to 10 results from every query approach.</li><li>De-duplicates the set of results, because the same result might appear from multiple approaches.</li><li>Calculates Jaro-Winkler similarity between each result and the transcription.</li><li>Returns the top 5 ranked results.</li></ol><p>This is a workable implementation, but if you play around with the demo you&apos;ll observe it fail in basic ways, like ranking the target wine 5th when it seems implausible that the 1st, 2nd, 3rd, or 4th results are correct.</p><p>The problem with using text similarity is that we&apos;re discarding everything we know about our query approaches and the kinds of results they return. For example, if we get results from a DuckDB full-text search, it means the transcript contains correctly spelled tokens in our search index, which increases the probability that we have an accurate transcript. That&apos;s information we could use to rank our search results!</p><h2 id="how-can-this-be-improved">How can this be improved?</h2><p>With lots of experimentation! 
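</p><p>One cheap experiment in that direction: score candidates by <em>which</em> query approaches surfaced them rather than by text similarity alone. The sketch below is hypothetical &#x2013; the method names and weights are invented for illustration, not taken from the prototype:</p>

```python
# Hypothetical weights: higher-precision query methods earn more trust.
METHOD_WEIGHTS = {
    "full_text": 1.0,
    "exact_metaphone": 0.9,
    "metaphone_tokens": 0.6,
    "similar_token_metaphones": 0.4,
    "metaphone_substring": 0.4,
}

def rank(results_by_method: dict[str, list[str]], top_n: int = 5) -> list[str]:
    """Score each candidate wine by summing the weights of every query
    method that returned it, so agreement across methods boosts a result."""
    scores: dict[str, float] = {}
    for method, wines in results_by_method.items():
        for wine in wines:
            scores[wine] = scores.get(wine, 0.0) + METHOD_WEIGHTS.get(method, 0.1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

<p>A wine found by both the full-text and exact-Metaphone queries would outrank one found only by a fuzzy method &#x2013; exactly the kind of signal plain text similarity throws away.</p><p>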
It helps enormously to have data from real users searching for wines.</p><p>In a real setting, I&apos;d want to:</p><ul><li>Know what percentage of searches succeeds over time.</li><li>Have a set of (transcript, target wine) pairs that I can test against.</li><li>Run A/B tests pitting search algorithms against each other.</li></ul><p>Alas, <a href="#appendix">we did not promote this prototype to production</a>.</p><p>If I spent more time on this voice search implementation, here are some of the things I might try.</p><p><strong>Ranking results: scores, weights, and phonetic algorithms</strong></p><ul><li>Weight results from query approaches differently.</li><li>Come up with a way to score results phonetically.</li><li>Give higher scores to results if they surface in multiple approaches.</li><li>Use different, or multiple, phonetic algorithms for the task.</li></ul><p><strong>Using LLMs</strong></p><ul><li>Use an LLM to surface the likeliest wines. I tried this briefly and couldn&apos;t get it to work well, but it has promise.</li><li>Use an LLM to generate plausible mis-transcriptions of every wine. (GPT-4 generates great, plausible mis-transcriptions.) Index all of these.</li><li>Use the mis-transcriptions from the above to fine-tune an LLM to do this task.</li></ul><h2 id="appendix">Appendix</h2><h3 id="is-this-product-live-if-not-will-it-be">Is this product live? If not, will it be?</h3><p>No, and probably not. 
Everyone hates inventory counts, but this was not a pressing enough problem for the GM at Ristorante or the 10 wine stores, meat/cheese shops, and restaurants I spoke with.</p><p>In the weeks after we built our initial prototype, we did not gather enough evidence to continue working on this product.</p><h3 id="can-you-build-this-or-something-else-for-us">Can you build this, or something else, for us?</h3><p>Yes.</p><p>I co-own <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>, a product development and contracting company, with Gabriel Garrido. We&apos;re a two-person team of seasoned engineers and we&apos;ve built software and teams at numerous early-stage startups.</p>
<!--kg-card-begin: html-->
<span style="font-size: 0.9em; font-style: italic">Thanks to Gabriel Garrido, Phil Eaton, Pierre Jambet, Andy O&apos;Neill, Chuck Groom, and Aaron Rosen.</span>
<!--kg-card-end: html-->
<hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[Funnel analysis in SQL using window functions, range frames, and regular expressions]]></title><description><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/funnel-analysis-in-sql-using-window-functions-range-frames-and-regular-expressions/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Funnel analysis in SQL using window functions, range frames, and regular expressions</div><div class="kg-bookmark-description">Last week we released Baxter&#x2019;s free funnel analysis SQL generator supporting nine different SQL</div></div></a></figure>]]></description><link>https://www.vikramoberoi.com/funnel-analysis-in-sql-using-window-functions-range-frames-and-regular-expressions/</link><guid isPermaLink="false">63f67ed8dae44f004d0c64fb</guid><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Wed, 22 Feb 2023 20:46:02 GMT</pubDate><content:encoded><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/funnel-analysis-in-sql-using-window-functions-range-frames-and-regular-expressions/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Funnel analysis in SQL using window functions, range frames, and regular expressions</div><div class="kg-bookmark-description">Last week we released Baxter&#x2019;s free funnel analysis SQL generator supporting nine different SQL dialects. 
You can read about how the tool works and the assumptions we make about your data in this introductory blog post. We encourage you to copy/paste the queries our tool generates for your use</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://baxterhq.com/blog/content/images/size/w256h256/format/jpeg/2023/01/profile-picture.jpg" alt><span class="kg-bookmark-author">Baxter Blog</span><span class="kg-bookmark-publisher">Vikram Oberoi</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://baxterhq.com/blog/content/images/2023/02/sql-witch-sq.png" alt></div></a></figure>]]></content:encoded></item><item><title><![CDATA[Baxter’s free funnel analysis SQL generation tool is out!]]></title><description><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/baxters-free-funnel-analysis-sql-generation-tool-is-out/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Baxter&#x2019;s free funnel analysis SQL generation tool is out!</div><div class="kg-bookmark-description">Today we&#x2019;re releasing a free tool that generates multi-step funnel analysis queries in nine SQL</div></div></a></figure>]]></description><link>https://www.vikramoberoi.com/baxters-free-funnel-analysis-sql-generation-tool-is-out/</link><guid isPermaLink="false">63f67e11dae44f004d0c64e5</guid><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Mon, 30 Jan 2023 20:44:00 GMT</pubDate><content:encoded><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. 
Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/baxters-free-funnel-analysis-sql-generation-tool-is-out/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Baxter&#x2019;s free funnel analysis SQL generation tool is out!</div><div class="kg-bookmark-description">Today we&#x2019;re releasing a free tool that generates multi-step funnel analysis queries in nine SQL dialects: BigQuery, Snowflake, Redshift, Spark, ClickHouse, DuckDB, PostgreSQL, MySQL, and SQLite.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://baxterhq.com/blog/content/images/size/w256h256/format/jpeg/2023/01/profile-picture.jpg" alt><span class="kg-bookmark-author">Baxter Blog</span><span class="kg-bookmark-publisher">Vikram Oberoi</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://baxterhq.com/blog/content/images/2023/01/v-funnel-crank-2.png" alt></div></a></figure>]]></content:encoded></item><item><title><![CDATA[Introducing Baxter: deep-dive event analytics on your data warehouse]]></title><description><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/introducing-baxter-deep-dive-event-analytics-on-your-data-warehouse/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Introducing Baxter: deep-dive event analytics on your data warehouse</div><div class="kg-bookmark-description">Analysts are underserved by existing event analytics tools. 
We&#x2019;re building an alternative.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://baxterhq.com/blog/content/images/size/w256h256/format/jpeg/2023/01/profile-picture.jpg" alt><span class="kg-bookmark-author">Baxter Blog</span><span class="kg-bookmark-publisher">Vikram Oberoi</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://baxterhq.com/blog/content/images/2023/01/demo-poster-6.jpg" alt></div></a></figure>]]></description><link>https://www.vikramoberoi.com/introducing-baxter-deep-dive-event-analytics-on-your-data-warehouse/</link><guid isPermaLink="false">63c70ca97cf2ea003daa5693</guid><category><![CDATA[baxter]]></category><category><![CDATA[funnels]]></category><category><![CDATA[cohorts]]></category><category><![CDATA[product-analytics]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Tue, 17 Jan 2023 21:04:58 GMT</pubDate><content:encoded><![CDATA[<p>This is an external post I wrote for my company, <a href="https://baxterhq.com/?ref=vikramoberoi.com">Baxter</a>. Click on the link to read it on the Baxter blog.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://baxterhq.com/blog/introducing-baxter-deep-dive-event-analytics-on-your-data-warehouse/?ref=vikramoberoi.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Introducing Baxter: deep-dive event analytics on your data warehouse</div><div class="kg-bookmark-description">Analysts are underserved by existing event analytics tools. 
We&#x2019;re building an alternative.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://baxterhq.com/blog/content/images/size/w256h256/format/jpeg/2023/01/profile-picture.jpg" alt><span class="kg-bookmark-author">Baxter Blog</span><span class="kg-bookmark-publisher">Vikram Oberoi</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://baxterhq.com/blog/content/images/2023/01/demo-poster-6.jpg" alt></div></a></figure>]]></content:encoded></item><item><title><![CDATA[I don’t like any of the funnel or cohort analysis tools available to me]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/12/IMG_3505.JPG" class="kg-image" alt="Three funnels upside down decorated with googly eyes, pom-poms, and pipe cleaners to look like people. Behind them is a lunch box robot and a small diorama in a crate with a bunch of small ceramic friars and a disco ball." loading="lazy" width="2000" height="1500" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/12/IMG_3505.JPG 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/12/IMG_3505.JPG 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/12/IMG_3505.JPG 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2022/12/IMG_3505.JPG 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">I do like these silly funnels, though.=</span></figcaption></figure><p>Ten years ago, while starting the Data Engineering team at <a href="https://www.harrys.com/?ref=vikramoberoi.com">Harry&#x2019;s</a>, I wrote a suite of cohort analysis queries that looked more-or-less like <a href="https://www.lennysnewsletter.com/i/70174617/getting-retention-via-sql?ref=vikramoberoi.com">the queries in this post.</a></p><p>The success of the razor and blades model hinges on high customer 
retention</p>]]></description><link>https://www.vikramoberoi.com/i-dont-like-any-of-the-funnel-or-cohort-analysis-tools-available-to-me/</link><guid isPermaLink="false">639a69251e03c5003dc7ea04</guid><category><![CDATA[product-analytics]]></category><category><![CDATA[funnels]]></category><category><![CDATA[cohorts]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Tue, 27 Dec 2022 16:25:34 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.41.52-1.png" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/12/IMG_3505.JPG" class="kg-image" alt="I don&#x2019;t like any of the funnel or cohort analysis tools available to me" loading="lazy" width="2000" height="1500" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/12/IMG_3505.JPG 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/12/IMG_3505.JPG 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/12/IMG_3505.JPG 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2022/12/IMG_3505.JPG 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">I do like these silly funnels, though.=</span></figcaption></figure><img src="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.41.52-1.png" alt="I don&#x2019;t like any of the funnel or cohort analysis tools available to me"><p>Ten years ago, while starting the Data Engineering team at <a href="https://www.harrys.com/?ref=vikramoberoi.com">Harry&#x2019;s</a>, I wrote a suite of cohort analysis queries that looked more-or-less like <a href="https://www.lennysnewsletter.com/i/70174617/getting-retention-via-sql?ref=vikramoberoi.com">the queries in this post.</a></p><p>The success of the razor and blades model hinges on high customer retention so we tracked it religiously from day one. 
Retention was so critical at Harry&#x2019;s that I even wrote a haiku about it once.*</p><p>I wrote 15-20 permutations of cohort queries that we would analyze regularly. At Harry&#x2019;s we cut our cohorts in a bunch of ways: the channel customers came from, the contents of their first order, age/gender, geography, etc. We analyzed account creation -&gt; 1st order, 1st order -&gt; 2nd order, 2nd order -&gt; 3rd order separately.</p><p>There were <em>a lot</em> of charts. We wanted to compare them and see how they changed week-over-week and month-over-month.</p><p>I could have sent our data to <a href="https://www.mixpanel.com/?ref=vikramoberoi.com">Mixpanel</a> at the time, but it didn&#x2019;t allow us to explore our data this way. So I hand-rolled a lot of SQL queries and my colleague analyzed the data in Excel. It was cumbersome and time-consuming, but it was a critical analysis for the business.</p><p>This year I was retained as a fractional Head of Product for a long-time client and this summer I took a few days to dive into my client&#x2019;s retention metrics. They have a loyal base of regular users and I wanted to figure out how they differed from folks who churned. So I reached for my trusty friend: the cohort analysis!</p><p>Imagine my surprise when my options were no different than they were a decade ago:</p><ol><li>I can send my data to tool like <a href="https://www.amplitude.com/?ref=vikramoberoi.com">Amplitude</a>, <a href="https://www.mixpanel.com/?ref=vikramoberoi.com">Mixpanel</a>, or <a href="https://www.heapanalytics.com/?ref=vikramoberoi.com">Heap</a> and use its tools.</li><li>I can write SQL and visualize data in Excel or a BI tool.</li></ol><p>I didn&#x2019;t like either of these options ten years ago! I still don&#x2019;t.</p><p>While product analytics suites provide useful tools for exploratory analyses, I usually need to drop down to SQL to answer common follow-up questions. 
And ensuring accuracy is challenging: the data that ends up in these tools is frequently suspect because event instrumentation is so brittle. Last week I came across Olga Berezovsky&#x2019;s <a href="https://dataanalysis.substack.com/p/growth-loops-and-some-hard-truths?ref=vikramoberoi.com">feedback on Amplitude after its most recent conference</a> and it resonates &#x2014; Amplitude and its competitors don&#x2019;t serve me well when I&#x2019;m doing deeper analyses and I can&#x2019;t rely on their accuracy.</p><p>Meanwhile, I love the flexibility I have when I&#x2019;m writing SQL: I can query my data however I&apos;d like to. I can clean it up, run spot checks, and write data quality tests to make sure my analysis is correct. But SQL queries for funnel and cohort analyses are challenging to write correctly and scale. Exploring funnels and cohorts this way &#x2014; asking follow-up questions like, &#x201C;Why is our March cohort so much more active than our February cohort?&#x201D; &#x2014; is time-consuming and I miss out on the interactivity I get from product analytics tools.</p><p>I want to be able to analyze behavioral data with the affordances that existing product analytics tools provide. But I don&#x2019;t want to be shackled to those tools.</p><p>The tool I want sits in a happy middle:</p><ul><li><strong>It&#x2019;s interactive, like product analytics tools.</strong> I want to select parameters for my funnels and cohorts and get charts. I want to click around to answer follow-up questions.</li><li><strong>I can use the data I have.</strong> I don&#x2019;t want to send the data to a third-party. 
I want to use the data I have in my application database, data warehouse, or on my local machine.</li><li><strong>It&#x2019;s SQL-native and integrates with my workflows.</strong> I want to generate queries for 90+% of the charts I explore so I can use them in my BI tool, data pipeline, or notebook.</li><li><strong>I can go deep without having to write SQL.</strong> I want to be able to answer questions and explore data well beyond the point that product analytics tools fall short.</li></ul><p>I want this to exist! So I&apos;m working on it.</p><p>If you analyze behavioral data and the above resonates, I would be delighted to speak with you. <a href="https://www.vikramoberoi.com/contact">Please say hello!</a></p><p>*You can read about <a href="https://www.sandwichesimade.org/2019/10/24/a-recreation-of-honeyhole-s-el-guapo-sandwich.html?ref=vikramoberoi.com">my pursuit of poetry in the workplace</a> on my secret sandwich blog. (Currently on hiatus.)</p><hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[Using Polars on results from DuckDB's Arrow interface in Rust]]></title><description><![CDATA[<p>I found myself wanting to manipulate/compute a large <a href="https://arrow.apache.org/?ref=vikramoberoi.com">Arrow</a> result set from a <a href="https://duckdb.org/?ref=vikramoberoi.com">DuckDB</a> query in Rust. I first wrote code to iterate over these results and compute what I needed, but the result was a <em>lot</em> of code that ran slowly and was cumbersome to write. I decided</p>]]></description><link>https://www.vikramoberoi.com/using-polars-on-results-from-duckdbs-arrow-interface-in-rust/</link><guid isPermaLink="false">638ba85e1816ae003d33b6dd</guid><category><![CDATA[til]]></category><category><![CDATA[rust]]></category><category><![CDATA[polars]]></category><category><![CDATA[duckdb]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Sun, 04 Dec 2022 21:25:19 GMT</pubDate><content:encoded><![CDATA[<p>I found myself wanting to manipulate/compute a large <a href="https://arrow.apache.org/?ref=vikramoberoi.com">Arrow</a> result set from a <a href="https://duckdb.org/?ref=vikramoberoi.com">DuckDB</a> query in Rust. I first wrote code to iterate over these results and compute what I needed, but the result was a <em>lot</em> of code that ran slowly and was cumbersome to write. I decided to reach for <a href="https://www.pola.rs/?ref=vikramoberoi.com">Polars</a> instead.</p><p>The end result is way less code that is much more performant. I also have the Polars API to work with on any result set from DuckDB, which lets me iterate more quickly. It&apos;s handy!</p><p>That said, this was really painful for me to figure out. There are a lot of documentation gaps in these projects, I still don&apos;t quite understand the Arrow APIs, and I&apos;m new to Rust. 
If you&apos;re looking for a shortcut, I published a <a href="https://github.com/voberoi/duckdb-polars-rust?ref=vikramoberoi.com">Github repo showing how you can glue these APIs together</a>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/12/polar-duck-for-v.png" class="kg-image" alt="A duck giving a polar bear an arrow. The bear is confused." loading="lazy" width="2000" height="1622" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/12/polar-duck-for-v.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/12/polar-duck-for-v.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/12/polar-duck-for-v.png 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2022/12/polar-duck-for-v.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This whole situation is very confusing. Credit: </span><a href="https://www.katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>If you just want to use these APIs, skip the rest of this post and visit the repo. I share more about how these APIs work, my journey to figuring it out, and questions that came up below.</p><p><strong>Querying DuckDB</strong></p><p>I started out using <a href="https://crates.io/crates/duckdb?ref=vikramoberoi.com">duckdb-rs</a>, &quot;... an ergonomic wrapper for using <a href="https://github.com/duckdb/duckdb?ref=vikramoberoi.com" rel="nofollow noopener noreferrer">duckdb</a> from Rust.
It attempts to expose an interface similar to <a href="https://github.com/rusqlite/rusqlite?ref=vikramoberoi.com" rel="nofollow noopener noreferrer">rusqlite</a>.&quot; The API is friendly, but as of this post&apos;s publishing, <code>duckdb-rs</code> <a href="https://github.com/wangfenjin/duckdb-rs/issues/81?ref=vikramoberoi.com" rel="nofollow noopener noreferrer">doesn&apos;t support any nested data types</a>, which I needed (my query results include <a href="https://duckdb.org/docs/sql/data_types/list.html?ref=vikramoberoi.com" rel="nofollow noopener noreferrer">lists</a>). I looked into submitting a PR to support nested data types but my inexperience with Rust quickly brought that to a halt.</p><p>The library code also starts from a fork of <code>rusqlite</code> and I think that makes things a bit challenging. SQLite does not support nested data types as far as I can tell (and I don&apos;t see evidence of any such support in <code>rusqlite</code>). Reading through the <code>duckdb-rs</code> source, it seemed like one would need to shoehorn first-class support for nested data types into the library or redesign a chunk of the library&apos;s core.</p><p>I, uh, am not comfortable enough with Rust to attempt that in any reasonable amount of time.</p><p>Thankfully, the author of <code>duckdb-rs</code> also makes <a href="https://crates.io/crates/libduckdb-sys?ref=vikramoberoi.com">libduckdb_sys</a> available: Rust bindings to <a href="https://duckdb.org/docs/api/c/api?ref=vikramoberoi.com">DuckDB&apos;s C API</a>. And there&apos;s a <a href="https://github.com/wangfenjin/duckdb-rs/blob/main/libduckdb-sys/src/lib.rs?ref=vikramoberoi.com#L51">bunch of example code</a> for how to use it in the <code>duckdb-rs</code> source.</p><p>This involves running a bunch of unsafe functions and dereferencing raw pointers, so you need to wrap a lot of this code in an <code>unsafe</code> block.
Reading results from DuckDB&apos;s Arrow interface involves two steps.</p><p>First, execute the query using the Arrow interface:</p><pre><code class="language-rust">let mut result: duckdb_arrow = ptr::null_mut();
let state = duckdb_query_arrow(conn, sql.as_ptr(), &amp;mut result);

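// Aside: duckdb_query_arrow takes a const char *, so whatever `sql` is here
// needs to be NUL-terminated -- e.g. built with CString::new(..) -- since
// calling .as_ptr() on a plain Rust String hands over bytes with no
// trailing NUL.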
// An example of error handling with this API. I&apos;ll skip this everywhere else.
if state == duckdb_state_DuckDBError {
    let error_message: *const c_char = duckdb_query_arrow_error(result);
    let error_message = CStr::from_ptr(error_message).to_str().unwrap();
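    // If this weren&apos;t about to panic, we&apos;d presumably also want to free the
    // result here (duckdb_destroy_arrow) rather than leak it.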
    panic!(&quot;{}&quot;, error_message);
}</code></pre><p>... then fetch batches of results:</p><pre><code class="language-rust">let mut ffi_arrow_array: arrow2::ffi::ArrowArray = arrow2::ffi::ArrowArray::empty();
let state = duckdb_query_arrow_array(
    result,
    &amp;mut &amp;mut ffi_arrow_array as *mut _ as *mut *mut c_void, // I don&apos;t get this. I got it from duckdb-rs.
);</code></pre><p>I do not understand why <code>ffi_arrow_array</code> needs to be cast this way. I do understand:</p><ul><li>The <a href="https://duckdb.org/docs/api/c/api?ref=vikramoberoi.com#duckdb_query_arrow_array">C API</a> requires a <code>duckdb_arrow_array *</code> for that second argument.</li><li><code>duckdb_arrow_array</code> is an alias for <code>void *</code> (this is in the <a href="https://github.com/duckdb/duckdb/blob/master/src/include/duckdb.h?ref=vikramoberoi.com">DuckDB C API headers</a>).</li><li>The argument it expects, then, is a <code>void **</code>.</li></ul><p>But the cast in Rust is baffling to me. What is <code>as *mut _</code> doing in the middle of this? What does this do and why is it necessary? If you know, <a href="https://twitter.com/voberoi?ref=vikramoberoi.com">please tell me</a>.</p><p>On another note, piecing all this together is painful. I had to read <a href="https://duckdb.org/docs/api/c/api?ref=vikramoberoi.com#duckdb_query_arrow_array">DuckDB&apos;s C API</a> docs, <a href="https://github.com/duckdb/duckdb/blob/master/src/include/duckdb.h?ref=vikramoberoi.com">DuckDB&apos;s C header</a>, and the <a href="https://docs.rs/arrow2/latest/arrow2/?ref=vikramoberoi.com">arrow2 crate docs</a> to figure out why this particular incantation works. I am not sure there is a good solution to this. These are disparate, relatively immature projects. Maybe it just takes time? (And blog posts like these?)</p><p><strong>Turning an <code>arrow2::ffi::ArrowArray</code> into a Polars DataFrame</strong></p><p>With the code above we&apos;ve got a batch of results from DuckDB populated <a href="https://arrow.apache.org/docs/format/CDataInterface.html?ref=vikramoberoi.com#structure-definitions">in these Arrow C structs</a>. But we&apos;d like to go from these to a Polars DataFrame.
How?</p><p>I used the <code>arrow2</code> crate instead of <code>arrow</code> because it looks like Polars uses <code>arrow2</code>, and I figured I might have an easier on-ramp to creating a Polars DataFrame.</p><p>We need to do the following:</p><ul><li>Create a Rust-native Arrow array so we stop working with these C structs.</li><li>Create a Polars Series for each column in our result set from the Arrow array.</li><li>Create a Polars DataFrame from those Series.</li></ul><p>To start with, we also need to get the Arrow array&apos;s schema. DuckDB provides an API for this too:</p><pre><code class="language-rust">let mut schema = arrow2::ffi::ArrowSchema::empty();
let state =
    duckdb_query_arrow_schema(result, &amp;mut &amp;mut schema as *mut _ as *mut *mut c_void);</code></pre><p>(There&apos;s that baffling cast again.)</p><p>Now we&apos;re going to exit C API-land and convert these raw C structs into a Rust-native <code>arrow2::array::Array</code>.</p><pre><code class="language-rust">let field = arrow2::ffi::import_field_from_c(schema).unwrap();
let arrow_array =
    arrow2::ffi::import_array_from_c(ffi_arrow_array, field.data_type).expect(&quot;ok&quot;);</code></pre><p><code>arrow_array</code> is an <code>arrow2::array::Array</code>, which is a trait object that can be downcast to the specific Arrow array type you want. I don&apos;t know what a Rust trait object is yet, but I do know I want to iterate over an array of integers or whatever my query actually returns.</p><p>Let&apos;s assume this is our query:</p><pre><code class="language-sql">CREATE TABLE users (
    id INTEGER,
    username VARCHAR
);

-- Insert a bunch of data --
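-- e.g.: INSERT INTO users VALUES (1, &apos;ada&apos;), (2, &apos;grace&apos;);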

SELECT id, username FROM users;</code></pre><p>It turns out that the <code>arrow_array</code> DuckDB&apos;s Arrow interface returns is a <a href="https://docs.rs/arrow2/latest/arrow2/array/struct.StructArray.html?ref=vikramoberoi.com"><code>StructArray</code></a>, which is just a struct that contains multiple Arrow arrays of the same length. For our query above, it&apos;ll be a <code>StructArray</code> containing two arrays: one for <code>id</code> and one for <code>username</code>.</p><p>So we first downcast to a <code>StructArray</code>:</p><pre><code class="language-rust">let struct_array = arrow_array
    .as_any()
    .downcast_ref::&lt;StructArray&gt;()
    .expect(&quot;This Arrow Array should be a StructArray.&quot;);</code></pre><p>Then we downcast each array in <code>struct_array</code> to their appropriate types. This was another area that took a while to grok. What are the types DuckDB returns in the <code>struct_array</code>? Are those user IDs signed or unsigned? Are they 32-bit or 64-bit?</p><p>Print out <code>struct_array.fields()</code> to view the data types you&apos;re getting back. If you wanted to construct a DataFrame dynamically from any query, you&apos;d have to inspect these fields then downcast each array to the corresponding type.</p><p>Here&apos;s what the code looks like to fetch and downcast to the specific Arrow array types we need for the query above:</p><pre><code class="language-rust">let id_array = struct_array.values()[0]
    .as_any()
    .downcast_ref::&lt;Int32Array&gt;()
    .unwrap();

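// To handle arbitrary queries instead, you&apos;d inspect struct_array.fields()
// and choose each downcast target from its data_type (e.g. Int64 -&gt;
// Int64Array) rather than hardcoding the types like this.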
let username_array = struct_array.values()[1]
    .as_any()
    .downcast_ref::&lt;Utf8Array&lt;i32&gt;&gt;()
    .unwrap();</code></pre><p><code>Int32Array</code> is an alias provided by <code>arrow2</code> for a <code>PrimitiveArray</code> containing <code>i32</code>-type data. All primitive types are stored in Arrow arrays of type <code>PrimitiveArray</code>.</p><p><code>Utf8Array</code> is not an alias. And I think the generic argument to <code>Utf8Array</code> is the type used for Arrow offsets. Specifically, these are the offsets used to identify where a UTF-8 string at a given index might start. If you have the array [&quot;hello&quot;, &quot;world&quot;], the offset for &quot;hello&quot; is <code>0</code> and the offset of &quot;world&quot; is <code>5</code>.</p><p>(If that is incorrect, <a href="https://twitter.com/voberoi?ref=vikramoberoi.com">holler</a>.)</p><p>A good place to start to read more about these types is the docs for <a href="https://docs.rs/arrow2/0.1.0/arrow2/array/index.html?ref=vikramoberoi.com"><code>arrow2::array</code></a>.</p><p>Finally, we want to create Series from these arrays:</p><pre><code class="language-rust">let id_series = Series::try_from((&quot;id&quot;, id_array.to_boxed())).unwrap();
let username_series =
    Series::try_from((&quot;username&quot;, username_array.to_boxed())).unwrap();</code></pre><p><code>to_boxed()</code> turns the downcasted arrays into a boxed <code>arrow2::array::Array</code>, the trait object all Arrow arrays are represented as in <code>arrow2</code>. This is what Polars <code>Series</code> expects.</p><p>Finally, our dataframe:</p><pre><code class="language-rust">let df = DataFrame::new(vec![id_series, username_series]).unwrap();</code></pre><p>... and that&apos;s how you start with a DuckDB query and end up with a Polars dataframe in Rust.</p><p><strong>Further reading</strong></p><ul><li>I&apos;ve put together a full example of the above <a href="https://github.com/voberoi/duckdb-polars-rust?ref=vikramoberoi.com">on Github</a>. It uses a Parquet file from the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page?ref=vikramoberoi.com">NYC TLC dataset</a>.</li><li><a href="https://duckdb.org/docs/api/c/api?ref=vikramoberoi.com">DuckDB C API</a></li><li><a href="https://github.com/duckdb/duckdb/blob/master/src/include/duckdb.h?ref=vikramoberoi.com">duckdb.h</a></li><li><a href="https://docs.rs/arrow2/latest/arrow2/?ref=vikramoberoi.com">Arrow2</a></li><li><a href="https://duckdb.org/2021/12/03/duck-arrow.html?ref=vikramoberoi.com">This post on DuckDB&apos;s Arrow integration is interesting.</a></li></ul><hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[Compilers with David Beazley: a recursive descent into madness (and delight)]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/11/compiler-cat-color.png" class="kg-image" alt loading="lazy" width="2000" height="2075" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/11/compiler-cat-color.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/11/compiler-cat-color.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/11/compiler-cat-color.png 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2022/11/compiler-cat-color.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A cat taking a compilers course. Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>I got carried away with that title, sorry.</p><p>I wrapped up <a href="https://dabeaz.com/?ref=vikramoberoi.com">David Beazley</a>&apos;s week-long <a href="https://dabeaz.com/compiler.html?ref=vikramoberoi.com">compilers course</a> yesterday and it was a total hoot. 
If you&apos;ve been considering taking this course, I want to share my (lovely) experience</p>]]></description><link>https://www.vikramoberoi.com/compilers-with-david-beazly-a-recursive-descent-into-madness-and-delight/</link><guid isPermaLink="false">63781d4bfe609c003da60f30</guid><category><![CDATA[compilers]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Sat, 19 Nov 2022 21:06:56 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/11/compiler-cat-color.png" class="kg-image" alt loading="lazy" width="2000" height="2075" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/11/compiler-cat-color.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/11/compiler-cat-color.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/11/compiler-cat-color.png 1600w, https://www.vikramoberoi.com/content/images/size/w2400/2022/11/compiler-cat-color.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A cat taking a compilers course. Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a><span style="white-space: pre-wrap;">.</span></figcaption></figure><p>I got carried away with that title, sorry.</p><p>I wrapped up <a href="https://dabeaz.com/?ref=vikramoberoi.com">David Beazley</a>&apos;s week-long <a href="https://dabeaz.com/compiler.html?ref=vikramoberoi.com">compilers course</a> yesterday and it was a total hoot. If you&apos;ve been considering taking this course, I want to share my (lovely) experience and what you can expect from it.</p><p>David&apos;s compilers class is an intense, hands-on primer to implementing a programming language. Over five days, we built an interpreter and compiler for a toy language called Wabbit from scratch. 
David designed the language and the spec is small enough to get something cool working while having <em>gobs</em> to wrap up and explore if you choose to.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/11/Screen-Shot-2022-11-19-at-10.08.27-AM.png" class="kg-image" alt loading="lazy" width="1522" height="1426" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/11/Screen-Shot-2022-11-19-at-10.08.27-AM.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/11/Screen-Shot-2022-11-19-at-10.08.27-AM.png 1000w, https://www.vikramoberoi.com/content/images/2022/11/Screen-Shot-2022-11-19-at-10.08.27-AM.png 1522w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">It was thrilling to be able to compile and run Wabbit code that prints out the Mandelbrot set!</span></figcaption></figure><p>David&apos;s done an outstanding job designing the project for this class. It was challenging but I learned a ton and always had something to show for my effort. Every day we&apos;d have working code for some practical aspect of Wabbit that also set us up for the next day. It was a <em>great</em> way to build momentum.</p><p>For example, on day one I built a syntax tree model and a code formatter that could take a tree I constructed and generate valid, nicely-formatted Wabbit code. This might look something like this:</p><pre><code class="language-Python"># The following returns &quot;print 4 + 5;&quot;
#
format(
    Statements(
        PrintStatement(
            Add(IntegerLiteral(&quot;4&quot;), IntegerLiteral(&quot;5&quot;))
        )
    )
)</code></pre><p>The example above is trivial but the model I built on the first day covered the entire language spec. It gave me a solid foundation to build a parser the next day. I also had a practical tool for Wabbit and got to appreciate how code formatters I use every day are built. <a href="https://github.com/psf/black/blob/main/src/black/linegen.py?ref=vikramoberoi.com">Python Black&apos;s line generation code</a> is not far from what we implemented on day one.</p><p>I think that is an awesome result to have so quickly, and it&apos;s what you can expect daily.</p><p>I won&apos;t go into more detail here because David experiments with his course and spends a ton of time on its design and his materials.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/11/Screen-Shot-2022-11-19-at-10.37.28-AM.png" class="kg-image" alt loading="lazy" width="1246" height="642" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/11/Screen-Shot-2022-11-19-at-10.37.28-AM.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/11/Screen-Shot-2022-11-19-at-10.37.28-AM.png 1000w, https://www.vikramoberoi.com/content/images/2022/11/Screen-Shot-2022-11-19-at-10.37.28-AM.png 1246w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Emitting useful error messages was easily the most difficult part of the course for me. It is really tough to do this well and where I made the least progress.</span></figcaption></figure><p>If you&apos;re going to take the course, note that it is a ~40 hour commitment. There are about 3 hours of discussion daily with the rest of the time dedicated to hands-on implementation. There&apos;s a Gitter chat where David and all the students can ask questions and share code and confusion. 
Everyone&apos;s working on their implementation in the open, David included, so you can pop into different branches and see approaches folks are taking.</p><p>The course is light on theory. I took a compilers course in college about 15 years ago and I think we spent ~2-3 weeks investigating different parsing techniques and talking about automata and grammars. David does not do that.</p><p>Yet I have a solid foundation for further study after this week. I know how topics like parsing, type checking, semantic analysis, garbage collection, and code optimization fit in the language implementation landscape. I understand the contours of those topics, what to search for, and how I might go about navigating code to explore them in open source projects. I <em>think</em> I know where I would need to begin with any of those in my Wabbit implementation.</p><p>One neat thing I want to note is that we talked about error message generation more than I expected. David&apos;s keen on figuring out how to do this well. He mentioned that compiler books all punt on the topic of error messages, but he doesn&apos;t because they&apos;re core to making a programming language usable. </p><p>Importantly, it&apos;s representative of where his course falls on the spectrum between theory and practice. </p><p>David&apos;s a great teacher. If you haven&apos;t seen any of his talks on YouTube yet, seek them out. His compilers course is a fantastic way to survey the field as a practitioner. You&apos;ll write a lot of code and build an interpreter <em>and</em> a compiler in five days. The work will challenge you but you&apos;ll learn a lot and be rewarded with some exciting results.</p><p>If that sounds like your cup of tea, take the course. 
Sign up for compilers, or another one of David&apos;s courses, here: <a href="https://dabeaz.com/?ref=vikramoberoi.com">https://dabeaz.com/</a>.</p><p><strong>Quick tips for folks about to take the course</strong></p><p>If you&apos;re about to take the course, here are some quick tips for you:</p><ul><li><strong>Definitely use a programming language you are comfortable in. </strong>I do not recommend using this as an opportunity to learn a new language. Make no mistake: you will write a <em>lot </em>of code and you don&apos;t want other details to distract you from the material at hand.</li><li><strong>In a similar vein, this isn&apos;t a good time to mess with your editor/environment. </strong>You might be tempted to try out a new editor or something. I don&apos;t recommend doing that either. It&apos;s distracting.</li><li><strong>Poke around everyone&apos;s branches. </strong>It is super interesting to see how other folks approach the same problem. You might learn something new.</li><li><strong>Get good sleep, eat well, etc. </strong>I guess this should apply to... everyday life. But the pace of David&apos;s course is intense and my brain was fried every day. I couldn&apos;t have kept pace if I wasn&apos;t feeling good.</li><li><strong>Have fun! </strong>His class will challenge and confuse you. You will make a lot of mistakes and introduce a lot of bugs. Your code will probably be a mess. Don&apos;t worry about it, roll with it, and enjoy the process! It&apos;s fun to hack on something new.</li></ul><hr><p>Questions? Comments? Feedback?
Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[How I made atariemailarchive.org]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg" class="kg-image" alt loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 1600w, https://www.vikramoberoi.com/content/images/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An artist&apos;s rendition of the author reading Jed Margolin&apos;s emails on the NYC subway. Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F973;</div><div class="kg-callout-text">I&apos;ve published the data behind <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> under a Creative Commons license. 
<a href="https://github.com/voberoi/atariemailarchive-data?ref=vikramoberoi.com">You can find it on Github.</a></div></div><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F399;&#xFE0F;</div><div class="kg-callout-text">Update (March 29, 2023): You can</div></div>]]></description><link>https://www.vikramoberoi.com/how-i-made-atariemailarchive-org/</link><guid isPermaLink="false">631744fa6a7edc003d24c0a1</guid><category><![CDATA[atariemailarchive.org]]></category><category><![CDATA[atari]]></category><category><![CDATA[email]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Fri, 09 Sep 2022 16:27:52 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg" class="kg-image" alt loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 1600w, https://www.vikramoberoi.com/content/images/2022/09/dc4fa6867e1a75a758ea4af58bffd498371617dd.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">An artist&apos;s rendition of the author reading Jed Margolin&apos;s emails on the NYC subway. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F973;</div><div class="kg-callout-text">I&apos;ve published the data behind <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> under a Creative Commons license. <a href="https://github.com/voberoi/atariemailarchive-data?ref=vikramoberoi.com">You can find it on Github.</a></div></div><div class="kg-card kg-callout-card kg-callout-card-yellow"><div class="kg-callout-emoji">&#x1F399;&#xFE0F;</div><div class="kg-callout-text">Update (March 29, 2023): You can listen to me talk about atariemailarchive.org on <a href="https://podcast.data-is-plural.com/2159594/12535597-s1e5-atari-emails?ref=vikramoberoi.com">episode 5 of Data is Plural&apos;s inaugural podcast season.</a></div></div><p>In 2015 I discovered <a href="https://www.jmargolin.com/vmail/vmail.htm?ref=vikramoberoi.com">Jed Margolin&apos;s emails from his tenure as a hardware engineer at Atari</a>: 4,128 messages he sent and received between 1983 and 1992.</p><p>... 
and then I read them all during my subway commute to work for the next several months.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/09/blogimage.png" class="kg-image" alt loading="lazy" width="776" height="770" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/09/blogimage.png 600w, https://www.vikramoberoi.com/content/images/2022/09/blogimage.png 776w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">This is what it looks like reading Jed&apos;s emails from </span><a href="http://www.jmargolin.com/vmail/vmail.htm?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">his website</span></a><span style="white-space: pre-wrap;"> on an iPhone. </span><b><strong style="white-space: pre-wrap;">Not pictured:</strong></b><span style="white-space: pre-wrap;"> being jostled around on a crowded train commute while reading it.</span></figcaption></figure><p>Not all four thousand messages in Jed&apos;s inbox are captivating. I&apos;ll admit I glossed over a lot of the 1989 <a href="https://atariemailarchive.org/thread/stun-runner-status-report-936?ref=vikramoberoi.com">project status reports</a> for <a href="https://en.wikipedia.org/wiki/S.T.U.N._Runner?ref=vikramoberoi.com">Stun Runner</a>. But they were collectively fascinating. I kept finding <a href="https://atariemailarchive.org/top-threads?ref=vikramoberoi.com">gems</a>.</p><p>I loved all the day-to-day mundanity too: coworkers throwing wee jokes into their status reports, minutes from engineering meetings, Jed&apos;s constant displeasure with Bob Frye from facilities. (Why are there always ants in the cafeteria, Bob?!)</p><p>Technology&apos;s changed since the 80&apos;s and that came across loudly when I read Jed&apos;s emails. But all the human interaction and how engineers built technology and did creative work felt the same. 
Reading Jed&apos;s emails was relatable and comforting.</p><p>In 2017, two years after I discovered Jed&apos;s emails, I quit my job and took some time off. That spring I contacted Jed for permission to work on what would later become <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a>.</p><blockquote>Hi, Vikram.<br><br>Yes, I am ok with it (as long as you do not use it for any nefarious<br>purposes).  :-)<br><br>I don&apos;t think there is any legal stuff to do.<br><br>And thank you. I don&apos;t want this stuff to disappear when I am gone. (I am<br>old and decrepit.)<br><br>Regards,<br><br>Jed</blockquote><p>Confident that I met Jed&apos;s stringent criteria not to weaponize his publicly-available professional emails from the 80&apos;s, I began work.</p><p>I contacted Jed on May 15th and released <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> on December 15th. I worked on the project on and off over seven months.</p><p><a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> allows folks to enjoy Jed&apos;s emails without obsessively reading <em>everything </em>like I did. It gives readers threaded conversations, curation with a <a href="https://atariemailarchive.org/top-threads?ref=vikramoberoi.com">&quot;best of&quot; list</a>, and an easier-to-read interface than <a href="https://www.jmargolin.com/vmail/Vax83.txt?ref=vikramoberoi.com">a giant text file</a>.</p><p>But that&apos;s not how it started. 
I built a version and wrote a bunch of code that I scrapped before I built what you see on <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> today.</p><h1 id="the-version-that-never-saw-the-light-of-day">The version that never saw the light of day</h1><p>My goal with <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> is simple: I want it to be able to hook you during your lunch hour.</p><p>As a proxy to test if you might get hooked, I just want any indication that you are consumed by <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> when I tell you about it. If you tap around, chuckle a lot, and start ignoring me, that is sufficient.</p><p>By this rigorous and scientific measure, the first version I built failed spectacularly. Readers got bored quickly.</p><p>I had initially built a search interface and an easier way to read and navigate all of Jed&apos;s individual messages. When I showed it to folks, they tapped around, searched for <code>jobs</code> and <code>wozniak</code> (neither appears in the archive), and then asked me what they should look for.</p><p>If you aspire to curate someone&apos;s public emails some day, here&apos;s a tip: search is an awful way to curate text when your readers don&apos;t know what is interesting about it. This is obvious in retrospect, but it didn&apos;t occur to me until I showed what I built to folks.</p><p>For example, a lot of folks have created neat search interfaces for the Enron emails. I <em>know </em>there are fascinating emails in that archive, yet I&apos;ve barely read any of them.
I don&apos;t know where to look.</p><p>I realize it is significantly harder given that the Enron archive contains 500,000 messages, but I hope someone creates something like <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> for the Enron emails some day, somehow.</p><p>I know I&apos;d spend my lunch hour reading it.</p><h1 id="the-version-that-did-see-the-light-of-day">The version that <em>did</em> see the light of day</h1><p>I realized that if I wanted folks to see what I saw in the archive, I&apos;d need to guide them directly to it. <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> would need to resemble the emails I sent to friends and coworkers when I wanted to share fun messages I found in Jed&apos;s inbox.</p><p>That&apos;s what led me to do three things:</p><ol><li>I created conversation threads from every message in the archive.</li><li>I assembled a <a href="https://atariemailarchive.org/top-threads?ref=vikramoberoi.com">&quot;best of&quot; list</a> with some commentary.</li><li>I <a href="https://atariemailarchive.org/categories?ref=vikramoberoi.com">tagged threads</a>.</li></ol><p>I actually tried #2 and #3 on Jed&apos;s individual messages first to test it out. It worked. Folks tapped around and chuckled a lot. They told me they wanted more.</p><p>But when I tried to tag more messages and add them to the &quot;best of&quot; list, I quickly realized how many messages existed in a larger thread, and how important that context was for all the little discoveries I made to truly<em> hit</em>. You can see this easily if you click around on <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> today.</p><p>Threads did not exist in the 80&apos;s. Jed&apos;s inbox was just a bunch of email ordered chronologically. So I bit the bullet and created conversation threads from <em>all 4,128 of Jed&apos;s messages.</em></p><p>As you can imagine, this was tedious.
It was the most time-consuming part of this project. I was living in Seattle at the time and I&apos;d like to thank the folks at <a href="http://www.porchlightcoffee.com/?ref=vikramoberoi.com">Porchlight Coffee</a> for letting me hang out there over many, many afternoons while I manually threaded all of Jed&apos;s emails.</p><h1 id="whats-next-for-the-archive">What&apos;s next for the archive?</h1><p>I decided to publish the dataset I created for <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> under a Creative Commons license today. You can find it <a href="https://github.com/voberoi/atariemailarchive-data?ref=vikramoberoi.com">on Github</a>.</p><p>I would be thrilled to see people do interesting things with it. If you do, please let me know.</p><p>People use the archive in different ways. Occasionally readers will link to messages that help them in some investigation or research they&apos;re doing. <a href="https://twitter.com/PhilBennett3D/status/1395432015214759937?s=20&amp;t=OywCQlWoDaztgPOCX4C2hA&amp;ref=vikramoberoi.com">Here&apos;s one of my favorites.</a> Jed received a price list for custom parts made for Atari hardware, and the person who authored this tweet found them in a breakdown of an arcade cabinet. You can see the part ids match in the image and the email.</p><p>It has been a joy to see folks link to and enjoy the archive and I hope to see more of that. I get messages of appreciation every now and then. I love receiving those. It validates all the time I spent on this bizarre obsession.</p><h1 id="technical-notes-for-those-interested">Technical notes for those interested</h1><p><a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> began its life as a Python <a href="https://trypyramid.com/?ref=vikramoberoi.com">Pyramid</a> app with a SQLite back-end in 2017.
It ran on a $5/month Linode instance until early September 2022.</p><p>This week I ported it to Python Flask and deployed it to <a href="https://fly.io/?ref=vikramoberoi.com">Fly.io</a>. The app is trivial and porting it to Python Flask took two afternoons. Deploying <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a> to Fly.io took very little time &#x2013; Fly.io was a joy to use and it will cost me nothing to host the archive there.</p><p>The <a href="https://github.com/voberoi/atariemailarchive-data?ref=vikramoberoi.com">dataset on Github </a>is in a SQLite file. It contains structured, threaded messages that I parsed and assembled from text files on Jed&apos;s site. It does not contain any of the curation and editorialization on <a href="https://atariemailarchive.org/?ref=vikramoberoi.com">atariemailarchive.org</a>.</p>
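<p>If you want to poke at the dataset yourself, here is a minimal sketch using Python&apos;s standard library. The filename below is a placeholder, not necessarily the actual name of the file in the repository:</p>

```python
import sqlite3

def list_tables(db_path):
    """Return the names of the tables in a SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        # sqlite_master is SQLite's built-in schema table.
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

# "atariemailarchive.db" is a hypothetical filename -- check the
# repository for the real one.
print(list_tables("atariemailarchive.db"))
```

<p>From there, a <code>SELECT</code> or two against whatever tables turn up will show you the threaded message structure.</p>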
<!--kg-card-begin: html-->
<span style="font-size: 0.9em; font-style: italic">Thanks to Katie Brookoff.</span>
<!--kg-card-end: html-->
<hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
            
            <div class="kg-signup-card-content">
                
                <div class="kg-signup-card-text ">
                    <h2 class="kg-signup-card-heading"><span style="white-space: pre-wrap;">Enjoyed this post?</span></h2>
                    <p class="kg-signup-card-subheading"><span style="white-space: pre-wrap;">I write ~2 times a month on technical topics just like this one.</span></p>
                    
        <form class="kg-signup-card-form" data-members-form="signup">
            
            <div class="kg-signup-card-fields">
                <input class="kg-signup-card-input" id="email" data-members-email type="email" required="true" placeholder="Your email">
                <button class="kg-signup-card-button kg-style-accent" style="color: #FFFFFF;" type="submit">
                    <span class="kg-signup-card-button-default">Subscribe</span>
                    <span class="kg-signup-card-button-loading"><svg xmlns="http://www.w3.org/2000/svg" height="24" width="24" viewbox="0 0 24 24">
        <g stroke-linecap="round" stroke-width="2" fill="currentColor" stroke="none" stroke-linejoin="round" class="nc-icon-wrapper">
            <g class="nc-loop-dots-4-24-icon-o">
                <circle cx="4" cy="12" r="3"/>
                <circle cx="12" cy="12" r="3"/>
                <circle cx="20" cy="12" r="3"/>
            </g>
            <style data-cap="butt">
                .nc-loop-dots-4-24-icon-o{--animation-duration:0.8s}
                .nc-loop-dots-4-24-icon-o *{opacity:.4;transform:scale(.75);animation:nc-loop-dots-4-anim var(--animation-duration) infinite}
                .nc-loop-dots-4-24-icon-o :nth-child(1){transform-origin:4px 12px;animation-delay:-.3s;animation-delay:calc(var(--animation-duration)/-2.666)}
                .nc-loop-dots-4-24-icon-o :nth-child(2){transform-origin:12px 12px;animation-delay:-.15s;animation-delay:calc(var(--animation-duration)/-5.333)}
                .nc-loop-dots-4-24-icon-o :nth-child(3){transform-origin:20px 12px}
                @keyframes nc-loop-dots-4-anim{0%,100%{opacity:.4;transform:scale(.75)}50%{opacity:1;transform:scale(1)}}
            </style>
        </g>
    </svg></span>
                </button>
            </div>
            <div class="kg-signup-card-success">
                Email sent! Check your inbox to complete your signup.
            </div>
            <div class="kg-signup-card-error" data-members-error></div>
        </form>
        
                    <p class="kg-signup-card-disclaimer"><span style="white-space: pre-wrap;">No spam. Unsubscribe anytime.</span></p>
                </div>
            </div>
        </div>]]></content:encoded></item><item><title><![CDATA[A primer on Roaring bitmaps: what they are and how they work]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png" class="kg-image" alt loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Unfortunately not what this post is about. Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>I came across Roaring bitmaps when I learned about <a href="https://www.vikramoberoi.com/using-bitmaps-to-run-interactive-retention-analyses-over-billions-of-events-for-less-than-100-mo/">this fun hack to do retention analyses at scale with bitmaps</a>. 
Using Roaring bitmaps instead of traditional bitmaps in that application reduced memory usage from ~125GB to 300MB, an impressive 99.</p>]]></description><link>https://www.vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/</link><guid isPermaLink="false">6307e9f484a6aa004db13c94</guid><category><![CDATA[bitmaps]]></category><category><![CDATA[roaring bitmaps]]></category><category><![CDATA[data structures]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Fri, 02 Sep 2022 15:11:35 GMT</pubDate><media:content url="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.43.55.png" medium="image"/><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/24af07058354b122975355ca6527a5cdc2f7a1eb.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Unfortunately not what this post is about. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><img src="https://www.vikramoberoi.com/content/images/2023/05/Xnapper-2023-05-30-22.43.55.png" alt="A primer on Roaring bitmaps: what they are and how they work"><p>I came across Roaring bitmaps when I learned about <a href="https://www.vikramoberoi.com/using-bitmaps-to-run-interactive-retention-analyses-over-billions-of-events-for-less-than-100-mo/">this fun hack to do retention analyses at scale with bitmaps</a>. Using Roaring bitmaps instead of traditional bitmaps in that application reduced memory usage from ~125GB to 300MB, an impressive 99.8% savings.</p><p>But... how?</p><p>You can learn about Roaring bitmaps over two research papers: </p><ol><li><a href="https://arxiv.org/pdf/1402.6407.pdf?ref=vikramoberoi.com">This one proposes the data structure.</a></li><li><a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">This one introduces a critical optimization.</a></li></ol><p>In this post I briefly describe what bitmaps are, what they&apos;re used for, and what Roaring bitmaps solve that traditional bitmaps don&apos;t. Then, I distill the high-level structure of Roaring bitmaps and how they work, one step at a time.</p><p>Roaring bitmaps employ a number of algorithms, techniques, and heuristics that I won&apos;t go into in detail. They also offer some operations beyond the ones I describe. 
These details are not critical to understanding the basic internal structure and operation of Roaring bitmaps which is the focus of this post.</p><p>Let&apos;s begin!</p><h1 id="what-are-bitmaps-and-what-are-they-used-for">What are bitmaps and what are they used for?</h1><p><a href="https://en.wikipedia.org/wiki/Bit_array?ref=vikramoberoi.com">Bitmaps</a> are arrays of bits used to store sets of integers.</p><p>They work by setting the Nth bit when an integer N is in the set, as illustrated in <strong>Figure 1.</strong></p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/bitmap-3.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="534" height="340"><figcaption><b><strong style="white-space: pre-wrap;">Figure 1. </strong></b><span style="white-space: pre-wrap;">An illustration of how a bitmap works.</span></figcaption></figure><p>By storing sets of integers this way, bitmaps can take advantage of extremely fast bitwise-AND and bitwise-OR CPU instructions to compute set intersections and unions.</p><p>It turns out that fast set intersections and unions are critical for many search and database applications. Various operations exist in search and database indexes that boil down to having two sets of integers and needing to intersect or union them quickly.</p><p>Take an inverted search index, for example:</p><ul>
<li><strong>You&apos;ve indexed billions of documents.</strong> Each document has an integer id.</li>
<li><strong>The index maps terms to a set of documents in which they appear.</strong> For example, the term <code>pigeon</code> appears in documents with these ids: <code>{2, 345, 2034, ...}</code>.</li>
<li><strong>Queries that search across terms use set operations.</strong> In order to resolve a search query like <code>carrier AND pigeon</code> you need the intersection of the set of documents that contain <code>carrier</code> and the set of documents that contain <code>pigeon</code>.</li>
<li><strong>Bitwise operations can perform these set operations quickly.</strong> If you represent sets of document ids as bitmaps, the query above is a bitwise-AND.</li>
</ul>
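<p>The steps above can be sketched in a few lines of Python, using arbitrary-precision integers as bitmaps (the terms and document ids here are made up for illustration):</p>

```python
# Minimal inverted index using Python ints as bitmaps.
# Bit N of a term's bitmap is set when document N contains the term.

def make_bitmap(doc_ids):
    """Build a bitmap with the given document ids' bits set."""
    bm = 0
    for doc_id in doc_ids:
        bm |= 1 << doc_id
    return bm

def to_doc_ids(bm):
    """Recover the set of document ids from a bitmap."""
    return {i for i in range(bm.bit_length()) if bm >> i & 1}

# Hypothetical postings: which documents contain each term.
index = {
    "carrier": make_bitmap([2, 7, 345]),
    "pigeon": make_bitmap([2, 345, 2034]),
}

# Resolving "carrier AND pigeon" is a single bitwise-AND.
both = index["carrier"] & index["pigeon"]
print(sorted(to_doc_ids(both)))  # -> [2, 345]
```

<p>A real engine stores the bitmaps in fixed-width machine words so the AND runs 64 bits per instruction, but the idea is the same.</p>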
<p>Columnar databases use set operations similarly for certain classes of queries.</p><p>If you&apos;d like to dive into a use case outside search and databases, <a href="https://www.vikramoberoi.com/using-bitmaps-to-run-interactive-retention-analyses-over-billions-of-events-for-less-than-100-mo/">a previous post I wrote</a> discusses how bitmaps can be used to analyze user retention for SaaS products.</p><p>Unfortunately, bitmaps suffer from awful compression in common cases involving very large sets of integers &#x2013; scenarios that appear frequently in the use cases I just described.</p><p>Recall the figure I cited at the beginning of the post:</p><blockquote>Using Roaring bitmaps instead of traditional bitmaps in that application reduced memory usage from ~125GB to 300MB, an impressive 99.8% savings.</blockquote><p><em>That&apos;s</em> the problem with bitmaps, and I&apos;ll walk you through why this happens shortly.</p><h1 id="what-are-roaring-bitmaps">What are Roaring bitmaps?</h1><p>From <a href="https://roaringbitmap.org/?ref=vikramoberoi.com">roaringbitmap.org</a>:</p><blockquote>Roaring bitmaps are compressed bitmaps. They can be hundreds of times faster.</blockquote><p>A great, pithy marketing statement on the Roaring bitmap website! Let&apos;s expand on it a wee bit.</p><p>Roaring bitmaps are just optimized versions of plain bitmaps, <em>which I&apos;ll refer to henceforth as &quot;traditional bitmaps&quot;</em>.</p><p>Both traditional and Roaring bitmaps offer a set data structure for integers. You can insert integers, check the existence of an integer, and get the intersection and union of two sets of integers.</p><p>Roaring bitmaps offer better compression than traditional bitmaps. Importantly, they do so without significantly sacrificing the performance of set operations.</p><p><a href="https://roaringbitmap.org/?ref=vikramoberoi.com">roaringbitmap.org</a> boasts an impressive list of OLAP databases and search systems that use Roaring bitmaps under the hood.
These are all applications that:</p><ul><li>... need to store large sets of integers</li><li>... in as little memory as possible</li><li>... and execute fast set operations.</li></ul><h1 id="what-problem-do-roaring-bitmaps-solve-that-traditional-bitmaps-dont">What problem do Roaring bitmaps solve that traditional bitmaps don&apos;t?</h1><p>When a set is <strong>sparse</strong>, traditional bitmaps compress poorly.</p><p>Recall that a traditional bitmap will set the Nth bit when you add an integer N to it (see <strong>Figure 1</strong>).</p><p>Let&apos;s say you have an empty traditional bitmap to which you add the integer 8,000,000. Here&apos;s what will happen:</p><ul><li>It will allocate 1,000,000 bytes.</li><li>It will set the 8,000,000th bit.</li></ul><p>This is illustrated in <strong>Figure 2</strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/8mm-bit.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="590" height="366"><figcaption><b><strong style="white-space: pre-wrap;">Figure 2. </strong></b><span style="white-space: pre-wrap;">What happens if you allocate the 8 millionth bit outright in an empty bitmap.</span></figcaption></figure><p>Why is this bad?</p><ul><li>Your set has 1 integer.</li><li>An integer takes up 4 bytes.</li><li>Your traditional bitmap has allocated 1 megabyte.</li></ul><p>That&apos;s over 5 orders of magnitude more memory than you need.</p><p>Whoops!</p><p>Roaring bitmaps solve this problem. Importantly, they do so <strong>while maintaining fast set operations</strong>. This is what makes Roaring bitmaps special.</p><p>Prior research attempting to solve poor compression in bitmaps achieved impressive results too, but at the cost of efficient set operations.</p><h1 id="how-do-roaring-bitmaps-work">How do Roaring bitmaps work?</h1><p>There isn&apos;t one deep insight that allows Roaring bitmaps to perform well.
But there is <em>a lot of stuff</em> going on that is greater than the sum of its parts. </p><p>The following builds up an understanding of Roaring bitmaps one concept at a time.</p><h2 id="part-1-how-roaring-bitmaps-are-represented-in-memory">Part 1: how Roaring bitmaps are represented in memory</h2><h3 id="all-32-bit-integers-are-partitioned-into-contiguous-chunks">All 32-bit integers are partitioned into contiguous chunks.</h3><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/chunks-1.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="1276" height="678" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/chunks-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/chunks-1.png 1000w, https://www.vikramoberoi.com/content/images/2022/08/chunks-1.png 1276w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 3. </strong></b><span style="white-space: pre-wrap;">How the space of 32-bit integers is partitioned into chunks in Roaring bitmaps.</span></figcaption></figure><p>Each chunk shares the same 16 most significant bits.</p><p>As shown in <strong>Figure 3, </strong>the partitioning scheme used by Roaring bitmaps ensures that an integer will always belong to the same chunk of 2^16, or 65,536 consecutive integers.</p><p><strong>Note: </strong>there are 64-bit implementations of Roaring bitmaps, but this post does not go into them. 
See <a href="https://github.com/RoaringBitmap/CRoaring?ref=vikramoberoi.com">CRoaring</a>, this <a href="https://github.com/outcaste-io/sroar?ref=vikramoberoi.com">native Go implementation</a>, and, uh, <a href="https://r-libre.teluq.ca/930/1/Roaring64bits.pdf?ref=vikramoberoi.com">this paper written in French</a> (if you find a translation, please let me know).</p><h3 id="integers-in-the-same-chunk-are-stored-in-containers">Integers in the same chunk are stored in containers.</h3><p>Chunks are how integers are logically partitioned in Roaring bitmaps. All integers that belong to a chunk are physically (in memory) stored in the same <strong><em>container</em></strong>.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/containers-3.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="1696" height="964" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/containers-3.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/containers-3.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/containers-3.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/containers-3.png 1696w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 4. </strong></b><span style="white-space: pre-wrap;">Three contrived examples of containers from the first Roaring bitmap paper.&#xA0;</span></figcaption></figure><p><strong>Figure 4 </strong>shows examples of three different containers for three different chunks.</p><p>A chunk will always only have one container in a Roaring bitmap, defined by the 16 most significant bits of all the integers in the chunk.</p><p>If your program inserted the first 1,000 multiples of 62 into a Roaring bitmap, then they would end up in the left-most container in <strong>Figure 4.
</strong>That container&apos;s cardinality would be 1,000.</p><p>If you later inserted the integer 63, it would end up in the same container. The container&apos;s cardinality would then be 1,001.</p><p>As you&apos;ll see next, the cardinality of a container determines how it will be represented in memory.</p><h3 id="sparse-containers-contain-4096-integers-these-are-stored-as-sorted-packed-arrays">Sparse containers contain &lt;= 4,096 integers. These are stored as sorted packed arrays.</h3><p>Two of the containers in <strong>Figure 4</strong> are sparse (with cardinalities 1,000 and 100) so they will be stored as <strong><em>sorted packed arrays </em></strong>of 16-bit integers.</p><p>By <strong>packed</strong>, we mean that we will be packing 32-bit integers into 16-bit integers as shown in <strong>Figure 5</strong>. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/sparse-containers-2.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="1668" height="522" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/sparse-containers-2.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/sparse-containers-2.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/sparse-containers-2.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/sparse-containers-2.png 1668w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 5. </strong></b><span style="white-space: pre-wrap;">Two sparse Roaring bitmap containers from Figure 4 alongside examples of how they are stored in memory.</span></figcaption></figure><p>Packing integers is possible because each container can store at most 2^16 distinct integers.
To get the original 32-bit integer in a sparse container, we have to unpack<strong> </strong>it by combining the 16-bit integer with its 16 most significant bits.</p><p>These arrays are dynamically allocated so the memory used by a sparse container grows as it accrues integers.</p><h3 id="dense-containers-contain-4096-integers-these-are-stored-as-bitmaps">Dense containers contain &gt; 4,096 integers. These are stored as bitmaps.</h3><p>One of the containers in <strong>Figure 4 </strong>is dense (with cardinality 2<sup>15</sup>) so it will be stored as a traditional bitmap.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/dense-containers-2.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="1668" height="288" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/dense-containers-2.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/dense-containers-2.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/dense-containers-2.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/dense-containers-2.png 1668w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 6. </strong></b><span style="white-space: pre-wrap;">A dense Roaring bitmap container from Figure 2 alongside an example of how it is stored in memory.</span></figcaption></figure><p>Dense containers are bitmaps containing 2^16 bits (8 kilobytes), allocated outright. The Nth bit in the bitmap maps to the Nth integer in a chunk.</p><h3 id="a-first-level-index-points-to-all-containers-the-index-is-stored-as-a-sorted-array">A first-level index points to all containers. 
The index is stored as a sorted array.</h3><p>The first-level index stores the 16 most significant bits for each container in the Roaring bitmap along with a pointer to the corresponding container.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/index-1.png" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="1234" height="600" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/index-1.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/index-1.png 1000w, https://www.vikramoberoi.com/content/images/2022/08/index-1.png 1234w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 7. </strong></b><span style="white-space: pre-wrap;">The containers described in Figure 2, 3, and 4 with a first-level index pointing to them.</span></figcaption></figure><p>The index is stored as a sorted array and grows dynamically as new containers are added to the Roaring bitmap.</p><h2 id="part-2-how-set-operations-work-with-roaring-bitmaps">Part 2: how set operations work with Roaring bitmaps</h2><h3 id="integer-insertion-varies-by-container-type-and-may-cause-a-containers-type-to-change">Integer insertion varies by container type and may cause a container&apos;s type to change.</h3><p>To insert an integer, N, get N&apos;s 16 most significant bits (<code>N / 2^16</code>) and use it to find N&apos;s corresponding container in the Roaring bitmap.</p><p>Insertions in array and bitmap containers work differently:</p><ul><li><strong>Bitmap: </strong>set the bit at <code>N % 2^16</code>.</li><li><strong>Array: </strong>insert <code>N % 2^16</code> in its position in the sorted array.</li></ul><p><em>Insertions may change the container type. </em>If an array container has 4,096 integers, first convert it to a bitmap container. 
Then set the bit at <code>N % 2^16</code>.</p><p>If a container doesn&apos;t already exist then create a new array container, add it to the Roaring bitmap&apos;s first-level index, and add N to the array.</p><h3 id="checking-for-existence-varies-by-container-type">Checking for existence varies by container type.</h3><p>To check if an integer N exists, get N&apos;s 16 most significant bits (<code>N / 2^16</code>) and use it to find N&apos;s corresponding container in the Roaring bitmap.</p><p>If the container doesn&apos;t exist, then N is not in the Roaring bitmap.</p><p>Checking for existence in array and bitmap containers works differently:</p><ul><li><strong>Bitmap: </strong>check if the bit at <code>N % 2^16</code> is set.</li><li><strong>Array: </strong>use binary search to find <code>N % 2^16</code> in the sorted array.</li></ul><h3 id="intersect-matching-containers-to-intersect-two-roaring-bitmaps-algorithms-vary-by-container-types-and-container-types-may-change">Intersect matching containers to intersect two Roaring bitmaps. Algorithms vary by container type(s), and container types may change.</h3><p>To intersect Roaring bitmaps A and B, it is sufficient to intersect matching containers in A and B.</p><p>This is possible because of how integers are partitioned in Roaring bitmaps: matching containers in A and B store integers with the same 16 most significant bits (the same chunks).</p><p>Intersection algorithms vary by the types of the containers involved, as do the resulting container types:</p><ul><li><strong>Bitmap /</strong> <strong>Bitmap: </strong>Compute the bitwise AND of the two bitmaps. If the cardinality is &lt;= 4,096, store the result in an array container, otherwise store it in a bitmap container.</li><li><strong>Bitmap / Array: </strong>Iterate over the array, checking for the existence of each 16-bit integer in the bitmap. 
If the integer exists, add it to the resulting array container &#x2013; note that intersections of bitmap and array container types will always create an array container.</li><li><strong>Array / Array: </strong>Intersections of two array containers always create a new array container. The algorithm used to compute the intersection varies by a <a href="https://arxiv.org/pdf/1402.6407.pdf?ref=vikramoberoi.com">cardinality heuristic described at the bottom of page 5 here</a>. It will either be a simple merge (as used in merge sort) or a galloping intersection, <a href="https://dl.acm.org/doi/10.1145/1877766.1877767?ref=vikramoberoi.com">described in this paper</a>.</li></ul><p>If there is a container in either Roaring bitmap without a corresponding container in the other, it will not exist in the result: the intersection of an empty set and any set is an empty set.</p><h3 id="union-matching-containers-to-produce-a-roaring-bitmap-union-algorithms-vary-by-container-types-and-container-types-may-change">Union matching containers to produce a Roaring bitmap union. Algorithms vary by container type(s), and container types may change.</h3><p>To union Roaring bitmaps A and B, union all matching containers in A and B. </p><p>Union algorithms vary by the container types involved, as do the resulting container types:</p><ul><li><strong>Bitmap / Bitmap: </strong>Compute the bitwise OR of the two bitmaps. Unions of two bitmap containers will always create another bitmap container.</li><li><strong>Bitmap / Array: </strong>Copy the bitmap and set corresponding bits for all the integers in the array container. Unions of a bitmap and array container will always create another bitmap container.</li><li><strong>Array / Array: </strong>If the sum of the cardinalities of the two array containers is &lt;= 4,096, the resulting container will be an array container. In this case, add all integers from both arrays to a new array container. 
Otherwise, optimistically assume the resulting container will be a bitmap: create a new bitmap container and set all corresponding bits for all integers in both arrays. If the cardinality of the resulting container is &lt;= 4,096, convert the bitmap container back into an array container.</li></ul><p>Finally, add all containers in A and B that do not have a matching container to the result. Remember: this is a union, so all integers in Roaring bitmaps A and B must be in the resulting set.</p><h2 id="part-3-how-a-third-and-final-container-type-%E2%80%93-the-run-container-%E2%80%93-optimizes-long-runs-of-consecutive-integers">Part 3: how a third and final container type &#x2013; the &quot;run&quot; container &#x2013; optimizes long runs of consecutive integers.</h2><p>Parts 1 and 2 of this post cover most of the internal structure and operation of Roaring bitmaps. This final part covers an important optimization described in <a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">the second Roaring bitmap paper.</a></p><h3 id="run-containers-represent-runs-of-consecutive-integers-with-two-16-bit-integers-the-run-start-and-run-length">Run containers represent runs of consecutive integers with two 16-bit integers: the run start and run length.</h3><p>From page 3 of the <a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">second Roaring bitmap paper</a>:</p><blockquote>The new container is conceptually simple: given a run (e.g., [10, 1000]), we store the starting point (10) and its length minus one (990) ... packing the starting points and the lengths in pairs, using 16 bits each ...</blockquote><p>This technique is known as <a href="https://en.wikipedia.org/wiki/Run-length_encoding?ref=vikramoberoi.com">run-length encoding</a> and seems to back most of the prior art described in the two papers. 
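</p><p>For intuition, a minimal run-length encoder over a sorted list of distinct integers (a sketch, not any Roaring library&apos;s actual code) might look like this:</p><pre><code>def encode_runs(sorted_ints):
    # Collapse sorted, distinct integers into (start, length - 1) pairs,
    # mirroring the run container format: the run [10, 1000] becomes (10, 990).
    runs = []
    for n in sorted_ints:
        if runs and n == runs[-1][0] + runs[-1][1] + 1:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            runs.append((n, 0))
    return runs

encode_runs(list(range(10, 1001)))   # [(10, 990)]
</code></pre><p>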
Run-length encoding can compress bitmaps effectively but degrades performance for set operations in many cases.</p><h3 id="run-containers-are-formed-explicitly-when-a-client-invokes-a-runoptimize-function-or-in-some-cases-implicitly-when-a-large-range-is-added-to-the-roaring-bitmap">Run containers are formed explicitly when a client invokes a <em>runOptimize </em>function or, in some cases, implicitly when a large range is added to the Roaring bitmap.</h3><p>Unlike sparse and dense containers, run containers generally do not materialize automatically.</p><ol><li>Clients can invoke <em>runOptimize </em>to optimize their Roaring bitmap for large runs of consecutive integers. Run containers <strong><em>may </em></strong>replace existing array or bitmap containers in this case.</li><li>Roaring bitmaps offer an operation to add a range of values. Run containers <strong><em>may </em></strong>materialize automatically in this case.</li></ol><p>The papers don&apos;t actually prescribe how or when #2 should happen. I&apos;d guess that if a range of values were added for a chunk that does not yet have a container, it makes sense to create a run container instead of an array or bitmap container.</p><h3 id="runoptimize-only-creates-a-run-container-if-it-will-be-smaller-than-the-container-it-would-replace"><em>runOptimize </em>only creates a run container if it will be smaller than the container it would replace.</h3><p><em>runOptimize </em>first counts the number of runs in a container.</p><p>Then, it decides whether or not to create a run container using a simple heuristic: the run container must be smaller than its equivalent array or bitmap container. 
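</p><p>To make that heuristic concrete, here is a hypothetical back-of-the-envelope comparison in Python (approximate sizes; the paper also accounts for small per-container headers, which this sketch ignores):</p><pre><code>def smallest_container_bytes(cardinality, num_runs):
    array_bytes = 2 * cardinality    # 16 bits per stored integer
    bitmap_bytes = 8192              # 2^16 bits, fixed cost
    run_bytes = 4 * num_runs         # one 16-bit start + one 16-bit length per run
    return min(("run", run_bytes),
               ("array", array_bytes),
               ("bitmap", bitmap_bytes),
               key=lambda t: t[1])

smallest_container_bytes(991, 1)      # ("run", 4)
smallest_container_bytes(5000, 4000)  # ("bitmap", 8192)
</code></pre><p>In this sketch, a single long run of 991 integers costs just 4 bytes, while a dense container riddled with thousands of short runs falls back to the fixed 8,192-byte bitmap. 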
</p><p>The algorithms used to count runs and a description of how to compute the heuristic above are described on <a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">pages 6 and 7 in the second Roaring bitmap paper</a>.</p><p>If you&apos;d like to work out computing the heuristic yourself, it&apos;ll help to recall the following:</p><ul><li>... array containers contain no more than 4,096 integers, packed into 16 bits each.</li><li>... bitmap containers contain &gt; 4,096 integers in a bitmap with 2^16 bits (8,192 bytes).</li><li>... each run in a run container takes up 32 bits (16 bits for the start, 16 bits for the length).</li></ul><h3 id="the-addition-of-run-containers-introduces-new-algorithms-for-all-set-operations">The addition of run containers introduces new algorithms for all set operations.</h3><p>The Roaring bitmap papers do not describe the algorithms used to insert and check for the existence of integers in run containers: these operations are relatively straightforward.</p><p>But the addition of run containers requires that Roaring bitmaps implement performant algorithms for unions and intersections of <em>three </em>new container type pairs:</p><ul><li>Run / Run</li><li>Run / Array</li><li>Run / Bitmap</li></ul><p>These algorithms also introduce new heuristics to determine the resulting container type from these operations.</p><p>I won&apos;t go into the details of these algorithms as I did in Part 2. 
The algorithms are not significantly more complicated (their descriptions start <a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">on page 10 here</a> if you&apos;re curious), but the numerous details are beyond the scope of this post and the paper is well-written and makes them extremely accessible.</p><h1 id="roaring-bitmaps-a-game-of-performance-whack-a-mole">Roaring bitmaps: a game of performance whack-a-mole!</h1><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/6roe7c.jpg" class="kg-image" alt="A primer on Roaring bitmaps: what they are and how they work" loading="lazy" width="666" height="500" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/6roe7c.jpg 600w, https://www.vikramoberoi.com/content/images/2022/08/6roe7c.jpg 666w"><figcaption><span style="white-space: pre-wrap;">A Roaring bitmap creator hard at work.</span></figcaption></figure><p>Roaring bitmaps use a kitchen sink of algorithms and techniques to achieve better compression and faster performance than other bitmap implementations.</p><p>They&apos;re challenging to implement but they do the job extremely well, especially when used in OLAP workloads. The creators managed to root out inefficiencies in common yet very different scenarios &#x2013; sparse data, dense data, data with a lot of runs &#x2013; and address all of them at once.</p><p>And they go even further!</p><p>A <a href="https://arxiv.org/pdf/1709.07821v4.pdf?ref=vikramoberoi.com">third paper</a> describes an implementation the creators wrote in C that leverages vectorized algorithms they designed to use SIMD (single instruction multiple data) instructions. That implementation, CRoaring, along with bindings and implementations in multiple other languages are <a href="https://github.com/RoaringBitmap?ref=vikramoberoi.com">available here</a>. 
They&apos;re used in mainstream columnar databases and search applications and are actively maintained, improved, and optimized regularly.</p><p>Very cool.</p><h1 id="further-reading">Further reading</h1><p>If you enjoyed reading this post and want to dig in further, I recommend reading the Roaring bitmap papers. They&apos;re well-written and accessible.</p><ul><li><a href="https://arxiv.org/pdf/1402.6407.pdf?ref=vikramoberoi.com">Better bitmap performance with Roaring bitmaps</a></li><li><a href="https://arxiv.org/pdf/1603.06549.pdf?ref=vikramoberoi.com">Consistently faster and smaller compressed bitmaps with Roaring</a></li><li><a href="https://arxiv.org/pdf/1709.07821v4.pdf?ref=vikramoberoi.com">Roaring Bitmaps: Implementation of an Optimized Software Library</a></li></ul><p>Roaring bitmap implementations are <a href="https://github.com/RoaringBitmap?ref=vikramoberoi.com">available on Github</a>.</p>
<!--kg-card-begin: html-->
<span style="font-size: 0.9em; font-style: italic">Thanks to Chuck Groom, Andy O&apos;Neill, Phil Eaton, Ben Johnson, and Simon Willems.</span>
<!--kg-card-end: html-->
<hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p><div class="kg-card kg-signup-card kg-width-regular " data-lexical-signup-form style="background-color: #F0F0F0; display: none;">
        </div>]]></content:encoded></item><item><title><![CDATA[Using bitmaps to run interactive retention analyses over billions of events for less than $100/mo]]></title><description><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png" class="kg-image" alt="A cartoon of a woman being quoted $100K to run a retention analysis in a product analytics tool." loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A dramatic encounter between a sales prospect and an account executive at a major product analytics SaaS company. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>Instead of paying <a href="mixpanel.com">Mixpanel</a> over $2K/month for retention reporting functionality in 2012 Amir Salihefendic (founder of <a href="doist.com">Doist</a>, makers of <a href="https://todoist.com/?ref=vikramoberoi.com">Todoist</a>) built a clever hack to get the functionality</p>]]></description><link>https://www.vikramoberoi.com/using-bitmaps-to-run-interactive-retention-analyses-over-billions-of-events-for-less-than-100-mo/</link><guid isPermaLink="false">62f169c71b396b003d4b3c70</guid><category><![CDATA[bitmaps]]></category><category><![CDATA[redis]]></category><category><![CDATA[product-analytics]]></category><dc:creator><![CDATA[Vikram Oberoi]]></dc:creator><pubDate>Wed, 10 Aug 2022 18:23:17 GMT</pubDate><content:encoded><![CDATA[<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png" class="kg-image" alt="A cartoon of a woman being quoted $100K to run a retention analysis in a product analytics tool." loading="lazy" width="2000" height="2100" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 1000w, https://www.vikramoberoi.com/content/images/size/w1600/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 1600w, https://www.vikramoberoi.com/content/images/2022/08/e614a450c1fd0283db24262475260ec98a78def0.png 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">A dramatic encounter between a sales prospect and an account executive at a major product analytics SaaS company. 
Credit: </span><a href="https://katiebcartoons.com/?ref=vikramoberoi.com"><span style="white-space: pre-wrap;">katiebcartoons.com</span></a></figcaption></figure><p>Instead of paying <a href="https://mixpanel.com">Mixpanel</a> over $2K/month for retention reporting functionality in 2012, Amir Salihefendic (founder of <a href="https://doist.com">Doist</a>, makers of <a href="https://todoist.com/?ref=vikramoberoi.com">Todoist</a>) built a clever hack to get the functionality he needed. He estimates <a href="https://twitter.com/amix3k/status/1557954159957233665?ref=vikramoberoi.com">Doist has saved millions on product analytics tools through 2022</a> while still running interactive retention analyses over billions of events.</p><p>Amir&apos;s hack was to use bitmaps and bitwise operations to answer common questions about cohorts of users. He built and open-sourced <a href="https://github.com/Doist/bitmapist?ref=vikramoberoi.com">bitmapist</a>, a Python library that tracks user events in Redis bitmaps and can generate retention/cohort reports.</p><p>It looks like Doist still uses bitmapist in production ten years later. As of this writing, Amir himself pushed the <a href="https://github.com/Doist/bitmapist/commit/04babe58e4a50a94364ccdc1991256550f654cce?ref=vikramoberoi.com">latest commit</a> three weeks ago.</p><p>This post walks you through how bitmapist works and its limitations/pitfalls. I won&apos;t cover how to read and interpret a retention report. 
If you&apos;d like a primer, <a href="https://clevertap.com/blog/cohort-analysis/?ref=vikramoberoi.com">here&apos;s a pretty good one</a>.</p><hr><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/cohort-report.png" class="kg-image" alt loading="lazy" width="1150" height="522" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/cohort-report.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/cohort-report.png 1000w, https://www.vikramoberoi.com/content/images/2022/08/cohort-report.png 1150w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 1. </strong></b><span style="white-space: pre-wrap;">A daily retention report for an app. Users enter cohorts when they sign up for the app. They return when they open the app.</span></figcaption></figure><p><strong>The key idea: each cell in a retention report represents an intersection of two sets of users.</strong></p><p>The retention report above shows:</p><ul><li>On <em>January 27th</em>, 1,257 users signed up for our app.</li><li>On <em>Day 2</em>,<strong> </strong>246 (19.6%) of the <em>January 27th</em> cohort opened the app.</li></ul><p><em>Day 2</em> relative to the <em>January 27th</em> cohort is <em>January 29th</em>.</p><p>Thousands of users may have opened our app on January 29th, but 246 of them also signed up for our app on January 27th.</p><p>Here&apos;s another way to state the above:</p><pre><code>A = Set of users who signed up on January 27th
B = Set of users who opened the app on January 29th

size(A) = 1257
size(B) = ?
size(A &#x2229; B) = 246

size(A &#x2229; B) / size(A) = 19.6%
</code></pre>
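<p>In Python terms, with plain sets standing in for bitmaps (a toy example with made-up user IDs):</p><pre><code># Each set stands in for one bitmap; elements are user IDs (bit positions).
signed_up_jan27 = {3, 17, 42, 99}     # cohort: signed up on January 27th
opened_jan29 = {17, 99, 512}          # opened the app on January 29th

# The intersection is exactly what a bitwise AND computes on the real bitmaps.
retained = signed_up_jan27.intersection(opened_jan29)
retention_rate = len(retained) / len(signed_up_jan27)   # 2 / 4 = 0.5
</code></pre>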
<p>Every cell in a retention report is computed the same way.</p><p><strong>The hack: store sets of users in bitmaps for fast set operations and high compression</strong></p><p>Bitmaps are an ideal data structure for this use case:</p><ul><li>Bitwise operations provide<em> </em>fast set intersections and unions.</li><li>Bitmaps compress this data well. A set of 8 million users fits in 1 megabyte.</li></ul><p>Bitmapist maintains user <a href="https://redis.io/docs/data-types/bitmaps/?ref=vikramoberoi.com">bitmaps in Redis</a>. Each bitmap is keyed on a <code>(time bin, event)</code> pair. If user ID  542 opens the app some time on January 25th 2022, then the bit at index 542 will be set on the bitmap with key <code>20220125:opened app</code> in the example below.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.vikramoberoi.com/content/images/2022/08/bitmap.png" class="kg-image" alt loading="lazy" width="1080" height="808" srcset="https://www.vikramoberoi.com/content/images/size/w600/2022/08/bitmap.png 600w, https://www.vikramoberoi.com/content/images/size/w1000/2022/08/bitmap.png 1000w, https://www.vikramoberoi.com/content/images/2022/08/bitmap.png 1080w" sizes="(min-width: 720px) 720px"><figcaption><b><strong style="white-space: pre-wrap;">Figure 2. </strong></b><span style="white-space: pre-wrap;">How one might store user events in Redis bitmaps with daily time bins to generate the retention report in Figure 1.</span></figcaption></figure><p> Figure 2 shows daily time bins and two distinct events, but the idea generalizes to any time bin (hourly, daily, weekly, monthly, etc.) and any number of distinct events.</p><p>Bitwise ANDs enable high-performance set intersections to compute counts of distinct users who did our start event on one date and our return event on another date. 
If you&apos;re unfamiliar with how a bitwise AND operation works, <a href="https://en.wikipedia.org/wiki/Bitwise_operation?ref=vikramoberoi.com#AND">Wikipedia has a good example</a>. Redis implements bitwise operations via its <a href="https://redis.io/commands/bitop/?ref=vikramoberoi.com">BITOP command</a>.</p><p>Bitmapist provides a handy Python API for clients to track events (<code>mark_event(&quot;opened app&quot;, 542)</code>) and transparently maintains hour/day/week/month-binned bitmaps in the background. It also provides a library to generate retention reports interactively:</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://raw.githubusercontent.com/Doist/bitmapist/master/static/cohort_screenshot.png" class="kg-image" alt loading="lazy"><figcaption><b><strong style="white-space: pre-wrap;">Figure 3. </strong></b><span style="white-space: pre-wrap;">A retention report generated by Bitmapist. Image from https://github.com/Doist/bitmapist.</span></figcaption></figure><p><strong>Using Redis bitmaps for retention analyses is fast but extremely space-inefficient.</strong></p><p>The <a href="https://redis.io/commands/bitop/?ref=vikramoberoi.com">Redis BITOP docs</a> link to <a href="https://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/?ref=vikramoberoi.com">this article from 2011</a> simulating these techniques with 128MM users. Some relevant metrics from that article:</p><ul>
<li><strong>Set counts and intersections over 128MM users are <em>fast</em>.</strong> Interactive retention analyses are easily achievable with this technique.
<ul>
<li>50ms to count all the bits set in one bitmap.</li>
<li>392ms to intersect 7 bitmaps (a week of daily bins).</li>
<li>1624ms to intersect 30 bitmaps (a month of daily bins).</li>
</ul>
</li>
<li><strong>A bitmap of 128MM users takes up 16MB of memory.</strong> It&apos;s nice to see this play out predictably. <a href="https://gist.github.com/voberoi/9fb7affa5e3d2ae3aaef7104aad8d37d?ref=vikramoberoi.com">You can confirm this easily on your own machine.</a>
<ul>
<li>Remember: Each bit is a distinct user.</li>
<li>1MB stores 8 million bits.</li>
<li>Multiply that by 16 and you get 16MB and 128 million bits.</li>
</ul>
</li>
</ul>
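<p>The memory arithmetic above is easy to check yourself (one bit per user, eight bits per byte):</p><pre><code>def bitmap_megabytes(num_users):
    # One bit per user ID; 8 bits per byte; 1,000,000 bytes per MB.
    return num_users / 8 / 1_000_000

bitmap_megabytes(8_000_000)     # 1.0 MB
bitmap_megabytes(128_000_000)   # 16.0 MB
</code></pre>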
<p>This technique is fast and compresses well, so what&apos;s the catch?</p><p>There are two:</p><ol>
<li><strong>Sparse bitmaps are space-inefficient in Redis.</strong> Redis will allocate as much memory as it needs to set the Nth bit. If your user&apos;s ID is 8,000,000, bitmapist will set that bit in Redis, which will allocate 1MB of RAM to set it. You can see this behavior in action <a href="https://gist.github.com/voberoi/9fb7affa5e3d2ae3aaef7104aad8d37d?ref=vikramoberoi.com">here</a>.</li>
<li><strong>This technique stores lots of bitmaps.</strong> Let&apos;s say you track 100 distinct events and you bin hourly (24 * 30 bins/month), daily (30 bins/month), and monthly (1 bin/month). Worst case, you&apos;ll have ~75,000 bitmaps by the end of the month. If each bitmap is 1MB, Redis will allocate 75GB RAM.</li>
</ol>
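<p>Both catches are easy to put numbers on. A rough model: Redis backs a bitmap with a plain byte string allocated out to the byte holding the highest set bit.</p><pre><code># Catch 1: a single bit at a high user ID still costs ~1 MB.
def redis_bitmap_bytes(highest_set_bit):
    # Redis allocates the underlying string up to the byte containing this bit.
    return highest_set_bit // 8 + 1

redis_bitmap_bytes(8_000_000)   # 1_000_001 bytes for one lonely bit

# Catch 2: 100 events binned hourly, daily, and monthly for one month.
bins_per_month = 24 * 30 + 30 + 1    # 751 bins
num_bitmaps = 100 * bins_per_month   # 75,100 bitmaps, worst case
</code></pre>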
<p>As of 2020, <a href="https://www.kieranflanagan.io/blog/how-todoist-went-from-a-side-project-to-millions-of-weekly-active-users?ref=vikramoberoi.com">Todoist had tens of millions of signups and over one million active users</a>. It&apos;s highly likely that Todoist sets bit indexes in the tens of millions every day while setting few bits at a lower range (0 - 1,000,000).</p><p>Luckily, there&apos;s an easy fix for this that relies on more efficient bitmap representations.</p><p><strong>Using </strong><a href="https://www.vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/"><strong>roaring bitmaps</strong></a><strong> fixes the space-inefficiencies introduced by big, sparse Redis bitmaps.</strong></p><p>In 2017, Doist built <a href="https://github.com/Doist/bitmapist-server?ref=vikramoberoi.com">a standalone bitmap server</a> to address the memory issues described above. It implements Redis&apos; wire protocol and the subset of Redis bitmap commands used by bitmapist so it&apos;s easy to swap in.</p><p>It is also <em>nearly three orders of magnitude more space-efficient</em> than using Redis bitmaps under a heavily used bitmapist setup:</p><blockquote>Memory in use reported by Redis (matches RSS of the redis-server process): 129.48G.<br><br>With the same dataset migrated to standalone bitmapist server under the same load: RSS reported at about 300M.</blockquote><p>It achieves this result by using roaring bitmaps, compressed bitmaps that still offer high-performance bitwise operations.</p><p>If you&apos;re curious about Roaring bitmaps, I read the papers and wrote <a href="https://www.vikramoberoi.com/a-primer-on-roaring-bitmaps-what-they-are-and-how-they-work/">a primer on what they are and how they work</a>.</p><p><strong>Interactive retention analyses over billions of events for under $100/mo?</strong></p><p>I haven&apos;t tested this out but it&apos;s well within the realm of possibility:</p><ul><li>In 2015, <a 
href="https://medium.com/hacking-and-gonzo/bitmapist-analytics-and-cohorts-for-redis-44be43458ef6?ref=vikramoberoi.com">Doist was storing hundreds of millions</a> of events using bitmapist.</li><li>In 2017, Doist&apos;s <a href="https://github.com/Doist/bitmapist-server?ref=vikramoberoi.com">roaring bitmap-based server</a> compressed at least hundreds of millions of events from ~130GB to ~300MB.</li><li>A factor of 10 increase in the above event volume is billions of events. Assuming memory requirements scale linearly, you&apos;ll need 3GB RAM.</li><li>Many AWS EC2 instance types will fit 3GB comfortably in physical RAM for under $100/mo.</li></ul><p>I may test this out with the <a href="https://www.gharchive.org/?ref=vikramoberoi.com">Github Archive</a> dataset (~3 billion events) one day. If you take a crack at testing these assumptions before I do, please let me know so I can link to your post.</p><p>And if you work at Doist, let us know what hardware you run this on!</p><p><strong>This technique nets you affordable and efficient distinct counts, set operations, and retention analyses at scale. But the benefits end there.</strong></p><p>Spending $100/mo for interactive retention analyses over billions of events is incredibly affordable.</p><p>With an event volume in the 10&apos;s to 100&apos;s of millions per month, you&apos;d be forking over $1000&apos;s/month to <a href="https://amplitude.com">Amplitude</a>, <a href="https://heapanalytics.com">Heap</a>, or <a href="https://mixpanel.com">Mixpanel</a> today. Amplitude&apos;s free tier ends at 10MM events/mo. At that point, their annual contracts begin at ~$30K/year.</p><p>But these companies also provide a useful suite of tools beyond retention analyses that the techniques underlying bitmapist do not support. 
If you need funnel or path analyses, for example, you&apos;ll need to pony up the cash &#x2013; this &quot;Swiss Army knife&quot; model of product analytics unfortunately commands a premium.</p><p><strong>Further reading</strong></p><ul><li><a href="https://medium.com/hacking-and-gonzo/bitmapist-analytics-and-cohorts-for-redis-44be43458ef6?ref=vikramoberoi.com">The blog post introducing bitmapist</a></li><li><a href="https://github.com/Doist/bitmapist?ref=vikramoberoi.com">Doist&apos;s bitmapist library on Github</a></li><li><a href="https://github.com/Doist/bitmapist-server?ref=vikramoberoi.com">Doist&apos;s roaring bitmap-backed bitmap server</a></li><li><a href="https://roaringbitmap.org/?ref=vikramoberoi.com">roaringbitmap.org</a></li><li><a href="https://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/?ref=vikramoberoi.com">An article from 2011 on real-time metrics using Redis bitmaps</a></li><li><a href="https://redis.io/docs/data-types/bitmaps/?ref=vikramoberoi.com">Redis bitmaps documentation</a></li></ul>
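<p><strong>Aside: a back-of-envelope sketch of the memory math</strong></p><p>To make the numbers above concrete, here is a rough sketch (in Python, since bitmapist is a Python library) of why one high user ID makes a flat Redis bitmap balloon, while a roaring-style chunked layout stays small. The container sizes are simplified assumptions based on the roaring bitmap design, not measurements of bitmapist-server.</p>

```python
import random
from collections import Counter

def redis_bitmap_bytes(max_bit: int) -> int:
    """Redis stores a bitmap as a flat byte array sized to the highest
    bit set, so a single high offset pays for every byte below it."""
    return max_bit // 8 + 1

def roaring_estimate_bytes(bits: list[int]) -> int:
    """Rough roaring-style estimate: the 32-bit space is split into
    2^16-bit chunks; a sparse chunk (<= 4096 bits) is stored as a sorted
    array of 16-bit values (2 bytes each), a dense chunk as a fixed
    8 KiB bitmap."""
    chunks = Counter(b >> 16 for b in bits)
    return sum(2 * n if n <= 4096 else 8192 for n in chunks.values())

# A daily-active-users key where one user ID is ~40 million:
flat = redis_bitmap_bytes(40_000_000)    # ~5 MB for a single key

# 1,000 active users scattered across a 50M-wide ID space:
random.seed(1)
users = random.sample(range(50_000_000), 1000)
compact = roaring_estimate_bytes(users)  # ~2 KB

print(f"flat: {flat:,} bytes vs. roaring-style: {compact:,} bytes")
```

<p>Real roaring bitmaps add run-length containers and other tricks on top of this, which is how a compressed server can plausibly get a heavily used bitmapist dataset from ~130GB down to ~300MB.</p>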
<!--kg-card-begin: html-->
<span style="font-size: 0.9em; font-style: italic">Thanks to Gaurav Oberoi, Chuck Groom, and Pierre Jambet.</span>
<!--kg-card-end: html-->
<hr><p>Questions? Comments? Feedback? Please send me a note!</p><p>Email me at <a href="mailto:voberoi@gmail.com" rel="noreferrer">voberoi@gmail.com</a> or holler at me on <a href="https://x.com/voberoi?ref=vikramoberoi.com" rel="noreferrer">Twitter</a> or <a href="https://www.threads.net/@vikramo?ref=vikramoberoi.com" rel="noreferrer">Threads</a>.</p>]]></content:encoded></item></channel></rss>