
Small Data 2024 Review

Small Data SF 2024 conference homepage

This week I went to the Small Data conference. It was really interesting; here are some rough notes.

Big Data is Not a Number

Jordan Tigani, MotherDuck
Runs over the history of cloud systems development: horizontal scale instead of vertical scale, tools to manage emergent complexity, frameworks to manage complex tools. Big data taxes: there’s a minimum entry point of complexity. Separation of storage and compute, neat, but storage grows faster than compute. That’s the seed of the idea that maybe big data isn’t right. As a Google engineer, Jordan would run a petabyte scale query against BigQuery and get the applause, yay, but value and volume are inversely proportional. Interest in NoSQL is flattening out. Mongo still kicking but Postgres is winning, and “NewSQL” like cockroach or aurora or duckdb is doing just fine. Big Data doing fine in OLAP but Duck’s coming up. “I didn’t tell you back then… that petabyte Google query cost $5,580 to run”. We could make big data work, but it was too expensive to run outside of extraordinary circumstance. The key driver of cost is the size of hot data. Working set size is what matters, not total set size. Working set is rarely big data. So, revenge of the single node, return of vertical scale. Single node doesn’t pay the Big Data Tax of complexity and coordination.

  • Data size isn’t as big as we thought
  • Parallel execution isn’t as cheap as we thought
    You can run a terabyte warehouse on your laptop today. The threshold of “what is big data” is going up all the time, because the single node keeps getting bigger and better.
  • If latency is important, do the work close to the source.
  • If cost is important, do the work where it’s cheapest.
  • Simplicity is better than scalability. Premature optimization is root of all bugs, premature scale is root of complexity. Simpler is faster.
  • You don’t have to be on the cloud.
    Databricks vs Snowflake had a disagreement in 2021 over 100TB dataset performance (this reminds me of the late 90’s when Oracle and Sybase would put competing performance claims on billboards along the 101). Jordan’s experience at Google was that no one actually needed that scale… again, working set vs total set. The blog post led to excitement, which led to this conference. “How many people in this room are startup founders?” Looked like three-quarters of the maybe 150 people in the room; conference registrations were capped at 250. (A tiny sketch of that single-node workflow follows this list.)
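Here is that sketch: laptop-scale analytics with DuckDB against a local Parquet file. The path and column names are hypothetical, not from the talk.

```python
# Minimal single-node analytics sketch: DuckDB querying a local Parquet file.
# The file path and columns are made up; the point is that a laptop-sized
# "working set" needs no cluster at all.
import duckdb

con = duckdb.connect()  # in-memory database; no server, no coordination overhead

top_accounts = con.execute("""
    SELECT user_id,
           count(*)    AS events,
           sum(amount) AS total_spend
    FROM read_parquet('data/events.parquet')   -- hypothetical file
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY user_id
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchdf()

print(top_accounts)
```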

Retooling for a Smaller Data Era

Wes McKinney, Posit
Started Python pandas back in the day, Apache Arrow, Python Ibis project. Recently launched Positron IDE for Data Science. Data size is relative. What used to be Big Data is now phone-sized or laptop-sized. In 2004, when the MapReduce paper was written, a typical Xeon server had one core. Now a typical server has 96 cores. Clock speeds haven’t gone up in 20 years, but core counts and parallel capabilities have. Exponential network performance increase too, 3 orders of magnitude from 2008 to 2024.
In last 20 years, two parallel worlds have developed: Big Data Ergonomics and Small Data Ergonomics. Dynamite mining versus watch making. 2015 Microsoft paper “Scalability! But at What Cost” talks about the Big Data tax. But if you look at the queries being used, it’s the same stuff. Why this dichotomy in scale / efficiency / ergonomics? Because if scalability is your top need, then performance and efficiency and composability are the secondary needs.
What is a Composable Data System? One that puts composability at the top of the priority stack. Modularity, interoperability. Resist vertical integration and hyper-specialize in local problems (as in the semiconductor industry). NYC R Conference 2015 talk prediction. 2017 conference asking why Python and R and Java have different stacks for the same thing. And so, Apache Arrow in 2016: a cross-language, fast, in-memory format. DuckDB is a cutting-edge analytical columnar data engine, but embeddable. You can compile it down to WASM and stuff it into your browser; you don’t need to serialize/deserialize to exchange with long-term storage.
SQL was supposed to be a standard, but dialects are non-portable and engine choice is non-trivial. The same problem is coming up in Python around data frames (pandas, Polars, Spark data frames, Ibis).
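As a small illustration of what that composability buys (my example, not one from the talk), an Arrow table built in Python can be scanned directly by DuckDB with no serialize/deserialize step; the table contents are made up.

```python
import duckdb
import pyarrow as pa

# An in-memory Arrow table built in Python...
orders = pa.table({
    "region": ["na", "emea", "na", "apac"],
    "revenue": [120.0, 80.5, 43.0, 99.9],
})

# ...queried in place by DuckDB (replacement scan finds the local variable),
# with no export or copy step between the two systems.
summary = duckdb.query("""
    SELECT region, sum(revenue) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""").to_df()

print(summary)
```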

Evolving DAG for the LLM world

Julia Schottenstein, Langchain
What makes a non-deterministic LLM interesting is that if you can tell it what was wrong with its prior attempts, it can get better. Diagram from the CodiumAI Flow paper… that’s what “Agent” means to Langchain. But moving from pre-planned flows to directed-cyclic-graph agentic flows makes reliability, performance, and cost very unpredictable. Planning and Reflection steps are more conducive to success than winging it, so you need to bake those steps in. Memory is a challenge (user communicates to supervisor which hits agents, that’s a state transmission problem where you need persistent shared memory). Task ambiguity and misuse (not just malicious use, but the supervisor making poor agent choices) are problems, and of course basic non-determinism in the guts.
So Langchain is trying to solve all that: use a graph to balance agent control with agency. Controllability, persistence, human-in-the-loop RAG guidance, Streaming for visibility into events and latency reduction. Demo of a router node with some agents and a response collator: router does plan, collator does reflection. Left pane gives graph viz, right pane gives the intermediate Q&A between nodes and things. They call it Directed Cyclic Graph. DAG -> DCG.
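Here is my own toy rendering of that plan/act/reflect cycle, deliberately not using LangGraph’s API; the `llm` callable and the acceptance check are stand-ins.

```python
# Conceptual sketch of a plan -> act -> reflect cycle: the graph is allowed to
# loop back until a reflection step accepts the result or a budget runs out.
# `llm` is a hypothetical callable that takes a prompt and returns text.
def run_agent(task, llm, max_iterations=5):
    attempt, feedback = None, None
    for _ in range(max_iterations):
        plan = llm(f"Plan how to solve: {task}\nPrior feedback: {feedback}")
        attempt = llm(f"Carry out this plan: {plan}")
        # Reflection node: critique the attempt and decide whether to loop again.
        feedback = llm(f"Critique this attempt at '{task}': {attempt}")
        if "looks good" in feedback.lower():  # naive acceptance check
            break                              # exit the cycle
    return attempt
```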

Give every user their own database!

Glauber Costa, Turso founder/CEO
Go multi-tenant to the user level with MotherDuck and Turso. What does a database do? OLAP (Online Analytical Processing) or OLTP (Online Transaction Processing). OLAP is what DuckDB is good at. Long running, lots of data, not sensitive to latency, very sensitive to throughput. Think BI (Business Intelligence). OLTP wants short queries, very latency sensitive and throughput sensitive. Think APIs. Read-intensive, maybe not as good at writing. A modern OLTP wants compliance with privacy regulations. PHI, PII, right-to-forget, sovereignty. Even if you don’t have legal requirements, why keep potentially dangerous stuff lying around? More isolated equals less ability for massive theft. Per-user data is really easy to isolate, encrypt, delete, or deliver (can hand over to the user or LEO). Also, developer velocity! What if your developers were not terrified of the blast radius from breaking everything? What if you could recover a few affected users from individual backups instead of having to restore the entire world to correct a mistake? Replication means you can deploy up-to-date cache into the places it needs to be, and isolation means you don’t have to be perfect. Poor performance is okay when you’re lightly using it and everything’s local. No N+1 query problems. So pick a boundary that shards naturally, like a user’s account or a customer’s account or an application page. Or a time range: use a database file per day (and then you can manage hot storage vs cold storage under them). No more concern about row-level security or locking because the volumes are lower; you don’t need a dedicated cache layer because you’ve taken performance-related slowness out of the system.
Why not do it? Data distribution drives cost. Giving each user their own VM or container is expensive. And the reality of customer distributions is that you’ve got a few whales and a long tail following the power-law distribution, and they’re uneven over time. E.g., you might have a customer that usually has very low use but pounds the hell out of their data for a week per quarter. So you’ve got a boundary at which some users are using the system too little to justify the cost of continual uptime. Enter sqlite, the simple filesystem database with lots of control over resource use (cache-size and heap-limit pragmas, progress handler). It can handle up to 200TB, 2k writes per second, or it can be tiny and cheap. It doesn’t have to be running all the time. Turso started without multi-tenancy, followed their customers there. Idle databases cost zero compute, minimal storage.
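A minimal sketch of the per-user-database idea in plain Python and SQLite, using the resource knobs mentioned in the talk; the paths, pragma values, and schema are illustrative, not Turso’s implementation.

```python
import sqlite3
from pathlib import Path

def open_user_db(user_id: str) -> sqlite3.Connection:
    """One SQLite file per tenant: the shard boundary is the account itself."""
    path = Path("tenants") / f"{user_id}.db"
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA cache_size = -2000")          # cap page cache at ~2 MB
    conn.execute("PRAGMA soft_heap_limit = 8000000")   # aim for ~8 MB of heap
    # The progress handler fires every ~100k VM steps; returning nonzero from it
    # would interrupt a runaway query. Here it always continues.
    conn.set_progress_handler(lambda: 0, 100_000)
    conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
    return conn

db = open_user_db("customer_42")
db.execute("INSERT INTO notes (body) VALUES (?)", ("hello, tiny database",))
db.commit()
```

Deleting, exporting, or handing over one tenant’s data is then just a file operation, which is most of the compliance argument above.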

Build bigger with small AI: running small models locally

Jeff Morgan, Ollama founder
Formerly at Docker; realized Docker Desktop was on 20m desktops and that local ergonomics was a major driver of its success. So, how to make local ergonomics a thing for LLM developers? Small models are small: fewer parameters, fewer gigs, can run on phones and laptops. Not commercially gated, might be OSS, yay. There’s a misconception that the small model is just a worse version of the big commercial one, but that’s not true:

  • better performance because it’s small and local instead of gigantic and remote. Higher throughput = instant response, no round-trip, low latency.
  • you control your data. Paranoia, sure, but there’s also the fact that you don’t have to ask for permission or check for compliance.
  • using modern hardware you already own is cheaper than leasing someone else’s computer.
  • no switching cost, no integration cost, no big data complexity tax
    Small data benefits are 1:1 for small AI models. There are already thousands of models; check out Llama, Gemma, Phi. Lots of versatility for customization. They don’t really shine as ChatGPT alternatives; they’re stronger at fit-for-purpose use. Also really good for use with RAG, because you’re controlling what data it gets, how the vectors align, and where they’re stored. Small local models are therefore really good for making agents. The idea is to enable models to run code and call data: like, what if the RAG stuff is in a database instead of in memory, and you enable Ollama with langchain to look in DuckDB for contextual help before it answers? In the demo, the model constructs SQL queries and executes them (a rough sketch of that pattern follows these notes). Advice: instead of yet-another-customer-facing chatbot, maybe check out internal administration tasks in product operations, finance, legal, and human resources for low-hanging fruit.
    I installed Ollama on my four-year-old M1 MacBook Air. I asked it how many chucks can a woodchuck chuck. It slowly gave a pretty cromulent answer and then locked up and needed a hard restart.
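For reference, the DuckDB-backed demo had roughly this shape. It assumes the `ollama` Python package and a locally pulled model; the model name, table schema, and prompt are mine, not the speaker’s.

```python
import duckdb
import ollama  # assumes the ollama Python package and a local model already pulled

con = duckdb.connect()
con.execute("CREATE TABLE tickets (id INTEGER, status TEXT, opened DATE)")
con.execute("INSERT INTO tickets VALUES (1, 'open', '2024-09-01'), (2, 'closed', '2024-09-02')")

question = "How many tickets are still open?"
prompt = (
    "You can query a DuckDB table tickets(id INTEGER, status TEXT, opened DATE). "
    f"Write a single SQL query that answers: {question}. Return only the SQL."
)

resp = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
sql = resp["message"]["content"].strip().strip("`")  # naive cleanup of the model's reply

print(sql)
print(con.execute(sql).fetchall())  # in real code, validate model-written SQL before running it
```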

Squeezing maximum ROI out of small data

Lindsay Murphy, Women Lead Data (and Hiive)
I lost some notes from this talk because Ollama. Constraints make teams successful; big data has grown out of abundance thinking and led to massive complexity. Constraint of resource makes you focus on the successful parts of the success quadrant. Ignoring constraints stifles innovation. Reviews stakeholder quadrant and iron triangle. Less than fifty percent of this room are formally trained engineers. Post-ZIRP (Zero Interest Rate Policy), maybe you can’t rely on the services you have built on staying extant. You might have gotten hooked on a subsidized pricing model that is going to go away. The value that we deliver and the cost of delivering it are what matter, how do you present that to your stakeholders.

BI’s Big Lie

Benn Stancil, founder of Mode (acquired by ThoughtSpot) -> independent thot leeder and Harris-Walz campaign.
Talks about giving BI demos. Find the problem, stuff your thumb in it, find the solution in the customer’s data, show it to them. Quotes Periscope Data’s pre-acquisition About Us page: “excited about that weird blip in the data that gives us insight”. It’s analyst as Sherlock Holmes, finding needles in haystacks. “Drill, baby, drill” through the data.

  • This story sells. You can find it in Power BI and Tableau’s materials too. There’s insight in your data if you can just drill it right.
  • But it doesn’t really work. Dashboards? Trashboards. Nobody uses the tool in production after it’s made.
    Brittany Davis, in The Revolving Door of BI, claims “companies swap out their tools every 2-3 years because the existing one doesn’t live up to the pitch.” Repeated attempts over 20-30 years: why is the product not working, and why doesn’t the sales pitch quit working? NYT 2012 article about Target as a big data company doing prophecy from the data (later partially debunked), the Moneyball book and movie, Nate Silver, Chamath Palihapitiya, Data Science articles, data is the new oil, everybody’s excited about extracting new value from refinement of old data. Big data analysis gives you really smooth and believable charts that real-world use cases can’t match. When you’re smaller than Target or Facebook, maybe you don’t get incredibly accurate predictions; you get vibes.
  • BI tools might oughta focus on recipes for interpretation instead of exploration and drill
  • Do things that don’t scale. Look at the data, use Excel, don’t go to models when law of small numbers is going to win. Look at the value which is probably in small data.
  • If you want to go do big data, consider doing it where it’s actually necessary and valuable

Know thy customer: why TPC is not enough

Gaurav Saxena, AWS Redshift principal engineer
BA use cases have evolved over time, and the new use cases are not adequately represented by TPC benchmarks, though Redshift does handle them. In the 80’s, you’d go to your DBA. In the 90’s, you get Business Objects, Cognos; BAs can generate their own SQL statements and iteratively explore. DBAs become data engineers curating datasets for analysis. TPC-D (1994), TPC-H (1999), TPC-DS (2012). Lots of details, but he’s saying that ML workloads are different. I don’t have strong opinions about TPC. TPC Benchmarks Overview. There’s a Redset dataset about how Redshift is used in real life; that’ll be very interesting.

Paddling in circles: Return of Edge computing

Richard Wesley, DuckDB Timelord
Background in Tableau, did the temporal data analysis functions. Potted history of computing up to the 2020s… Distributed hardware is powerful now. IoT stuff (two brands of smart meters that have app stores!), laptops, processing-in-memory. The software side for centrally managed remote stuff has exploded as well.
While at Tableau, did the database connectivity layer, huge win, but… slow, often because the database was heavily loaded. So maybe they could make views of what the user wanted? But still performance limited, and permission limited… so what if they pulled the view down to the local machine? (Looks over at Microsoft Access). Oh hey, now you can take your data with you, that’s a big deal! But, not everyone had an Office Pro license, and some people didn’t use Windows. What about Firebird? Looked cool, but didn’t scale well, and wow, columnar databases are neat! So they tried MonetDB. 200x faster! Jumped in with both feet, but no one else in the company could maintain it…. On to HyPer Database from TUM, and they acquihired a team to maintain it. Still really hard to develop and use features though, this stinks. So on to DuckDB. Memory hierarchy: tuple at a time, column at a time, vector at a time. Going to disk when working set is too big for memory is another gigantic challenge to solve. Reference to academic papers used recently (too fast for me to follow). They also use the community for plugin development.
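As a loose analogy for the tuple-at-a-time versus vector-at-a-time point (this is not DuckDB code, just an illustration of why batching matters), compare row-wise and batched processing of a single column:

```python
# Toy contrast between tuple-at-a-time and vector-at-a-time processing.
import numpy as np

prices = np.random.rand(10_000_000)

# Tuple at a time: one interpreter step and one branch per row.
def total_over_threshold_rowwise(values, threshold=0.5):
    total = 0.0
    for v in values:
        if v > threshold:
            total += v
    return total

# Vector at a time: the same logic applied to cache-friendly batches.
def total_over_threshold_vectorized(values, threshold=0.5, batch=2048):
    total = 0.0
    for start in range(0, len(values), batch):
        chunk = values[start:start + batch]
        total += chunk[chunk > threshold].sum()
    return total
```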

Pysheets, Spreadsheet UI for Python

Chris Laffra, PySheets Spreadsheet UI for Python
Why? The tech stack for data science is Python, Jupyter notebooks, R. But business people don’t do that stuff; they do spreadsheets and documents. Notebooks are one-dimensional, a linear path of thought. Spreadsheets are multi-dimensional graphs; it’s a DAG.
Why not just use Python in Excel, though? Because Microsoft charges you 25 bucks over your O365 license for it, and it’s actually running in Azure Cloud, not in Excel. How about we run everything in the browser instead? Two WASM VMs: the UX on MicroPython and pandas on CPython. Copy some Google Sheets stuff and paste it in. Ask for some matplotlib charts, ChatGPT makes Python, and off it goes. Demo broke the first time, OpenAI lol, restarted and it worked. Wild stuff.
Started this ten months ago on Firebase and GAE, but a customer said no to Google… so he ported it to Mongo on Digital Ocean, but they wanted real SQL, so he went to MySQL… and then thought wait a minute, sqlite and WASM, make it ALL LOCAL. Bank customer is happy, but wants some AI. He thinks Ollama, tries it out, blew up his laptop too, lol. That’s still a glorious future option, but OpenAI is what works today. Tried Gemini, not working yet. “This is what Jupyter Notebooks should have been”. Fintech and banking users use graphs and spreadsheets and this is the bomb for them. Rad stuff.

Small data by the numbers: Fireside chat

George Fraser, Fivetran CEO, and Jordan Tigani
Small data is the most important trend. Fivetran is ten years old, does data replication, and has been observing that the data is typically fairly small. Even though customers are huge and they do ask for ginormous datasets to be replicated, most big datasets are actually a symptom of poorly designed datasets. If you only do deltas to get the important stuff, you get much smaller datasets. Reference to the Redset dataset; there’s something similar for Snowflake too (called Snowset). George’s analysis of those datasets (with DuckDB!) showed very consistent results between them. The average query is astonishingly small; the center of the distribution is about 64MB. 30% of workload is ingest processing. “We’d seen signs of that in Fivetran but I just didn’t believe it”. Fastest growing (0 to 1.5% of revenue over six months) destination is data lakes, into Iceberg format or Delta. Fivetran does the computation for those, so they built a special purpose ingest pipeline for it and were able to reduce the price for that workload. Source -> Sink alignment FTW. Big Data to Small Data echoes the Mainframe to PC revolution. Open table formats will seriously open things up. Small queries from laptop execution to open datasets. Catalog support is what’s changing to make it more user-friendly; platforms are getting better every week. CPU and network capability are growing faster than dataset size (especially the useful part). “My Redset and Snowset analysis took 7 seconds to scan the combined dataset with DuckDB on a Mac Studio”. The data lake transform service is DuckDB powered. They had all that SQL code that already existed; DuckDB let them use it against Parquet files without starting everything over.

Where Data Science meets Shrek: Buzzfeed AI use

Gilad Lotan, BuzzFeed data team lead
Using a lot of LLM and image gen stuff now, prepping datasets with editorial view and constraints. Not replacing people. BuzzFeed makes content, posts on lots of platforms. They like to drive interactions: polls, surveys, shopping redirects to Amazon and Shopify. Big newsroom at HuffPost. Tagging and categorizing content is difficult: sports and politics is one thing, but weird news is ya-know weird. The old way: TF-IDF, Universal Sentence Encoders, DistilBERT. New stuff: Nomic-embed-text-v1 is a “small” embedding model that has a context length of 8192 and 124M parameters. Compressed knowledge that maintains semantic connectivity; it’s the vectors-inside-a-neural-net approach. Title and contextual blurbs of an article go in and out. For images, editors played around with Midjourney; the 2023 “Barbie’s Dream House in Every State” viral article got picked up on Instagram and TikTok, a clear signal. So they did it again with their own Stable Diffusion XL (SDXL) based model and posted a “turn celebrity into Shrek” image generator tool. SDXL uses LoRA to keep most parameters unchanged and only fine-tune the impactful ones. Easy to overfit, though, with a small number of input images. A Bayesian A/B headline and image testing system has been running for years, so there’s lots of data… trained through HuggingFace transformers and a Claude 3.5 Sonnet model to tournament down to a winner. Historical data that’s unique to us, a broader set of options than we would have produced otherwise, gives editors more breadth to work with. Most companies are making subpar, static versions of what they had before. “Reminds me of when TV came out and broadcasters would play single-camera recordings of plays.”
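BuzzFeed’s testing system is proprietary, but the Bayesian A/B idea underneath it is standard; here is a minimal Beta-Bernoulli Thompson-sampling sketch with made-up headlines and click-through rates.

```python
# Generic Thompson-sampling sketch of Bayesian headline testing (not BuzzFeed code).
import random

# Each arm keeps Beta(a, b) counts: a-1 clicks seen, b-1 non-clicks seen.
headlines = {"Headline A": [1, 1], "Headline B": [1, 1], "Headline C": [1, 1]}

def pick_headline():
    # Sample a plausible click-through rate for each arm, serve the best draw.
    draws = {h: random.betavariate(a, b) for h, (a, b) in headlines.items()}
    return max(draws, key=draws.get)

def record_result(headline, clicked):
    a, b = headlines[headline]
    headlines[headline] = [a + clicked, b + (1 - clicked)]

# Simulated traffic: headline B secretly has the best click-through rate.
true_ctr = {"Headline A": 0.02, "Headline B": 0.05, "Headline C": 0.03}
for _ in range(5000):
    h = pick_headline()
    record_result(h, int(random.random() < true_ctr[h]))

print(headlines)  # B should accumulate the most impressions and clicks
```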

Building large apps with tiny databases

Søren Bramer Schmidt, Prisma CEO
Will AI wipe out software developers? No. The thesis is that there is huge demand for SWEs. Analogous to Photoshop: photo editing was possible before, but now anyone can do it and professionals still exist (in fact there are way more people doing image manipulation). Linear, Apple Notes, Figma: three apps where data is local and the vendor does synchronization and backup. They’re doing Local First Architecture.

  • Cloud first: Device talks to a service (maybe close by) which stores data (probably in us-east-1). Then you add a bunch of complexity to make it suck less.
  • Local first: fast applications, always available (even when us-east is down), good for collaboration and privacy. Also good for developers: simpler architecture, the distributed system problems are more contained, it’s easier to develop and test and debug. No weird caching problems.
    Prisma ORM’s goal is the best database developer experience. “Interesting mix of data science people and developers here”. Three workflows: create, query, modify. Screenshot of a ticket asking for React Native in early 2019. Easy, declarative components. Screenshot of code to load some data, map it and loop, delete the transaction onLongPress. No state management. Demo’d at the App.js conference; cuts typical data interface app code in half.
    The missing piece though is a seamless data sync layer. Prisma roadmap.
  • Change propagation service: from changer to storage and other concerned clients.
  • Data types that know how to merge (CRDTs). Rich text editing as an example. But they’re tough to work with. Do it with an authoritative server instead:
  • Local change reports to server
  • Server calculates a diff
  • Server distributes diff to all clients
  • Clients rebase if needed (e.g., two clients make the same change and one of them has to roll back).
  • Data leak prevention: make sure all concerned clients are legit.
  • Notion as example. Documents are RBAC controlled. Can’t share documents to users who don’t have rights to them. Want an easy user experience and developer experience, so model needs to be simple to understand. Two ways to get there:
    • Query shapes (like ElectricSQL)
    • Partition data on access (more like Turso). You either have full access to a document or no access. Apply this model to lots of apps, Slack channels, Figma design docs, Prisma queries.
      The neat thing is that this model fits AI models better. Slack had a prompt injection leak last month (see Dark Reading) because data and its RBAC boundary were at the workspace level instead of the channel level.
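To make the change-propagation flow in the list above concrete, here is a conceptual sketch (not Prisma’s actual protocol): clients write locally first, report changes, the server computes and broadcasts a diff, and everyone converges on the server’s state.

```python
class Server:
    """Authoritative copy: computes diffs and broadcasts them to all clients."""
    def __init__(self):
        self.doc, self.version, self.clients = {}, 0, []

    def apply(self, changes):
        diff = {k: v for k, v in changes.items() if self.doc.get(k) != v}
        if not diff:
            return self.version
        self.doc.update(diff)
        self.version += 1
        for client in self.clients:          # distribute the diff to every client
            client.receive(diff, self.version)
        return self.version

class Client:
    """Writes locally first (fast UI), then reports the change to the server."""
    def __init__(self, server):
        self.server, self.doc, self.version = server, {}, 0
        server.clients.append(self)

    def edit(self, key, value):
        self.doc[key] = value                # optimistic local write
        self.server.apply({key: value})

    def receive(self, diff, version):
        self.doc.update(diff)                # rebase: server wins on conflicts
        self.version = version

srv = Server()
alice, bob = Client(srv), Client(srv)
alice.edit("title", "Small Data notes")
bob.edit("title", "Small Data notes, edited")
print(alice.doc == bob.doc == srv.doc)       # True: everyone converged
```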

Enhancing scalability and usability of visualization toolkits for data understanding

Junran Yang, UofW grad student (research gallery)
To make a good visualization, you need to understand Task, Data Type, Domain, and Data Scale. Perceptual scalability breaks down, so you summarize.
When you have a lot of data and calculation to do, scalable interactivity is hard to achieve with server-side round-trips. Heer and Moritz Mosaic demo, UIST ’23.
Strategies for optimization are design dependent. Client-side indexing or data cubes, pre-fetching, brushing and linking as a way to zoom and pan, detail-on-demand after interactions (zoom and pan).

  • Perceptual and interactive scalability are limited by resolution, not volume
  • Interface constrains query scope, not the other way around
  • Users interact in predictable ways, so give them rails they’ll want to follow.
    Visualization pattern recommendations: Voyager, Tableau (ShowMe), Foresight. Looks at the data and recommends a pattern that should work for it. Can also offer a gallery of examples and documentation, quicker comparison between designs, performance profiling of visualization types. Supporting scalable interactive data exploration is a human factors problem, work with users.
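The “resolution, not volume” point is easy to demonstrate: millions of raw points collapse into a screen-resolution summary before anything is drawn, and interactions like zooming just re-run the aggregation over a smaller range. A sketch with synthetic data:

```python
# Ten million points reduced to a screen-resolution 2D histogram before plotting.
# Bin counts and the synthetic data are arbitrary; the idea is pre-aggregation.
import numpy as np

x = np.random.randn(10_000_000)
y = 0.5 * x + np.random.randn(10_000_000)

# ~800x400 bins is roughly what a chart area can display anyway.
density, xedges, yedges = np.histogram2d(x, y, bins=(800, 400))

# The client only ever needs the (800, 400) summary, not the raw rows;
# zooming or panning re-runs the aggregation over the selected range.
print(density.shape, int(density.sum()))
```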

Data minimalism: business value for the 99%

Panel: Josh Wills, James Winegar, Celina Wong, Jake Thomas, Ravit Jain (moderator)
What is small data?

  • Large workloads sharded into digestible chunks. Compliance and safety.
  • Pendulum swing back from cloud-as-mainframe. Laptops are good now. When you’ve got 256 H100s, every one of them is precious. Not horizontally scaled any more.
  • Hot working sets, the data that actually drives business value. The stuff that’s driving decisions, not the stuff that’s stored for later.
How do you identify the data that matters?
  • Right-sized data is about strategic planning. Celina: “I’ve been head of data three times, it’s cool to play with tools. But: do you know how you’re making money and what you’re spending? The data that answers P&L is the data that matters. You don’t get a gold star for developing things that don’t matter.”
Data quality over quantity?
  • Recent is more valuable, so don’t look at old stuff unless you need it. Define requirements carefully.
  • Tabular pitch (separate storage and compute, use compute scaled to the task at hand).
  • Siloed data makes it impossible to reason about quality and quantity because you’ve got what you can access in the silos you can reach. Example: tracking the outcomes of ads for MMM (Media Mix Modeling); P&G had years of big data stored, but the approach wasn’t necessary to get the job done. “How do I use smaller sets to just do the work instead of using gigantic sets to get worse outcomes?” Everyone’s dancing around human legibility, seems to me.
  • Big data used to seem like the only option for working with lots of small data sources. Tools like duckdb now make it much more feasible to just work with lots of small data sources. Better interoperability between pluggable systems, easier to bring compute to where the data is.
  • Worth noting that the horizontal elasticity and reliability of big data clouds are valuable, and losing a node from a small data setup is a significant problem.
  • But engineering complexity is a cost thing too. Big batch jobs were easy to figure out but expensive to run, incremental stream execution is hard to figure out but cheaper to use.
  • Slack anecdote: “we joked we were able to predict our AWS costs by how many engineers we hired, forget clients”
  • Celina suggests a Marie Kondo move, “does this data project spark joy”. Josh: “Nobody in data uses Spark and joy together in a sentence”

Wrap-up thoughts

I was the only observability or security person there, as far as I could tell. I had to explain what Observability was about to developers a few times, but I also met a goodly number of more experienced folks who knew about big data and big systems. A lot of attendees were intrigued by the possibility of the pendulum swinging back.
Back in the dotcom boom days, managed service providers (MSPs) and application service providers (ASPs) were the equivalent of today’s IaaS and SaaS vendors. When the crash hit in the early 2000s, customers pulled back to self-hosted colocation and self-written code on frameworks because owning and depreciating was cheaper and more predictable than renting. Economic recovery reversed that pattern really quickly, and now few people remember it, but it sticks for me because I worked at an MSP that supported ASPs and our business disappeared overnight. I’ve been idly curious to see if the pattern would repeat, but the last few economic shocks haven’t really touched the world of IT. That seems to be changing now, with two years of record-setting layoffs rocking the industry and interest-rate changes back on the table. I found the technical reasoning for small data compelling; I’m withholding judgement on the soundness of its business logic for now.

