Can AI extract data from a messy spreadsheet?

May 2, 2024

Slice & Dice

👋 We’re sharing how we built Slice & Dice, an experiment to use AI to extract data from all kinds of messy spreadsheets.

TL;DR: When AI succeeds, it feels like magic. But unfortunately agents are slow and only work occasionally. There are more strategies worth exploring, but maybe we need to wait for GPT-5?

View our demo | Code on Github


Can any lab exist without spreadsheets? Spreadsheets are the backbone of many labs — they act as a database, a data pipeline, and an analytics tool for all kinds of scientific data. But extracting insights — even with computers — has always been a hard task.

For software to analyze spreadsheets, the data needs to be in a format that software can easily read. The data needs to be tabular, or organized into tables of rows and columns. Each row needs to represent a single record, and each column needs to represent a specific attribute of the record. The first row usually also contains descriptive column headers.

A simple example of a well-formatted table of data
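For concreteness, here is a minimal sketch (with made-up sample values) of what that kind of software-friendly table looks like, and how trivially a library like pandas can read it:

```python
import pandas as pd
from io import StringIO

# A hypothetical, well-formatted table: one header row, one record per row,
# one attribute per column. Software like pandas can read it in a single call.
csv_text = """sample_id,timepoint_h,od600
S1,0,0.05
S1,4,0.62
S2,0,0.04
S2,4,0.71
"""

df = pd.read_csv(StringIO(csv_text))
print(df.dtypes)  # each column gets a single, consistent type
print(df.head())
```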

Unfortunately, spreadsheets found in biology can often contain multiple tables of data, sheets with multiple sections, data laid out visually instead of logically, empty rows or cells, comments, and other pieces of information. Here’s a more realistic example from a Tecan Spark microplate reader:

A real example generated by a Tecan Spark microplate reader. These types of spreadsheets are very difficult for software to analyze.

These kinds of spreadsheets usually prevent most software from properly extracting and analyzing data, because the software expects strict patterns and regularity. For us to build better tools for biology, we want a reliable, automated, high-throughput way of extracting data from any spreadsheet.

In our last experiment, we released platechain, a tool that uses LLMs to parse and extract data from spreadsheets generated by microplate readers. While platechain works great for microplate data, it has several limitations, like only working on standard plate sizes and rectangular plate layouts.

We wanted to see if recent advances in AI would allow us to extract data from arbitrary kinds of messy, non-tabular spreadsheets.

Starting with prompt engineering

The first thing we did was see if OpenAI’s Data Analyst GPT (sometimes called “Code Interpreter”) could help us understand and extract the data. Data Analyst works by writing code in the background to attempt to analyze a file.

Using the Tecan Spark Excel sheet as an example, we uploaded the file to ChatGPT and asked it to “explain the data” — and it was able to generate a full explanation of the data, including what generated the data. When we asked it to isolate the measurement data, it generated the data table below (see the full chat here).

Data Analyst was able to identify the type of data and what’s in the spreadsheet, and it was even able to extract the measurement data and save it into a separate CSV. We were able to do this without any special prompt engineering! This encouraged us to explore building a higher-throughput, messy-spreadsheet extraction system that we could use to process thousands of spreadsheets across thousands of labs.

Prompt engineering on ChatGPT Data Analyst works incredibly well. It doesn’t hallucinate much, because it creates summaries based on the code it writes and runs, and after a few chats, it can generally get the right information. Next, we wanted to see if we could get rid of the chat portion completely.

Building a web application

We built Slice & Dice, a web application, as a platform to test different extraction strategies

We built Slice & Dice as a demo website to test different spreadsheet extraction strategies. It’s a simple website that converts user-uploaded Excel and CSV files to a format easily understood by language models. We wanted to see if current AI models could not only get the right answers from spreadsheets, but do it in a way that’s both fast and fun for the user.
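As a rough sketch of that conversion step (the helper below is illustrative, not necessarily how the repo implements it), each sheet can be flattened to plain CSV text before it’s handed to a model:

```python
import pandas as pd

def spreadsheet_to_text(path: str, max_rows: int = 200) -> str:
    """Flatten every sheet of an Excel/CSV upload into plain CSV text
    that can be pasted into a language-model prompt."""
    if path.lower().endswith((".xlsx", ".xls")):
        sheets = pd.read_excel(path, sheet_name=None, header=None)
    else:
        sheets = {"Sheet1": pd.read_csv(path, header=None)}

    chunks = []
    for name, df in sheets.items():
        chunks.append(f"### Sheet: {name}")
        # header=None keeps merged titles, comments, and blank rows visible,
        # which is exactly the messiness we want the model to see
        chunks.append(df.head(max_rows).to_csv(index=False, header=False))
    return "\n".join(chunks)
```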

Technical side note: We built Slice & Dice as a fully open source project, separate from our main product. We think of spreadsheet parsing as a community effort, and something everyone will benefit from. Under the hood, we deploy our AI-based services as separate applications and serverless API endpoints that our main product at sphinxbio.com can use.
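As an illustration of that separation (the framework and route name here are our choice for this sketch, not necessarily what the repo uses), the extraction service can be a small standalone HTTP endpoint that the main product calls:

```python
from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile):
    """Accept a spreadsheet upload and hand it to the extraction pipeline."""
    contents = await file.read()
    # ... convert `contents` to text and run the extraction chain here ...
    return {"filename": file.filename, "bytes": len(contents)}
```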

Making Data Analyst High Throughput

While ChatGPT’s Data Analyst is great for one-off analyses, it has many drawbacks: it often requires users to chat back and forth before getting answers; it often crashes because it fails to run the code it writes; it’s generally very slow; and it’s hard to integrate into larger workflows. In general, we found it to be quite unpredictable and not a good fit for what we needed.

We wanted to find a platform that would let us build our own Data Analyst: one that was more reliable and didn’t require a user to “chat” with the model.

LangChain is a platform for building your own LLM-enabled projects

ChatGPT Data Analyst is a type of AI that’s commonly referred to as an “agent.” It takes a few requests, generates some code, runs the code, then creates the answer. It’s able to “decide” what code to write and “reflect” on whether the answer looks correct. Instead of Data Analyst, we wanted to build our own, so we turned to LangChain, a leading open source project that helps users build their own LLM-enabled applications.
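As a rough sketch of that idea (not our exact code, and the imports and flags depend on your langchain version), LangChain’s experimental pandas agent wires an LLM to a Python REPL so it can write and run code against the sheet:

```python
import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent

# Load the messy sheet without assuming a header row ("tecan_spark_export.xlsx"
# is a hypothetical file name)
df = pd.read_excel("tecan_spark_export.xlsx", header=None)

llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
agent = create_pandas_dataframe_agent(
    llm,
    df,
    verbose=True,
    allow_dangerous_code=True,  # the agent executes generated Python locally
)

result = agent.invoke("Find the measurement table and return its cell range.")
print(result["output"])
```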

Using LangChain to build production-ready agents turned out to be quite complicated: sometimes the agents would take several minutes to finish; sometimes they’d get stuck in loops; sometimes they’d write error-ridden code (something that GPT-4’s Data Analyst does quite regularly).

ChatGPT’s Data Analyst often runs into analysis errors, which unfortunately extends to LangChain agents as well.

When the agent succeeds, it feels like magic. But unfortunately, that only happens one out of five times.

The code-generating agent we built on LangChain didn’t seem very reliable. But we weren’t sure if it was our own fault, or if all code-generating agents struggled similarly. Since OpenAI has been heavily pushing their GPTs and Assistants API, we wanted to know if we’d have better luck with a pre-built agent system like Assistants.

Building a Data Analysis Assistant

OpenAI’s Assistants let developers design custom GPTs in the Playground, then trigger them programmatically from other code.

Assistants is a relatively new (still in beta) experiment from OpenAI. It’s a programmatic interface to build experiences like ChatGPT Data Analyst. Developers design assistants through OpenAI’s Playground, and can then trigger them with code.

Building an application on Assistants was surprisingly more complex than basic GPT-4 prompting. Assistants require managing file uploads, tasks, and thread runtimes, all of which we had to build separately to support the Assistants workflow.
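To give a sense of those moving parts, here is a condensed sketch of the workflow using OpenAI’s Python SDK; the Assistants API is in beta and has changed between versions, so treat the exact calls as approximate:

```python
from openai import OpenAI

client = OpenAI()

# 1. Upload the spreadsheet so the code interpreter can open it
file = client.files.create(file=open("messy.xlsx", "rb"), purpose="assistants")

# 2. Create an assistant with the code-interpreter tool and the file attached
assistant = client.beta.assistants.create(
    name="Spreadsheet extractor",
    instructions="Extract every dataset from the uploaded spreadsheet as CSV.",
    model="gpt-4-turbo",
    tools=[{"type": "code_interpreter"}],
    tool_resources={"code_interpreter": {"file_ids": [file.id]}},
)

# 3. Each request needs its own thread, message, and run, which must be polled
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Identify and extract all datasets."
)
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content)
```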

Unfortunately, Assistants had a similar success rate to our custom LangChain agent: roughly 1 in 5 executions worked. Additionally, Assistants seemed to perform worse than ChatGPT Data Analyst.

Using the same prompts, Data Analyst seemed to generally perform better than Assistants: it writes less buggy code; it gets stuck in infinite loops a lot less; and it doesn’t struggle to do basic things like opening the Excel or CSV file (which Assistants constantly struggle with).

You can test your Assistants using OpenAI’s playground interface. The Assistant seems to write a lot of bad code, and run into a lot of errors.

Assistants were also very slow, and it was hard to know whether they had gotten stuck in an infinite loop, had crashed, or — despite the system prompt and instructions to always output data — had gotten stuck prompting the user for clarification. Finally, merely using Assistants for testing ended up costing $15/day, while running our LangChain agent cost a few cents. That’s cost-prohibitive for production.

Maybe agents aren’t ready for primetime?

Building with agents is slow and unreliable, and turns prototype development into a waiting game. Agents add a lot of complexity, can rack up costs quickly, and don’t produce consistent results. What if we didn’t use agents at all?

What if we could just send the entire spreadsheet as a prompt? With Gemini Pro and Claude Opus, we can fit relatively large spreadsheets, without running into the problem of reaching the maximum “context length.”
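The naive version of this approach is a single message containing the flattened sheet; here is a sketch using Anthropic’s Python client (the file name is a placeholder for the flattened spreadsheet text):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

csv_text = open("flattened_sheet.csv").read()  # hypothetical flattened sheet

response = client.messages.create(
    model="claude-3-opus-20240229",  # large-context model available at the time
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "Here is a spreadsheet as CSV:\n\n"
                   f"{csv_text}\n\n"
                   "List every distinct dataset it contains and what each one measures.",
    }],
)
print(response.content[0].text)
```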

Unfortunately, just asking the language model in a single prompt to get the data rarely works, as it’s prone to hallucinations and mistakes. It’s unclear, however, whether it’s the prompt length or the table complexity that “confuses” the language models, so we created two simple datasets on phage titration and phage growth curves to see if we could get the right answer.

Example spreadsheet on phage titration and growth curves. Inspecting it will reveal that it contains four datasets.

These are not as complex as the Spark data, but capture the essence of messy spreadsheets. In the first example, two datasets are placed side by side in a single sheet. In the second example, we have growth curves for two phages. An AI model should be able to recognize that both sheets have multiple datasets.

Given these new test cases, we set out to programmatically ask GPT-4 and Claude Opus, in a series of prompts, to identify dataset boundaries. Instead of asking them to reformat the data (which could lead to hallucinations), we ask them to identify the boundaries of each dataset. For example, in the dataset above, A2:B11 and C2:D11 are the boundaries of two separate growth curve datasets.

Directly asking either GPT-4 or Claude to identify boundaries by giving them the CSV representation of the data… did not work very well. They would draw boundaries arbitrarily. What did work better, though, was asking the AI to describe the data, analyze the boundaries, and then re-check its own answer until it’s satisfied.
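A minimal sketch of that describe-then-check chain (prompts abridged; we ask for A1-style ranges rather than reformatted data to avoid hallucinated values):

```python
from openai import OpenAI

client = OpenAI()

def ask(history, question):
    """Append a question to the chat history and return the model's answer."""
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4-turbo", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

def find_boundaries(csv_text: str) -> str:
    history = [{"role": "system", "content": "You analyze spreadsheets given as CSV."}]
    # Step 1: describe the data, so the model builds an internal picture of it
    ask(history, f"Describe the data in this spreadsheet:\n\n{csv_text}")
    # Step 2: propose boundaries as A1-style ranges
    ask(history, "List the A1-style cell range of every distinct dataset, e.g. A2:B11.")
    # Step 3: self-check against the raw sheet and correct the ranges
    return ask(history, "Re-check those ranges against the raw data and return the corrected list as JSON.")
```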

Strategies for finding boundaries for small spreadsheets

Chaining a few prompts together worked surprisingly well! It was much faster than using LangChain or Assistants agents, and generally got better results. However, we still had a problem with slowness. Both GPT-4 and Claude Opus are very large, complex models, which makes them slow. Getting a response from a chain of prompts would still take a few minutes. How can we fix that?

Prototyping with speed

One day after OpenAI released its new Assistants API (April 17th), Meta released its new Llama 3 model. Many were claiming the model performed almost as well as Claude Opus. But even better — it could run on machines as small as MacBooks, and it was openly available — which meant providers like Groq.com would soon serve it. Groq is a compute provider heavily optimized for LLMs, allowing it to offer insanely fast inference speeds.

Demo of using Groq and Llama 3 to extract data. It runs multiple prompts under the hood. Its speed is measured in seconds, while Claude Opus and OpenAI’s GPT-4 are measured in minutes.
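Swapping in Groq is mostly a client change; here is a sketch with Groq’s Python SDK (the model ID is the one Groq served at the time, and the input file is a placeholder for the flattened sheet):

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

csv_text = open("flattened_sheet.csv").read()  # hypothetical flattened sheet

response = client.chat.completions.create(
    model="llama3-70b-8192",  # Llama 3 70B as served by Groq at the time
    messages=[
        {"role": "system", "content": "You analyze spreadsheets given as CSV."},
        {"role": "user", "content": f"List every dataset in this sheet:\n\n{csv_text}"},
    ],
)
print(response.choices[0].message.content)
```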

Swapping an ultrafast AI for the much slower Opus or GPT-4 unlocks new UX paradigms. Instead of forcing users to wait for an agent to do all of the work, we could instead add a data exploration copilot where you can select any data and it’ll explain what you’ve selected. This type of live exploration could be paired with an export functionality to help you “slice and dice” your data with ease.
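Here is one way a “select and explain” copilot could work; the range handling and prompt are illustrative, not our shipped implementation:

```python
from openpyxl import load_workbook
from groq import Groq

client = Groq()

def explain_selection(path: str, cell_range: str) -> str:
    """Read an A1-style selection (e.g. 'A2:B11') and ask a fast model to explain it."""
    sheet = load_workbook(path, data_only=True).active
    rows = [
        ",".join("" if c.value is None else str(c.value) for c in row)
        for row in sheet[cell_range]
    ]
    prompt = (
        f"A user selected the range {cell_range} in a spreadsheet:\n\n"
        + "\n".join(rows)
        + "\n\nExplain briefly what this selection contains."
    )
    reply = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```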

Most AI model development has focused on increasing intelligence, not speed. However, really fast inference actually makes the product experience better. Previously, the AI models took too long to respond, so we had very slow feedback loops. Now that the feedback loops are nearly instantaneous, we can have the model check and correct its own answers in the background, leading to fewer errors.

What other UX/UI paradigms can be explored, now that AI is instantaneous?

We’re at the beginning of the beginning

Many of the technologies we used in Slice & Dice — streaming Assistants and Llama 3 — have only existed for two weeks. Llama 3 + Groq has opened the door to new categories of user interfaces, and it’s only been around since mid-April.

Slice & Dice is also still at the beginning, and we’re excited to explore several other strategies to pull insights from spreadsheets:

  1. Use common patterns of dirty data to guide the language model. There are a couple dozen common ways that spreadsheet data ends up messy. We can use few-shot examples and chained prompts to guide language models to get better results.
  2. Use pattern-based strategies and chained agents — An extension of the above idea using agents. We could use an overseeing agent that directs other agents which specialize in different data patterns. Using a fleet of parallel agents could also help overcome the speed issues.
  3. Provide tools for the model to “navigate” the spreadsheet — We could alternately provide tools for a language model to “explore” the data programmatically. These navigation tools could look like “get the data from row 1” or “get the median from column B”. Such tools give the model a “glimpse” of the data — like a flashlight — as it searches for the answer (see the sketch after this list). SWE-agent has demonstrated that building special tools and data formats can drastically increase LLM performance.
  4. Design more interfaces for co-creation — With the speed of Llama 3 and Groq, we should explore more tools that help users collaborate and co-create with the AI. Users could highlight the data, or describe the data, or guide the agent, in various ways. Maybe the AI can make suggestions, and the user works with those suggestions as they accomplish their goals.
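Here is a rough sketch of what those navigation tools might look like when exposed to a tool-calling model (the tool names, schema, and dispatch are hypothetical):

```python
import pandas as pd

df = pd.read_excel("messy.xlsx", header=None)  # hypothetical messy sheet

# Small, composable "flashlight" tools the model can call instead of
# seeing the whole sheet at once.
def get_row(index: int) -> str:
    """Return a single row of the sheet as CSV text."""
    return df.iloc[index].to_csv(header=False)

def get_column_median(column: int) -> float:
    """Return the median of the numeric values in one column."""
    return pd.to_numeric(df.iloc[:, column], errors="coerce").median()

TOOLS = {
    "get_row": get_row,
    "get_column_median": get_column_median,
}

def dispatch(tool_name: str, **kwargs):
    """A tool-calling model names a tool and arguments; we run it and return the result."""
    return TOOLS[tool_name](**kwargs)
```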

Spreadsheets are the beginning

All labs work around spreadsheets, and while tools like Benchling have attempted to displace them, no lab has managed to move entirely off spreadsheets.

If we could get any information from a spreadsheet in just a click — in any shape we need: as charts or CSVs or publication-ready tables — then we could spend more time on our hypotheses and less time in our sheets.

We’ve spent a lot of time thinking about getting data from spreadsheets.

If this sort of work excites you, play with our demo and inspect our code (and improve on it; PRs welcome). Say hello at hello@sphinxbio.com or take a look at our jobs at sphinxbio.com/careers.


View our latest demo at sheets.sphinxbio.com

View our latest code on Github: https://github.com/sphinxbio/sliceanddice
