Introducing platechain, a universal plate parser

October 31, 2023

Introduction

Do you have plate data and platemaps stored in Excel sheets or CSVs? Biological research takes many forms, but one tool remains a staple across various applications: the microplate. Whether you're quantifying biomarkers with an ELISA or running high-throughput reporter assays to accelerate drug discovery, plates are indispensable. Yet, despite their ubiquity, the handling of plate data often remains trapped in outdated, cumbersome systems like Excel sheets and CSV files. That’s why we’re excited to introduce platechain, an open-source Python package for universal plate parsing.

platechain can parse both platemaps, which outline the intended design of an experiment, and experimental results recorded in a plate layout. This dual functionality makes platechain a helpful tool for bringing together experimental metadata and results.

While platechain is solely focused on plate-based data, we have also seen evidence that our approach can be extended to other applications in biology. This includes the parsing of semi-structured information from documents, such as Contract Research Organization (CRO) data. If you're grappling with similar data parsing issues, we'd love to hear from you.

The problem

Plates have several defining characteristics that lend themselves well to a universal parser. They come in a range of standard sizes (24, 96, 384, or 1536 wells) and always have a rectangular layout. However, despite this uniformity, parsing plates can be an extremely painful process. Plate data can come from various machines, such as plate readers and liquid handlers, or even from scientists writing directly into a spreadsheet. In addition, each machine type (and sometimes even different firmware versions of the same machine) outputs data in a unique format. This inconsistency means that scientists and informaticians have to write a unique parser for each machine. Labs usually operate multiple types of machines, meaning that even small labs might have to write and maintain half a dozen parsers.

An example of a typical Excel sheet output by a plate reader

Adding to this complexity, many scientists create custom platemaps by hand, which follow no set pattern. These designs might differ between scientists or even between runs from the same scientist. Building a parser for these custom platemaps is quite difficult and often leads to scientists copying and pasting values by hand. This problem can grow so large that some biotech companies build a specialized web application for plate design (a serious amount of work) just to deal with it.

platechain aims to offer a simple one-size-fits-most parser that solves these challenges.

How does it work?

Traditional plate parsers are extremely specific to a given manufacturer and device, and they take a lot of work to develop. The resulting code is tightly tied to one specific output format. platechain takes a different approach by leveraging recent advances in LLMs (e.g. ChatGPT) to extract plate information. Rather than extracting the raw values directly, however, it uses an LLM to produce a JSON object specifying the location of the corners of the plate. This JSON object can then be used by downstream code to reliably extract plate information from messy spreadsheets.

The JSON format the LLM returns (see the code for how to customize)
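The exact schema is defined in the platechain code and is not reproduced here. Purely as an illustration of the idea, the sketch below assumes a hypothetical corner format (`top_left` and `bottom_right`, each with a zero-based `row` and `col`) and shows how ordinary downstream code could slice the plate out of a messy sheet once the corners are known. The function names, the JSON keys, and the sample sheet are all assumptions made for this example, not platechain's API.

```python
import json


def row_label(index):
    """Convert a zero-based row index to a plate row label (0 -> 'A', 26 -> 'AA')."""
    label = ""
    index += 1
    while index > 0:
        index, rem = divmod(index - 1, 26)
        label = chr(ord("A") + rem) + label
    return label


def extract_plate(grid, corners_json):
    """Slice the rectangle described by the (hypothetical) corner JSON out of a grid."""
    corners = json.loads(corners_json)
    top, left = corners["top_left"]["row"], corners["top_left"]["col"]
    bottom, right = corners["bottom_right"]["row"], corners["bottom_right"]["col"]
    if top < 0 or left < 0 or top > bottom or left > right:
        raise ValueError("corners do not describe a valid rectangle")
    if bottom >= len(grid) or any(right >= len(grid[r]) for r in range(top, bottom + 1)):
        raise ValueError("corners fall outside the sheet")
    return [row[left:right + 1] for row in grid[top:bottom + 1]]


def to_wells(plate):
    """Map well names such as 'A1' to values, row by row."""
    return {
        f"{row_label(r)}{c + 1}": value
        for r, row in enumerate(plate)
        for c, value in enumerate(row)
    }


# A tiny made-up sheet: a title row, a header row, two data rows, and a note.
SHEET = [
    ["Reader export", "", "", ""],
    ["", "1", "2", "3"],
    ["A", 0.11, 0.12, 0.13],
    ["B", 0.21, 0.22, 0.23],
    ["Notes: run 7", "", "", ""],
]

# What an LLM response might look like under the assumed schema.
LLM_RESPONSE = '{"top_left": {"row": 2, "col": 1}, "bottom_right": {"row": 3, "col": 3}}'

plate = extract_plate(SHEET, LLM_RESPONSE)
wells = to_wells(plate)
```

The appeal of this split is that the LLM only has to answer a small, checkable question (where are the corners?), while the values themselves are copied by deterministic code and never pass through the model.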

The method employed by platechain has roots in research on information extraction from semi-structured documents. In particular, Google just released a paper demonstrating how LLMs can be used to extract structured information from semi-structured formats such as receipts. platechain provides an initial implementation of these research ideas in the biological domain.

The final parsed plate

If you’re interested in how this approach was developed, we describe our iterative prompt engineering process in more detail in our recent LRIG Presentation.

Using platechain

Getting started is as simple as can be: pip install platechain.

platechain is released as a Python package under the Apache-2.0 License, meaning that anyone can get started parsing plates with a simple pip install. We provide some example notebooks in our repository showing how to use the package.

Under the hood, we use LangChain to perform prompt engineering, help with LLM interactions, and structure the output. This has the added benefit of letting us plug easily into the growing LangChain ecosystem. In particular, we were able to add a LangServe template that gives users an easy way to deploy platechain as an API with a simple interactive user interface (see their launch post for more details).

LangServe Playground for platechain

Limitations

While platechain aims to provide a universal plate parser, a number of limitations mean it might not be suitable for all use cases yet.

  • Standard Plate Sizes Only: Currently, platechain is designed to work with standard plate sizes, such as 24, 96, 384, or 1536 wells. Custom sizes are not officially supported, though this should be easy to extend.
  • Rectangular Designs Only: platechain also only parses rectangular plate layouts. Unconventional designs or plate representations are not officially supported.
  • Long Inputs/Outputs: If there are too many plates in a single file, platechain might fail to parse them correctly. LLMs have a limited context window, and platechain does not yet support chunking documents.
  • Off-by-One Errors: platechain can be susceptible to off-by-one errors, where it may include the plate header as part of the data. Production systems (such as what we’ve built at Sphinx) will need to handle this failure mode.
  • Not Always Correct: As with any LLM application, it is impossible to guarantee 100% accuracy. If you encounter any sort of problems with a specific file type or machine, please file an issue!
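As one illustration of guarding against the first and fourth limitations, the sketch below checks an extracted region against the standard plate shapes (4×6, 8×12, 16×24, 32×48) and trims a leading row when it is one row too tall and that row reads 1, 2, 3, and so on. This is a hypothetical post-processing step written for this post's readers, not part of platechain and not a description of Sphinx's production system; the function names are assumptions.

```python
# Standard plate layouts as (rows, columns), keyed by well count.
STANDARD_SHAPES = {24: (4, 6), 96: (8, 12), 384: (16, 24), 1536: (32, 48)}


def looks_like_column_header(row):
    """True when a row reads 1, 2, 3, ... as integers or integer strings."""
    try:
        return [int(str(v).strip()) for v in row] == list(range(1, len(row) + 1))
    except ValueError:
        return False


def normalize_region(region):
    """Return (well_count, data), trimming a header row included by mistake."""
    if not region or any(len(row) != len(region[0]) for row in region):
        raise ValueError("region is empty or not rectangular")
    rows, cols = len(region), len(region[0])
    wells_by_shape = {shape: wells for wells, shape in STANDARD_SHAPES.items()}

    # Exact match: nothing to fix.
    if (rows, cols) in wells_by_shape:
        return wells_by_shape[(rows, cols)], region

    # Off-by-one: one extra leading row that is just the column header.
    if (rows - 1, cols) in wells_by_shape and looks_like_column_header(region[0]):
        return wells_by_shape[(rows - 1, cols)], region[1:]

    raise ValueError(f"{rows}x{cols} is not a standard plate shape")


# Made-up 24-well data, plus a copy with the column header wrongly included.
data = [[round(0.1 * r + 0.01 * c, 2) for c in range(6)] for r in range(4)]
with_header = [[str(c) for c in range(1, 7)]] + data

well_count, cleaned = normalize_region(with_header)
```

A check like this only catches the header-row case; a region that is one column too wide (row letters included) or shifted by one cell would need its own rule, and anything that fails validation is probably best flagged for human review.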

Despite these caveats, we have found that platechain easily parses a large number of plate-based assays.

How you can get involved

Head over to https://github.com/sphinxbio/platechain to make feature requests or add your own expertise! We open sourced platechain because we think it will be broadly helpful.

Feel free to reach out with any questions to: platechain@sphinxbio.com

If you get excited by open source and want to build better software for scientists, we’re hiring!

Interested in a hosted version of platechain or chatting more about how LLMs can accelerate biology? Let’s chat.
