(Semi) Structured Scientific Data
How many ways can data inside a CSV or spreadsheet be structured? Every scientist can commiserate over the challenges of working with a messy file from an instrument, vendor, or coworker. Each file is different, and machines aren’t good at finding the data inside the file and using it to build a complete Dataset – you need a human to do the work. In past blog posts we described how we use AI to extract plate data from spreadsheets, and how we built on that to extract many kinds of data, in the hope of making the work done by humans easier.
We are excited to share more developments in this area, featuring our new data import tool that lets scientists better extract complete Datasets from their files. This is available to all Sphinx Organization users today (and if you aren’t a Sphinx user, you can sign up here or reach out to learn more).
From File to Dataset
Our data import tool lets you upload any number of files at once – useful whether you have one file with your results or many files that are a mix of results, metadata, and experimental plans. We built this import tool in response to the diversity of files we see from customers, CROs, and instrument vendors.
When you upload your files, we let you add them into a workspace where you can define where the data are located and what their layout is. We know that seeing the end result is important to ensure you are taking the right steps, so we also give you a realtime preview of what your data will look like in Sphinx.
When you have all the data regions identified, you can then merge them together into one final Dataset. We call this ‘merge’ because we know that not every data region in every data file is treated the same. Sometimes you want to use the data as new rows (such as parsing in multiple plates) and sometimes you want to connect the data (such as joining two Datasets together by the sample name). This is particularly useful when your Dataset is from many experiments, assays, or collaborators and you don’t want to make a separate Dataset for each of these sources.
No matter how you need to connect your data, our merge tool lets you define the requisite pairwise connections between extracted data regions. Each data region you selected is added into the final result, so you can see what the final Dataset preview will be once you use it in an Analysis.
Templates Lead to Reproducibility
When you finalize and import your first Dataset, all future Datasets can be created using the same data import. The data imports you create become Templates that reproducibly create Datasets using the same rules each time.
That means no more custom code, cut/paste, or getting lost in connecting data over multiple files. Our data import handles that all for you – requiring just minutes to initially build and seconds to successively apply as a template for future files. This is faster than the time it takes to create a Jupyter notebook or get a response from a coworker on how to connect the data files in your shared folder.
Ready to get started? If you’re already an Organization user, head over to your Sphinx account, click “Advanced Dataset Import”, and upload your data! If you need help, you can read our docs or email us at support@sphinxbio.com. If you’re not yet an Organization user, reach out to us at hello@sphinxbio.com.
Want to know how it works? Read on!
How Does it Work?
At Sphinx, we conceptualize data as coming from structured units in a file, which we refer to as data regions. These data regions include both the values measured (the value region) and their descriptive data (the label regions). Label regions surround the value region to provide context to the values. They can appear at the top, bottom, left, or right of the value region (though most often they are only at the top or left).
Many scientific assays are carried out in plates, which can be described as a value region with both left and top label regions. This layout is generally considered a “matrix” (you can read more about other types here). The most common layout when analyzing data is a “table”, where a single top label region identifies each column. Some tables and matrices may have multiple label regions in the same location, making them multi-labeled (equivalent to a multi-indexed item when working in a tool like pandas).
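To make the vocabulary concrete, here is a minimal sketch of how a data region could be modeled in Python. The class names (`CellRange`, `DataRegion`), the zero-indexed end-exclusive coordinates, and the rule used to name the layout are all assumptions made for illustration; they are not Sphinx's internal representation.

```python
from dataclasses import dataclass, field
from typing import Literal

Side = Literal["top", "bottom", "left", "right"]


@dataclass(frozen=True)
class CellRange:
    """A rectangular block of cells (hypothetical: zero-indexed, end-exclusive)."""
    row_start: int
    row_end: int
    col_start: int
    col_end: int

    @property
    def shape(self) -> tuple[int, int]:
        return (self.row_end - self.row_start, self.col_end - self.col_start)


@dataclass
class DataRegion:
    """A value region plus the label regions that surround it."""
    values: CellRange
    labels: dict[Side, list[CellRange]] = field(default_factory=dict)

    @property
    def layout(self) -> str:
        # Only sides that actually carry a label region count.
        sides = {side for side, ranges in self.labels.items() if ranges}
        if sides == {"top"}:
            return "table"   # a single top label region names each column
        if sides == {"top", "left"}:
            return "matrix"  # e.g. a plate: column numbers on top, row letters on the left
        return "other"


# A 96-well plate: 8 x 12 values with one top and one left label region.
plate = DataRegion(
    values=CellRange(1, 9, 1, 13),
    labels={"top": [CellRange(0, 1, 1, 13)], "left": [CellRange(1, 9, 0, 1)]},
)

# A plain table: 4 rows of 3 columns under one header row.
table = DataRegion(
    values=CellRange(1, 5, 0, 3),
    labels={"top": [CellRange(0, 1, 0, 3)]},
)
```

In this sketch `plate.layout` evaluates to "matrix" and `table.layout` to "table", mirroring the two layouts described above.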
Tables that meet the following criteria are called “tidy tables” and are the layout our data import tool helps you create. (If you want to read more about tidy tables and all the benefits they confer, check out Wickham, Hadley. “Tidy data.” Journal of Statistical Software 59 (2014): 1-23. https://doi.org/10.18637/jss.v059.i10.)
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
When you upload files to Sphinx, we parse out the files and any relevant tabs (if an Excel file was uploaded). Together these comprise the inputs from which you can select data regions. When you define a data region, we capture where those data are located and what type of layout the data are in (matrix or table). This lets us pivot out a subset of the data to create a preview whilst still linking back to the source data region. We only load a preview in cases where you have a large Dataset, as transforming the entire Dataset each time you change a setting would be very slow.
Once all the data regions are identified we create an interim tidy table for each data region. We do this by using the label regions as variables and the value region as observations. Each variable creates a new column, and for each column we populate it with the value:label pairs from the data region. Depending on the number of label regions this is as simple as ingesting the table, or as complex as pivoting many overlapping label regions into one column. This lets us then join or concatenate the data, since we know the data are in a tidy format.
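As a rough illustration of this pivot, the sketch below turns a small matrix-style region into a tidy table with pandas. The grid contents and the column names (`row`, `column`, `value`, `well`) are made up for the example, and `melt` is simply one way to do the pivot; it is not necessarily how the import tool implements it.

```python
import pandas as pd

# A 2 x 3 corner of a plate as it might sit in a spreadsheet (made-up values):
# the first row is the top label region, the first column the left label region.
raw = [
    [None, 1, 2, 3],
    ["A", 0.11, 0.52, 0.93],
    ["B", 0.14, 0.55, 0.96],
]

top_labels = raw[0][1:]
left_labels = [line[0] for line in raw[1:]]
values = [line[1:] for line in raw[1:]]

matrix = pd.DataFrame(
    values,
    index=pd.Index(left_labels, name="row"),
    columns=pd.Index(top_labels, name="column"),
)

# Each label region becomes a variable (column); each value becomes one observation (row).
tidy = matrix.reset_index().melt(id_vars="row", var_name="column", value_name="value")
tidy["column"] = tidy["column"].astype(int)
tidy = tidy.sort_values(["row", "column"]).reset_index(drop=True)

# Optional convenience column combining both labels, e.g. "A1".
tidy["well"] = tidy["row"] + tidy["column"].astype(str)
```

The result has one row per well, so it satisfies the tidy criteria and can be joined or concatenated with other tidy tables.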
During the merging step we let you merge or concatenate the tidy tables one-by-one, starting with the first table you added. These are effectively pairwise, in-place joins or concatenations until only one table is left — the final table that becomes the Dataset.
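The pairwise merging can be thought of as a left fold over the tidy tables. The sketch below assumes each merge step is recorded as a hypothetical `(how, table, key)` tuple and uses pandas `concat` and `merge`; the step format, the choice of a left join, and the sample data are illustrative assumptions rather than the tool's actual behavior.

```python
from functools import reduce

import pandas as pd


def apply_step(left: pd.DataFrame, step: tuple) -> pd.DataFrame:
    """Combine the running result with the next tidy table."""
    how, right, key = step
    if how == "concat":
        # Add the new table's observations as extra rows.
        return pd.concat([left, right], ignore_index=True)
    if how == "join":
        # Connect the new table's variables by a shared key (left join assumed here).
        return left.merge(right, on=key, how="left")
    raise ValueError(f"unknown merge step: {how!r}")


# Made-up tidy tables: two plates of results and one metadata table.
plate_1 = pd.DataFrame({"sample": ["s1", "s2"], "value": [0.11, 0.52]})
plate_2 = pd.DataFrame({"sample": ["s3"], "value": [0.93]})
metadata = pd.DataFrame(
    {"sample": ["s1", "s2", "s3"], "compound": ["DMSO", "cmpd-1", "cmpd-2"]}
)

# Start from the first table added and apply each step in order until one table is left.
steps = [("concat", plate_2, None), ("join", metadata, "sample")]
dataset = reduce(apply_step, steps, plate_1)
```

Because the steps are applied in order, a different sequence can give a different result, which is one reason a preview of the final table is useful.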
This Dataset is assigned a schema and a unique identifier, and is associated with all the files and transformations that led to its creation. After that it is available to be used in an Analysis!
Limitations
We know that selecting each data region can be tedious — especially when you have many in a single file. For now, our data import tool requires you to define where the data are located for each data region. This ensures high accuracy and quality for downstream data.
Plates are a special case of a matrix, where only one top and one left label region are present. We started with this layout since we know it is the most common, but we realize that other kinds of matrix-like layouts exist. We are working to enhance how we parse matrices and how we pivot out data so that any kind of data region (such as one with top, bottom, left, and right label regions) can be parsed.
Future Direction
We are working to incorporate our other work (Extraction of data from spreadsheets and Extract data using AI) so that uploading data becomes as simple as providing the file. If you are excited by working on those kinds of problems and want to build better software for scientists, we’re hiring! You can see our latest code on GitHub at https://github.com/sphinxbio.
Reach out with any questions to hello@sphinxbio.com and thanks for reading.