Dataman Screencast Video

Flow cytometers and optical blood analyzers produce large amounts of multivariate data. A blood analyzer typically measures four or five parameters for each of about 10,000 cells per sample. The data for any one cell has little meaning, while the average for all cells is completely useless. This is a classic big data problem. Reportable results are derived statistically, primarily by cluster analysis. A clinical instrument automatically performs this analysis based on known patterns, routinely reporting only a very condensed set of results. Although most clinical analyzers can show and record raw data, the raw data is of little interest to the typical user. Analytical instruments measure many of the same parameters but are used by researchers, whose primary interest is the raw data. In both cases, the instrument itself cannot discover new patterns. That discovery must be made by hematologists looking at raw data.

Useful new patterns typically require at least one thousand samples to become evident. At five parameters for each of 10,000 cells, one thousand samples amount to 50 million numbers, and even the most insightful hematologist is not going to see a subtle pattern in that many. A few patterns, discovered largely by accident, would not be sufficient. Hematologists continually find new ways to prepare blood to help reveal cell types. The success of these preparations largely depends on whether they produce a reliable statistical pattern. The massive data sets have to be presented in some way that reduces complexity without losing the detail that reveals new patterns.

Histograms of a single parameter and scatter plots of two are the most widely used forms for displaying flow cytometry data. Most clinical instruments can produce these in near real time, but these systems are intended for continuous high-speed sample processing and lack the flexibility for detailed multi-parameter analysis of large data sets. The Flow Cytometry Standard (FCS) defines a nearly universal file format for flow cytometers to export their data to other systems for offline analytical review.
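Both display forms reduce to counting events into bins. The following is a minimal sketch of that counting in plain Python, not a description of how any particular instrument or program does it. The function names, the channel range of 0 to 1024, and the choice to pile off-scale events into the edge bins are all assumptions made for illustration.

```python
def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi) to a bin number in 0..bins-1.

    Off-scale values are clamped into the first or last bin
    (an assumption for this sketch).
    """
    i = int((value - lo) * bins / (hi - lo))
    return min(max(i, 0), bins - 1)


def histogram(values, bins, lo, hi):
    """Count one parameter into equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[bin_index(v, lo, hi, bins)] += 1
    return counts


def scatter_counts(xs, ys, bins, lo, hi):
    """Count two parameters into a bins x bins grid, indexed grid[y_bin][x_bin]."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        grid[bin_index(y, lo, hi, bins)][bin_index(x, lo, hi, bins)] += 1
    return grid


if __name__ == "__main__":
    # Three made-up events, each with two parameter values.
    xs = [0, 1023, 512]
    ys = [0, 1023, 100]
    print(histogram(xs, 4, 0, 1024))
    print(scatter_counts(xs, ys, 4, 0, 1024))
```

A display would then map each count to a bar height (histogram) or a dot density or color (scatter plot).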

A number of commercial and free programs exist for the specific purpose of reading and displaying FCS files. These tend to emphasize preparation of a few publication-quality images, for which the user can afford to spend considerable time and effort. They are poorly suited to rapidly analyzing thousands of samples with many different parameter combinations. I wrote the Dataman program to give research hematologists the tool they need for that kind of analysis.

Dataman makes plot design very easy. It presents the parameters defined in an FCS file in a table for selecting any or all possible histograms and scatter plots. It treats all selected plots as thumbnails in a gallery with uniform sizing, yet allows any of them to be resized in place or pulled out of the gallery. It annotates plots according to a user-controlled formula that adapts to the size of the plot. Selective zooming provides a detailed picture of areas of interest without having to increase a plot's window size.
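The set of "all possible" plots follows directly from the parameter list: one histogram per parameter and one scatter plot per unordered pair of parameters. A short sketch of that enumeration is given here. The function name and the parameter names are hypothetical and are not taken from Dataman or from any particular FCS file.

```python
from itertools import combinations


def possible_plots(parameters):
    """Return (histograms, scatter_plots) for a list of parameter names.

    Each histogram is a single parameter name; each scatter plot is an
    unordered pair of distinct parameter names.
    """
    histograms = list(parameters)
    scatter_plots = list(combinations(parameters, 2))
    return histograms, scatter_plots


if __name__ == "__main__":
    # Hypothetical parameter names for a four-parameter file.
    names = ["FSC", "SSC", "FL1", "FL2"]
    hists, scatters = possible_plots(names)
    print(len(hists), "histograms;", len(scatters), "scatter plots")
    for x, y in scatters:
        print(f"{x} vs {y}")
```

For n parameters this gives n histograms and n(n-1)/2 scatter plots, so a five-parameter file would offer 5 histograms and 10 scatter plots.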

It can be useful to determine where the cells that make up a cluster in one plot appear in other plots. This is called gating. When the user assigns a gate type to a cluster in one plot, Dataman applies that gate type's color to the same cells in all other displayed plots. New gate types can be defined at any time and are immediately available for assignment to areas drawn by the user.
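The essential point of gating is that the label belongs to the cell, not to the plot, so every plot that draws that cell can color it the same way. The sketch below illustrates this with a rectangular region. The rectangle, the function names, the parameter names, and the gate name and color are all assumptions for illustration. The areas a user draws in Dataman need not be rectangles.

```python
def assign_gate(events, x_param, y_param, region, gate, labels):
    """Label every event that falls inside a rectangle on one plot.

    events  : list of dicts mapping parameter name -> value
    region  : (x_min, x_max, y_min, y_max) in the plot's own units
    labels  : per-event list of gate names, updated in place
    """
    x_min, x_max, y_min, y_max = region
    for i, event in enumerate(events):
        if x_min <= event[x_param] <= x_max and y_min <= event[y_param] <= y_max:
            labels[i] = gate


def colors_for_plot(labels, gate_colors, default="gray"):
    """Per-event colors, identical for every plot because they depend only on labels."""
    return [gate_colors.get(label, default) for label in labels]


if __name__ == "__main__":
    # Three made-up events with hypothetical parameter names.
    events = [
        {"FSC": 100, "SSC": 50, "FL1": 900},
        {"FSC": 800, "SSC": 700, "FL1": 20},
        {"FSC": 120, "SSC": 60, "FL1": 880},
    ]
    labels = [None] * len(events)
    # Gate drawn on the FSC/SSC plot ...
    assign_gate(events, "FSC", "SSC", (0, 200, 0, 100), "gate A", labels)
    # ... and the same colors are used when drawing any other plot, e.g. FL1.
    print(colors_for_plot(labels, {"gate A": "red"}))
```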

Its convenient plotting features make Dataman a much better tool for looking at flow cytometry data in new ways, but it has one feature that goes beyond convenience. It can display the data in a sequence of FCS files, following their order either in a directory or in a list of files and/or directories. Hot keys Ctrl+N and Ctrl+P open the next or previous file in the sequence. This would be a significant convenience even if Dataman plotted as slowly as similar programs, most of which require several seconds or more to read an FCS file and produce plots from the data. Dataman can read an FCS file containing four parameters of ten thousand cells and produce all four possible histograms and all six possible scatter plots in less than a quarter second. It can do this on an ordinary desktop computer without any special hardware to accelerate display or computation. Dataman's speed results from internal data structures and algorithms that I have optimized for this purpose.
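Stepping through files in this way can be sketched as a flat, ordered list with a cursor. The example below is only an illustration of the idea and says nothing about Dataman's internals. The names expand_sources and FileSequence, the use of name order within a directory, the ".fcs" suffix filter, and stopping (rather than wrapping) at either end of the list are all assumptions.

```python
from pathlib import Path


def expand_sources(sources, suffix=".fcs"):
    """Flatten a list of files and/or directories into one ordered file list.

    Directories contribute their matching files sorted by name (an assumed
    ordering); explicitly named files are kept in the position given.
    """
    files = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            files.extend(sorted(
                f for f in path.iterdir()
                if f.is_file() and f.suffix.lower() == suffix
            ))
        else:
            files.append(path)
    return files


class FileSequence:
    """A cursor over a non-empty file list, as driven by next/previous keys."""

    def __init__(self, files):
        self.files = list(files)
        self.index = 0

    def current(self):
        return self.files[self.index]

    def next(self):
        # Stay on the last file instead of wrapping around (assumption).
        self.index = min(self.index + 1, len(self.files) - 1)
        return self.current()

    def previous(self):
        # Stay on the first file instead of wrapping around (assumption).
        self.index = max(self.index - 1, 0)
        return self.current()
```

A key handler would call next() or previous() and then read and plot whatever file current() returns.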

Plotting at this speed enables Dataman to display sequences of 2-D pictures as a movie, essentially using time to display a third parameter without occluding details of the other two parameters.
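One way to picture this is to slice the events by the value of a third parameter and draw one 2-D scatter plot per slice, so that stepping through the frames in time sweeps through the third parameter. The sketch below builds such frames as grids of counts. It is an assumed reading of the idea for illustration only, with hypothetical function names and an assumed channel range, and it does not describe how Dataman builds its movies.

```python
def bin_index(value, lo, hi, bins):
    """Map a value in [lo, hi) to 0..bins-1, clamping off-scale values."""
    i = int((value - lo) * bins / (hi - lo))
    return min(max(i, 0), bins - 1)


def movie_frames(events, n_frames, bins, lo, hi):
    """Split (x, y, z) events into n_frames scatter grids, one per z slice.

    Returns frames[f][y_bin][x_bin] = number of events in that cell
    whose z value falls in slice f.
    """
    frames = [[[0] * bins for _ in range(bins)] for _ in range(n_frames)]
    for x, y, z in events:
        f = bin_index(z, lo, hi, n_frames)
        frames[f][bin_index(y, lo, hi, bins)][bin_index(x, lo, hi, bins)] += 1
    return frames


if __name__ == "__main__":
    # Three made-up events; the third value selects the frame.
    events = [(0, 0, 0), (1023, 1023, 1023), (0, 0, 1023)]
    for number, frame in enumerate(movie_frames(events, 2, 2, 0, 1024)):
        print("frame", number, frame)
```

Because each frame is an ordinary 2-D plot, no point hides another along the third axis, which is the occlusion problem a static 3-D view would have.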
