The identification of two biologically distinct subtypes – activated B cell (ABC) and germinal center (GC) – of diffuse large B-cell lymphoma (DLBL) by their gene expression profiles transformed our understanding of disease pathogenesis and led to high expectations for the development of novel targeted therapies, especially for the unfavorable ABC type. Progress in the field has been hampered, however, by inconsistencies inherent in using immunohistochemistry (IHC) staining to distinguish the subtypes. Although several IHC algorithms have been proposed to separate the good-prognosis GC type from the aggressive ABC type, inter-observer variability in interpretation of pathologic specimens has plagued this approach. For this reason, gene expression arrays remain the optimal approach to subtyping, and the technical complexities of these systems are steadily becoming more manageable. Methods to extract sufficient mRNA from formalin-fixed, paraffin-embedded (FFPE) tissue have been improved to the extent that routine diagnostic specimens can now be used without having to rely on fresh tissue. The other obstacle to clinical implementation of microarray analysis has been the lack of bioinformatic tools sufficiently robust to allow for real-time classification of individual cases as they occur in routine practice. Previous instruments depended on retrospective analysis of bulk data and were therefore useful for research studies but not for analysis of individual samples. The groups from the Haematological Malignancy Diagnostic Service at St. James’s University Hospital, Leeds, U.K., and the Bioinformatics Group at the University of Leeds appear to have now resolved this problem as well.

Starting from the cell-of-origin classifier developed at the National Cancer Institute,1 Care et al. evaluated more than 30 machine learning tools reported in the literature for determining the DLBL cell of origin. Machine learning tools derive from artificial intelligence research and center on algorithms that learn to identify recurring patterns from available data and link them to specific decisions or assigned classes. Once trained, the algorithms can apply the learned rules to new data. The paper by Care and colleagues showed that a combination of four of these machine learning tools, voting in a balanced fashion, could effectively classify DLBL data sets from a range of different tissue sources and array platform types. Importantly, this combination of tools was better at separating the good (GC) and poor (ABC) risk groups in most data sets, including those derived from either FFPE or fresh material. It did so without assigning more cases to the “unclassified”/type III category, although a small molecular gray zone always persists. In validating this classifier, the authors performed an extensive meta-analysis of gene expression across 10 data sets (more than 2,000 samples) of DLBL. Interestingly, the optimized classifier requires only 20 of the genes that were included in the original NCI classifier, and the authors were also able to show that increasing the number of analyzed genes up to 180 did not improve the ability to classify subtypes with significant differences in survival.
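The idea of several classifiers "voting in a balanced fashion" can be illustrated with a minimal sketch. This is not the DAC implementation: the function name `balanced_vote`, the equal weighting of each classifier, and the agreement threshold of three out of four are all assumptions made for illustration, since the commentary does not describe how the four tools are combined.

```python
from collections import Counter


def balanced_vote(calls, min_agreement=3):
    """Combine subtype calls from several classifiers, one equal vote each.

    calls: list of labels such as "ABC" or "GC", one per classifier.
    min_agreement: hypothetical threshold; cases with weaker consensus
    are left in the gray zone and reported as "Unclassified".
    """
    if not calls:
        raise ValueError("at least one classifier call is required")
    # Count the votes and take the most frequent label.
    label, count = Counter(calls).most_common(1)[0]
    return label if count >= min_agreement else "Unclassified"


if __name__ == "__main__":
    # Four hypothetical classifiers, three of which agree on ABC.
    print(balanced_vote(["ABC", "ABC", "GC", "ABC"]))
    # An even split stays unclassified.
    print(balanced_vote(["ABC", "GC", "GC", "ABC"]))
```

A consensus threshold of this kind is one simple way to keep a small "unclassified" category for cases on which the individual tools disagree.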

The tool, described as the DLBL automatic classifier (DAC), is available as an open source, free-standing application. It can be found at www.bioinformatics.leeds.ac.uk/~bgy7mc/DAC/ and can be downloaded and used to classify both whole data sets and individual cases. For the latter, all that is required is a background data set of at least 30 cases for the array platform type being used before starting to type individual cases.
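One plausible reason a background data set is needed is that a single case's expression values have to be placed on a common scale for the array platform before they can be classified. The sketch below shows that idea only. The function `scale_case`, the per-gene z-score scaling, and the data layout are assumptions for illustration and are not taken from the DAC software; the only detail drawn from the text is the minimum of 30 background cases.

```python
from statistics import mean, stdev

MIN_BACKGROUND_CASES = 30  # minimum stated in the text


def scale_case(case, background):
    """Scale one case's gene expression against a same-platform background.

    case: dict mapping gene name -> expression value for the new sample.
    background: list of dicts with the same genes, one per background case.
    Returns a dict of per-gene z-scores (an assumed scaling method).
    """
    if len(background) < MIN_BACKGROUND_CASES:
        raise ValueError(
            f"need at least {MIN_BACKGROUND_CASES} background cases, "
            f"got {len(background)}"
        )
    scaled = {}
    for gene, value in case.items():
        reference = [sample[gene] for sample in background]
        sd = stdev(reference)
        # Guard against genes with no variation in the background.
        scaled[gene] = 0.0 if sd == 0 else (value - mean(reference)) / sd
    return scaled


if __name__ == "__main__":
    # Synthetic background: 30 cases with a single hypothetical gene "G1".
    background = [{"G1": float(i)} for i in range(30)]
    print(scale_case({"G1": 20.0}, background))
```

The scaled values could then be passed to whichever classifiers are in use, so that an individual case is typed on the same footing as the background cohort.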

Taking molecular phenotyping from a research tool into clinical application has been slower than expected, mainly for technical reasons. DAC may represent an important step in bringing microarray analysis to the clinic by providing a widely applicable platform for prospectively allocating cases to ABC or GC subtypes. Because the gene set involved is relatively small, it is also potentially applicable to gene expression assessments derived from RT-PCR or NanoString platforms. Although the DAC diverges in some details of the algorithm from the original cell-of-origin classifier, the extensive analyses reported in this paper indicate that the classification choices it makes are fully consistent with the cell-of-origin paradigm, and the differences in survival between the major classes are greater with this tool than with any other. We now look forward to seeing the results of its use in prospective clinical trials.

1. Wright G, Tan B, Rosenwald A, et al. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B-cell lymphoma. Proc Natl Acad Sci USA. 2003;100:9991-9996.

Competing Interests

Dr. Johnson indicated no relevant conflicts of interest.