Metacell - Single-cell RNA Sequencing Analysis

Build Status Documentation Status

The metacell package implements the metacell algorithm for single-cell RNA sequencing (scRNA-seq) data analysis.

The original metacell algorithm was implemented in R. In contrast, the Python package described here implements an improved <https://todo/> version, which uses improved internal algorithms and supports much larger data sets (millions of cells).

Metacell Analysis

Naively, scRNA_seq data is a set of cell profiles, where for each one, for each gene, we get a count of the mRNA molecules that existed in the cell for that gene. This serves as an indicator of how “expressed” or “active” the gene is.

As in any real world technology, the raw data may suffer from technical artifacts (counting the molecules of two cells in one profile, counting the molecules from a ruptured cells, counting only the molecules from the cell nucleus, etc.). This requires pruning the raw data to exclude such artifacts. TODO: Create view should consider these issues.

The current technology scRNA-seq data is also very sparse (typically <10%, sometimes even <5% of the RNA molecules are counted). This introduces large sampling variance on top of the original signal, which itself contains significant inherent biological noise.

Analyzing scRNA-seq data therefore requires processing the profiles in bulk. Classically, this has been done by directly clustering the cells using various methods.

In contrast, the metacell approach groups together profiles of the “same” biological state into groups with the minimal number of profiles (~100) needed for computing robust statistics (in particular, mean gene expression). Each such group is a single “metacell”.

By summing profiles together, each metacell greatly reduces the sampling variance, and provides a more robust estimation of some transcription state. In particular, a metacell is not a cell type (multiple metacells may belong to the same type), and is not a parametric model of the cell state.

The metacells should therefore be further analyzed using additional methods to classify cell types, detect cell trajectories and/or lineage, build parametric models for cell behavior, etc. Using metacells as input for such analysis techniques should benefit both from the more robust, less noisy input; and from the (~100-fold) reduction in the number of profiles to analyze.

An obvious technique is to recursively group the metacells into larger groups, and investigate the resulting clustering hierarchy. Methods for further analysis of the data may choose to build upon this readily available hierarchy (which is also provided by this package).

Usage

Installation

Given Python version 3.6 and above, run pip install --user metacell (or, sudo pip install metacell if you have sudo privileges). This will install this package and two scripts: metacell_flow and metacell_call.

The metacell package is built using the DynaMake package. The metacell_flow script is simply the dynamake script, pre-loaded with the metacell build steps, and metacell_call is simply the dynacall script, pre-loaded with the metacell computation functions.

Importing

To use the package, create a directory somewhere with a large amount of available space (for ~2M cells, metacell required up to 200GB of disk space). If using a servers cluster, on some shared file system, and provide a DynaMake configuration file with an appropriate run_prefix for distributing jobs across the cluster; see the DynaMake documentation for details.

Create a raw sub-directory and download the raw_data into it. Then, run metacell_import to import the gene and cell data into the fast-access format used by the metacell package. This is unfortunately a slow process (in particular, the delaning step). Luckily, you only need to do this once.

This will create the following directories:

genes
Describes the genes used in the data.
cells
A COLLECTION profiles directory containing all the imported cells.

Grouping

TODO: Running metacell_group.

TODO: excluded_genes_names.txt, forbidden_genes_names.txt.

Zoning

TODO: Running metacell_zone.

Analysis

TODO: Running metacell_dash.

TODO: colors directory.

Directory Structure

The provided automated workflow uses the following directory structure. The specifics of each directory type are provided further below.

ROOT

A directory which is typically named after the data source (PBMC, MOCA, HCA, etc.). Will contain:

raw

A directory containing the raw data from the data source. Should contain one or more:

name
A directory containing 10X or MOCA data.
name.h5
An HDF file containing HCA data.
std

The first processing step will convert the raw data to the standard input format. The result will contain:

name
A directory in the standard (STD) format ready to be delaned.

Creating the standard format directory is pretty fast. It typically requires either a simple copy or applying something like sed to fix minor formatting issues. For HCA files, it is just a matter of dumping the data into a textual format.

delaned

The next processing step will split the input into distinct lanes. The result will contain:

name

A directory for each of the standard format directories. This would be either a standard format directory, or will contain:

lane
A directory in the standard format containing just the data for a single lane.

When data contains multiple lanes, delaning is a painfully slow process as it needs to split the matrix market file, processing one line at a time (without any parallelization). It would have been much easier if the raw data was already split into lanes.

genes

A directory containing fast access per-gene data (GENES) extracted from the delaned data.

The extraction process first verifies this data is identical for all the delaned directories. Then one of these directories is used to actually extract the data. This is very fast, as there are only ~30,000 genes to deal with.

cells

A directory containing fast access per-gene-per-profile (COLLECTION) data. Will contain:

name

A directory for each of the delaned directories. This would either be a simple profiles batch (BATCH), or a collection combining:

lane
A profiles batch directory containing the data for a single lane.

Conversion of the (delaned) textual data to the fast-access format isn’t very fast, but this is offset by being able to run this conversion in parallel for each lane. It is therefore much less painful than delaning. Once conversion is done, the import process is complete and does not need to be repeated.

views

This is where the metacells package performs its analysis, recursively grouping the cells into metacells. Should contain:

name

Some view profiles container (VIEW) based on the full cells profiles collection, to group. Typically such a view will exclude profiles with too-low total UMIs count, or that suffer from other technical issues (e.g. doublets). Genes may also be excluded.

The main function of the metacell package is to recursively compute the groups for such a view (that is, extend it to become a GROUPED directory).

Raw Data

The formats which can be used as the raw input for metacell processing:

10X

A directory containing 10X Genomics data. Should contain the files:

matrix.mtx
Matrix market UMIs data.
barcodes.tsv
The barcode of each profile (possibly with a -lane suffix).
genes.tsv
The ensembl ID and name of each gene.
MOCA

A directory containing Mouse Organogenesis Cell Atlas data. Should contain the files:

gene_count.txt
Matrix market UMIs data.
cell_annotate.csv
Per-cell data (see list of expected fields in metacell/storage/moca_format.yaml).
gene_annotate.csv
Per-gene data (see list of expected fields in metacell/storage/moca_format.yaml).
HCA
A file containing Human Cell Atlas in HDF format with a .h5` suffix. Must contain a single top-level key with the sub-keys barcodes, indptr, genes, indices and data.

These formats are converted to a standard input format:

STD

A directory containing data in a standard metacell input format. Should contain the files:

umis.mtx
Matrix market UMIs data.
profiles.csv
Per-profile data.
genes.csv
Per-gene data.
format.yaml
A description of the fields of the CSV files.

Fast Access Data

Formats of data allowing fast access during the metacell processing.

GENES

A directory containing per-gene data, accessed using the metacell.storage.genes.Genes class. Should contain the files:

genes.yaml
General meta-data describing this set of genes. Should contain the keys organism, genes_count and uuid, which must be equal to the md5sum of the genes.name.txt file. This UUID uniquely identifies this set of genes.
genes.*.txt
Per-gene string data. These should include genes.name.txt containing the (sorted, all-upper-case) gene names, and genes.ensembl.txt containing the ensembl identifier of each gene.
genes.*.npy
Per-gene numeric data, if any.
CONTAINER

Either a BATCH, COLLECTION, or VIEW directory, providing access to per-gene-per-profile data for some genes and profiles using the metacell.storage.profiles.ProfilesContainer class.

Each container directory contains at least the file:

profiles.yaml

General meta-data describing this container of profiles. Should contain at least the keys:

organism
The organism whose cells were sequenced (e.g. human or mouse).
genes_uuid
The UUID of the genes per profile.
profiles_count, genes_count
The number of profiles, and number of genes per profiles.
profiles_kind
The kind of profiles (cells, metacells, clans, tribes, nations, hordes).
uuid
Should be equal to the md5sum of the matrix.umis.mmm or matrix.umis.npy file. This UUID uniquely identifies this profiles batch.
metadata.yaml
Additional arbitrary meta-data collected for this container profile over time.

Additional keys may be specified depending on the container type.

BATCH

A directory containing per-gene-per-profile data for some closely related profiles (a single batch), accessed using the metacell.storage.profiles.ProfilesBatch class. Barcodes can be safely assumed to be unique within each batch, but not between multiple batches. When investigating batch effects, this is the smallest group of profiles for which these effects can be assumed to be the same. Should contain the additional files:

gp.*.mmm, gp.*.npy

Per-gene-per-profile data. The mmm format is a metacell-specific compressed sparse data format accessed using the metacell.matrices.MemoryMappedMatrix class. Each batch should contain gp.umis.mmm or gp.umis.npy.

Additional per-gene-per-profile data may exist, for example:

gp.log_fold.npy
The log-fold-factor of the gene’s UMIs fraction (out of the profile’s total), compared to the gene’s median UMIs fraction (in all the profiles). This serves as an estimate of how strong a marker the gene is for this specific profile. This should not be used for single cells.
p.*.txt

Per-profile string data. Common data includes:

p.barcode.txt
The unique (in this batch) per-profile barcode (without any lane suffix).
p.name.txt
The unique (in this batch) per-profile string name, if any.
p.batch.txt, p.lane.txt
The name of the batch and the name of the lane (if any) per profile. These are the same for all profiles in the batch, but having these files is useful when collecting multiple batches to a single data set (see below).
p.*.npy

Per-profile numeric data. Common data includes:

p.total_umis.npy
The sum of the genes UMIs for each profile.
p.max_log_fold.npy
The maximal fold factor of a gene in this profile.
g.*.txt, g.*.npy

Per-gene string and numeric data, specific for this profiles batch, if any. Common data includes:

g.median_fraction.npy
The median fraction of the gene’s UMIs out of the profile’s UMIs. This is used as an estimate of the global gene expression. This should not be used for single cells.
g.min_rank_log_fold.npy
The minimal rank of the gene’s fold factor in some profile. That is, the lower the number, the stronger the marker the gene is for distinguishing between profiles. This is used to filter genes in heatmap visualizations. This should not be used for single cells.
VIEW

A directory containing a possibly restricted view of another profiles container accessed using the metacell.storage.profiles.ProfilesView class. The profiles.yaml file should contain the additional keys:

base_uuid
The UUID of the base profiles container this is a view of.
base_path
The path of the base profiles container this is a view of. If this is a relative path, it is relative to the view directory.

The directory may contain the additional file(s):

p.base_index.npy
If exist, contain the sorted indices of the base container profiles which are included in this view. Otherwise, the view contains all the base container profiles.
g.base_index.npy
If exist, contain the sorted indices of the base container genes which are included in this view. Otherwise, the view contains all the base container genes.

When accessing data of a view, if there exists a local file containing the data, it is used. Otherwise, if possible, the data of the base container is fetched and sliced to include only the relevant subset of profiles and/or genes. By default, data is assumed to be slicable. Non-slicable data is specified in metacell.storage.profiles.ProfilesView.GENES_DEPENDENT_DATA and metacell.storage.profiles.ProfilesView.PROFILES_DEPENDENT_DATA.

COLLECTION

A directory combining a collection of other profiles container directories into a single unified profiles container, accessed using the metacell.storage.profiles.ProfilesCollection class. The profiles.yaml file should contain the additional key:

combined

A list of mapping, one per combined profile container directories. Should contain the keys:

name
The name of a sub-directory of the collection directory, containing the profiles container which is combined into the overall collection.
uuid
The UUID identifying the combined profiles container.
profiles_count
The number of profiles in the combined profiles container.

For each listed name, a sub-directory with that name must exist, and contain a profiles container. When accessing data of a collection, if there exists a local file containing the data, it is used. Otherwise (except for per-gene data), the data is combined from the data of all the collected sub-containers.

Grouped Data

The primary function of the metacell package is to group profiles together, in particular, to group cell profiles into metacells, metacells into clans, etc.

When groups are computed, the following files are created:

p.group_index.npy
The group index for each of the container’s profiles. A negative value indicates the profile belongs to no group (is an “outlier”).
grouped
A profiles VIEW containing just the profiles with a non-negative group index, that is, only the non-outlier profiles.
outliers
A profiles VIEW containing just the outlier profiles (which are not included in any of the groups). Typically this would only hold p.base_index.npy in addition to the profiles.yaml.
groups
A GROUPS profiles container for the computed groups.
zones
A hierarchy of ZONE (view) containers based on the computed groups.
tmp

Temporary files used to compute the groupings, but need not be used once it has been computed. The contents of this directory depends on whether we are grouping a few profiles or many profiles.

If grouping a small number of profiles:

selected
A view of the profiles container, including only the selected (“feature”) genes.

If grouping a large number of profiles, we can’t directly compute the groups, since the basic algorithm has a complexity of O(N^2). Computation therefore must works through a divide-and-conquer algorithm. This requires different temporary files:

phase-1

A directory containing the second phase’s files.

pile-index
A profiles view of a random subset of a restricted number of to-be-grouped profiles. To help in sorting file names, the indices are zero-padded so all have the same width (e.g. 00, .., 10, 11, ...). Each such pile is independently (directly) grouped.
outliers-pile

This directory is a profiles view containing all profiles that were marked as an outlier in some indexed pile. It is also independently grouped. Unlike the indexed piles, this view may contain a large number of profiles so grouping it may require an intermediate step of its own.

Rare “types” do not have enough profiles in each of the piles to form groups, so will be marked as outliers. In this outliers-only view, there might be enough of these rare profiles to form a proper group.

When grouping the profiles in the final first phase outliers pile (small enough to be directly grouped), we disable outliers detection. That is, we insist that each of the profiles be assigned to some group by the end of the first phase, as these groups are used to compute the piles for the second phase and we’d like to give each profile a chance to be grouped in this second phase.

grouped
A view of the grouped profiles container, with p.group_index.npy containing the combined group indices from all the phase-1 piles.
groups

The intermediate grouping combines the groups from all the phase-1 piles, as well as the groups from the phase-1 outliers. This is the less-accurate grouping we will improve upon. To allow doing so, we compute the groups-of-groups (that is, we compute p.group_index.npy here), and use these to construct the second phase piles.

Similarly to grouping the outliers pile, we disable outlier detection when grouping the final pile, so that each group will be assigned to one of the second phase piles.

phase-2

A directory containing the second phase’s files. This is similar to the first phase’s structure, with the following differences:

  • The profiles contained in each pile are not chosen randomly. Instead, they are the profiles contained in one of the computed intermediate groups-of-groups.
  • The outliers pile contains the phase-1 outliers in addition to the outliers of the phase-2 indexed piles. This allows very rare profile “types” a second chance at being grouped.
  • There are no grouped or groups directories; the results are written directly to the original (parent) grouped profiles container. These final groups are recursively grouped as long as there is a large number of profiles, as described above.
GROUPS

A profiles container describing groups of profiles. The gp.umis.npy per-profile-per-gene UMIs count in this container will contain the sum of the per-profile-per-gene of each of the grouped profiles contained in each (group) profile.

A groups profiles container has identical format to a normal profiles batch directory, except that its profiles.yaml file contains a grouped_profiles key with the (relative) path to the grouped profiles container (typically ..), and a grouped_uuid key with the UUID of the grouped profiles container. For convenience, mean_grouped_count and mean_cells_count provide an estimate of the number of cells and profiles in each group.

The profiles_kind of the groups container should be based on the grouped container profiles kind. In particular, groups of cells are called metacells. Such groups are expected to contain cells of the “same” state.

When grouping a large number of profiles, the groups directory might itself be grouped. This recursion continues as long as the total number of profiles is “large”. Groups of metacells are called clans, groups of clans are called tribes, groups of tribes are called nations, and groups of nations are called hordes. All higher-level groups contain profiles with a “similar”, but not necessarily the “same” state.

Finally, a groups container will hold the files p.grouped_count.npy with the number of profiles summed into each group, and p.cells_count.npy with the number of cells summed into each group. These two numbers are the same for metacells, but differ for higher level groups.

ZONE

A tree directory is used to provide convenient views for exploring the hierarchy created by the computed groups. In each level of the tree, in addition to the groups and grouped containers, there is also a zones directory that contains views that group the overall profiles into groups-of-groups.

The directory structure for a small amount of cells would contain:

zones
    groups  # All the clans (groups of metacells)
        ...
    grouped  # All the metacells (grouped into clans).
        ...
    clan-0  # The 1st clan
        groups  # All the metacells in the 1st clan
            ...
        grouped  # All the cells in the 1st clan
            ...
    clan-1  # The 2nd clan
        groups  # All the metacells in the 2nd clan
            ...
        grouped  # All the cells in the 2nd clan
            ...
        ...
    ...

And a directory structure for a larger number (several hundred of thousands) of cells would contain:

zones
    groups  # All the tribess (groups of clans)
        ...
    grouped  # All the clans (grouped into tribes).
        ...
    tribe-0  # The 1st tribe
        groups  # All the clans in the 1st tribe
            ...
        grouped  # All the metacells in the 1st tribe
            ...
        clan-000  # The 1st clan in the 1st tribe
            groups  # All the metacells in the 1st clan
                ...
            grouped  # All the cells in the 1st clan
                ...
        clan-001  # The 2nd clan in the 1st tribe
            ...
        ...
    tribe-1  # The 2nd tribe
        groups  # All the clans in the 2nd tribe
            ...
        grouped  # All the metacells in the 2nd tribe
            ...
        clan  # The 1st clan in the 2nd tribe
            groups  # All the metacells in the 1st clan
                ...
            grouped  # All the cells in the 1st clan
                ...
        clan-091  # The 2nd clan in the 2nd tribe
            ...
        ...
    ...

A much larger number (tens of millions) of cells would include an additional hierarchy level for each nation. Even larges number of cells would (up to a few billion) would require an additional level for each horde. Currently the largest data sets are at most of a few million so this scheme should be sufficient for the foreseeable future.

In all cases, the grouped directory contains a p.group_index.npy which specifies the index of the group in the adjacent groups directory.