.. _RepertoireSchema:

Repertoire Schema
=============================

A ``Repertoire`` is an abstract organizational unit of analysis that
is defined by the researcher and consists of study metadata, subject
metadata, sample metadata, cell processing metadata, nucleic acid
processing metadata, sequencing run metadata, a set of raw sequence
files, data processing metadata, and a set of ``Rearrangements``. A
``Repertoire`` gathers all of this information together into a
composite object, which can be easily accessed by computer programs
for data entry, analysis and visualization.

A ``Repertoire`` is specific to a single subject; otherwise, it can
consist of any number of samples (which can be processed in different
ways), any number of raw sequence files, and any number of
rearrangements. It can also consist of any number of data processing
metadata objects that describe the processing of raw sequence files
into ``Rearrangements``.

Typically, a ``Repertoire`` corresponds to the biological concept of
the immune repertoire for that single subject, which the researcher
experimentally measures and computationally analyzes. However,
researchers can have different interpretations about what constitutes
the biological immune repertoire; therefore, the ``Repertoire`` schema
attempts to be flexible and broadly useful for all AIRR-seq studies.

Another researcher can take the same raw sequencing data and
associated metadata and create their own ``Repertoire`` that is
different from the original researcher's. A common example is to
define a repertoire that is a subset such as "productive
rearrangements for IGHV4" whereas the original researcher defined a
more generic "B cell repertoire".
This new ``Repertoire`` would have much of the same metadata as the
original ``Repertoire``, except associated with a different study, and
with additional information in the data processing metadata that
describes how the rearrangements were filtered down to just the
"productive rearrangements for IGHV4".

Likewise, another researcher may get access to the original biosample
material and perform their own sample processing and sequencing, which
also would be a new ``Repertoire``. That new ``Repertoire`` could
combine samples from the original researcher's ``Repertoire`` with the
new sample data as a large dataset for the subject.

Multiple Data Processing on a Repertoire
--------------------------------------------------------------------------------

Data processing can be a complicated multi-stage process. Documenting
the process in a formal way is challenging because of the diversity of
actions that may be performed. The MiAIRR standard requires
documentation of the process but in an informal way with free text
descriptions.

A ``Repertoire`` might undergo several different rounds of data
processing for any number of reasons, e.g. to compare the results from
different toolchains, or to compare different settings for the same
toolchain.

It is expected that all of the ``Samples`` of a ``Repertoire`` will be
processed together within a ``DataProcessing``. That is, a
``DataProcessing`` that uses only some of the samples in a
``Repertoire`` could be confusing to users and appear as though data
is missing. Likewise, processing some samples within a ``Repertoire``
with one ``DataProcessing`` and the remaining samples with a different
``DataProcessing`` could also confuse users. Because
``DataProcessing`` is unstructured information, it is not possible to
validate that all ``Samples`` in a ``Repertoire`` are being processed
together, so this expectation cannot be strictly enforced.

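
Comparing toolchains or settings in this way leaves the same
``Repertoire`` with several result sets. The following is a minimal
sketch of keeping them apart, assuming each rearrangement record is a
plain dictionary carrying ``repertoire_id`` and ``data_processing_id``
keys. The helper name ``group_by_processing`` and all identifier values
are hypothetical and are not part of the schema or of any AIRR library.

.. code-block:: python

    from collections import defaultdict


    def group_by_processing(rearrangements):
        """Partition records by (repertoire_id, data_processing_id)."""
        groups = defaultdict(list)
        for record in rearrangements:
            # data_processing_id may be absent when only one processing exists.
            key = (record["repertoire_id"], record.get("data_processing_id"))
            groups[key].append(record)
        return dict(groups)


    # Hypothetical records: one repertoire processed with two toolchains.
    rearrangements = [
        {"sequence_id": "seq1", "repertoire_id": "rep-1",
         "data_processing_id": "toolchain-a"},
        {"sequence_id": "seq2", "repertoire_id": "rep-1",
         "data_processing_id": "toolchain-b"},
        {"sequence_id": "seq3", "repertoire_id": "rep-1",
         "data_processing_id": "toolchain-a"},
    ]

    groups = group_by_processing(rearrangements)
    for key, records in sorted(groups.items()):
        print(key, [r["sequence_id"] for r in records])

Each group can then be analyzed on its own, so results from one
toolchain are never counted alongside results from another.
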
Having multiple ``DataProcessing`` for a ``Repertoire`` will create
multiple sets of ``Rearrangements`` that are distinct and separate
from each other. Analysis tools need to be careful not to mix these
sets of ``Rearrangements`` from different ``DataProcessing`` because
doing so can generate incorrect results. The identifier
``data_processing_id`` was added so ``Rearrangements`` can identify
their specific ``DataProcessing``.

Linking Data
--------------------------------------------------------------------------------

Each ``Repertoire`` has a unique ``repertoire_id`` identifier. This
identifier should be globally unique so that repertoires from multiple
studies can be combined together without conflict. The
``repertoire_id`` is used to link other AIRR data to a
``Repertoire``. Specifically, the :ref:`Rearrangements Schema `
includes ``repertoire_id`` for referencing the specific ``Repertoire``
for that ``Rearrangement``.

If a ``Repertoire`` has multiple ``DataProcessing`` then
``data_processing_id`` should be used to distinguish the appropriate
``DataProcessing`` within the ``Repertoire``. The ``Rearrangement``
object contains ``data_processing_id`` for this purpose. The
``data_processing_id`` is only unique within a ``Repertoire``, so
``repertoire_id`` should first be used to get the appropriate
``Repertoire`` object and then ``data_processing_id`` used to acquire
the appropriate ``DataProcessing``. It is expected that typical
``Repertoires`` might only have a single ``DataProcessing``, in which
case ``repertoire_id`` and ``data_processing_id`` will be semantically
equivalent and only the former should be used.

If a ``Repertoire`` has multiple sample processing objects in the
sample array then ``sample_processing_id`` should be used to
distinguish the appropriate sample processing object within the
``Repertoire``. The ``Rearrangement`` object can contain a
``sample_processing_id`` to uniquely identify a sample processing
object within a ``Repertoire``.
Like ``data_processing_id``, the ``sample_processing_id`` is only
unique within the ``Repertoire``, so ``repertoire_id`` should first be
used to get the appropriate ``Repertoire`` object and then
``sample_processing_id`` should be used to determine the appropriate
sample processing object that is associated with the
``Rearrangement``. If the ``Rearrangement`` object does not have a
``sample_processing_id`` then it can be assumed that the rearrangement
is associated with all of the samples in the ``Repertoire`` (e.g. the
rearrangement is a collapsed rearrangement across multiple samples).
It is expected that ``Repertoires`` might often have a single sample
processing object, in which case ``repertoire_id`` and
``sample_processing_id`` will be semantically equivalent and only the
former should be used.

Finally, if it is necessary to link a ``Rearrangement`` object with a
unique pairing of sample processing and ``DataProcessing``, the
``repertoire_id`` of the ``Rearrangement`` object should be used to
identify the correct ``Repertoire`` object. Within that
``Repertoire``, the ``data_processing_id`` then identifies the correct
``DataProcessing`` metadata and the ``sample_processing_id``
identifies the correct sample processing metadata.

Duality between Repertoires and Rearrangements
--------------------------------------------------------------------------------

There is an important duality relationship between ``Repertoires`` and
``Rearrangements``, specifically with the experimental protocols
described in the ``Repertoire`` versus the annotations on
``Rearrangements``. A ``Repertoire`` defines an experimental design
for what a researcher intends to measure or observe, while the
``Rearrangements`` are what was actually measured and observed.
Technically, the border between the two occurs at sequencing, that is,
when the biological physical entity (prepared DNA) is measured and
recorded as information (nucleotide sequence).

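
One practical use of this split is a consistency check that compares
what the ``Repertoire`` states was intended with what the
``Rearrangements`` report. The sketch below is illustrative only. It
assumes ``cell_subset`` is available as a plain string and that
rearrangement records are dictionaries with a ``locus`` key. The
mapping ``EXPECTED_LOCI`` and the helper ``unexpected_loci`` are
hypothetical simplifications and are not part of the schema.

.. code-block:: python

    # Hypothetical, simplified mapping from a cell subset label to the
    # loci one would expect to observe for it.
    EXPECTED_LOCI = {
        "B cell": {"IGH", "IGK", "IGL"},
        "T cell": {"TRA", "TRB", "TRD", "TRG"},
    }


    def unexpected_loci(cell_subset, rearrangements):
        """Return observed loci that the stated cell subset does not explain."""
        expected = EXPECTED_LOCI.get(cell_subset)
        if expected is None:
            # Unknown subset label: nothing to compare against.
            return set()
        observed = {r["locus"] for r in rearrangements if r.get("locus")}
        return observed - expected


    # Hypothetical data: the design says "T cell", yet one record is IGH.
    rearrangements = [{"locus": "IGH"}, {"locus": "TRB"}]
    conflicts = unexpected_loci("T cell", rearrangements)
    print(sorted(conflicts))  # ['IGH']

A non-empty result does not say which side is wrong. It only flags
that the intended design and the observed data disagree.
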
This duality is important when considering how to answer certain
questions. For example, ``locus`` for ``Rearrangements`` may have the
value "IGH", which indicates that B cell heavy chain receptors were
measured, yet the ``Repertoire`` might have "T cell" in
``cell_subset``, which indicates the researcher intended to measure T
cells. This conflict between the two indicates something is wrong.
Such conflicts can arise in many ways: for example, from errors in the
experimental protocol, or from data processing that handled the raw
sequencing data incorrectly and produced invalid annotations.

File Format Specification
-----------------------------

Files are YAML/JSON with a structure defined below. Files should be
encoded as UTF-8. Identifiers are case-sensitive. Files should have the
extension ``.yaml``, ``.yml``, or ``.json``.

File Structure
~~~~~~~~~~~~~~

+ The file as a whole is considered a dictionary (key/value pair)
  structure with the keys ``Info`` and ``Repertoire``.

+ The file can (optionally) contain an ``Info`` object, at the
  beginning of the file, based upon the ``Info`` schema in the OpenAPI
  V2 specification. If provided, ``version`` in ``Info`` should
  reference the version of the AIRR schema for the file.

+ The file should correspond to a list of ``Repertoire`` objects,
  using ``Repertoire`` as the key to the list.

+ Each ``Repertoire`` object should contain a top-level key/value pair
  for ``repertoire_id`` that uniquely identifies the repertoire.

+ Some fields require the use of a particular ontology or controlled
  vocabulary.

+ The structure is the same regardless of whether the data is stored
  in a file or a data repository. For example, the :ref:`ADC API `
  will return a properly structured JSON object that can be saved to a
  file and used directly without modification.

Repertoire Fields
------------------------------

:download:`Download as TSV <../_downloads/Repertoire.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in Repertoire_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _StudyFields:

Study Fields
------------------------------

:download:`Download as TSV <../_downloads/Study.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in Study_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _SubjectFields:

Subject Fields
------------------------------

:download:`Download as TSV <../_downloads/Subject.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in Subject_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _DiagnosisFields:

Diagnosis Fields
------------------------------

:download:`Download as TSV <../_downloads/Diagnosis.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in Diagnosis_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _SampleFields:

Sample Fields
------------------------------

:download:`Download as TSV <../_downloads/Sample.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in Sample_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _CellProcessingFields:

Tissue and Cell Processing Fields
---------------------------------

:download:`Download as TSV <../_downloads/CellProcessing.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in CellProcessing_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _NucleicAcidProcessingFields:

Nucleic Acid Processing Fields
---------------------------------

:download:`Download as TSV <../_downloads/NucleicAcidProcessing.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in NucleicAcidProcessing_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _PCRTargetFields:

PCR Target Locus Fields
---------------------------------

:download:`Download as TSV <../_downloads/PCRTarget.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in PCRTarget_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _RawSequenceDataFields:

Raw Sequence Data Fields
---------------------------------

:download:`Download as TSV <../_downloads/RawSequenceData.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in RawSequenceData_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _SequencingRunFields:

Sequencing Run Fields
---------------------------------

:download:`Download as TSV <../_downloads/SequencingRun.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in SequencingRun_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}

.. _DataProcessingFields:

Data Processing Fields
---------------------------------

:download:`Download as TSV <../_downloads/DataProcessing.tsv>`

.. list-table::
    :widths: 20, 15, 15, 50
    :header-rows: 1

    * - Name
      - Type
      - Attributes
      - Definition
    {%- for field in DataProcessing_schema %}
    * - ``{{ field.Name }}``
      - {{ field.Type }}
      - {{ field.Attributes }}
      - {{ field.Definition | trim }}
    {%- endfor %}
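
The ``data_processing_id`` of a ``DataProcessing`` object, together
with ``repertoire_id`` and ``sample_processing_id``, is what ties a
``Rearrangement`` back to its metadata. The following is a minimal
sketch of that lookup, assuming repertoires have been loaded into
plain dictionaries in which the sample processing objects sit under a
``sample`` key and the ``DataProcessing`` objects under a
``data_processing`` key. Both key names, the helper ``resolve_links``,
and all identifier values are assumptions made for illustration.

.. code-block:: python

    def resolve_links(rearrangement, repertoires):
        """Resolve a rearrangement to its Repertoire and processing metadata."""
        # Step 1: repertoire_id selects the Repertoire object.
        by_id = {rep["repertoire_id"]: rep for rep in repertoires}
        repertoire = by_id[rearrangement["repertoire_id"]]

        def pick(items, id_field):
            wanted = rearrangement.get(id_field)
            if not wanted:
                # No identifier on the rearrangement: it applies to every
                # entry (for example, a single DataProcessing, or a
                # rearrangement collapsed across all samples).
                return list(items)
            return [item for item in items if item.get(id_field) == wanted]

        # Step 2: the other identifiers are only unique within the Repertoire.
        processing = pick(repertoire.get("data_processing", []),
                          "data_processing_id")
        samples = pick(repertoire.get("sample", []), "sample_processing_id")
        return repertoire, processing, samples


    # Hypothetical repertoire with two samples and two processing runs.
    repertoires = [{
        "repertoire_id": "rep-1",
        "sample": [{"sample_processing_id": "sp-1"},
                   {"sample_processing_id": "sp-2"}],
        "data_processing": [{"data_processing_id": "dp-1"},
                            {"data_processing_id": "dp-2"}],
    }]
    rearrangement = {"repertoire_id": "rep-1", "data_processing_id": "dp-2"}

    repertoire, processing, samples = resolve_links(rearrangement, repertoires)
    print([p["data_processing_id"] for p in processing])  # ['dp-2']
    print([s["sample_processing_id"] for s in samples])   # ['sp-1', 'sp-2']

The lookup always goes through ``repertoire_id`` first because the
other two identifiers are not unique across repertoires.
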