publications
2023
- [Semant. Web] Terminology and ontology development for semantic annotation: A use case on sepsis and adverse events. Melissa Y Yan, Lise Tuset Gustad, Lise Husby Høvik, and Øystein Nytrø. Semantic Web, May 2023
Annotations enrich text corpora and provide necessary labels for natural language processing studies. To reason and infer underlying implicit knowledge captured by labels, an ontology is needed to provide a semantically annotated corpus with structured domain knowledge. Utilizing a corpus of adverse event documents annotated for sepsis-related signs and symptoms as a use case, this paper details how a terminology and corresponding ontology were developed. The Annotated Adverse Event NOte TErminology (AAENOTE) represents annotated documents and assists annotators in annotating text. In contrast, the complementary Catheter Infection Indications Ontology (CIIO) is intended for clinician use and captures domain knowledge needed to reason and infer implicit information from data. The approach taken makes ontology development understandable and accessible to domain experts without formal ontology training.
- [ClinicalNLP@ACL] Method for designing semantic annotation of sepsis signs in clinical text. Melissa Y Yan, Lise Tuset Gustad, Lise Husby Høvik, and Øystein Nytrø. In Proceedings of the 5th Clinical Natural Language Processing Workshop, May 2023
Annotated clinical text corpora are essential for machine learning studies that model and predict care processes and disease progression. However, few studies describe the necessary experimental design of the annotation guideline and annotation phases. This makes replication, reuse, and adoption challenging. Using clinical questions about sepsis, we designed a semantic annotation guideline to capture sepsis signs from clinical text. The clinical questions aid guideline design, application, and evaluation. Our method incrementally evaluates each change in the guideline by testing the resulting annotated corpus using clinical questions. Additionally, our method uses inter-annotator agreement to judge the annotator compliance and quality of the guideline. We show that the method, combined with controlled design increments, is simple and allows the development and measurable improvement of a purpose-built semantic annotation guideline. We believe that our approach is useful for incremental design of semantic annotation guidelines in general.
2022
- [Development] Molecular contribution to embryonic aneuploidy and karyotypic complexity in initial cleavage divisions of mammalian development. Kelsey E Brooks, Brittany L Daughtry, Brett Davis, Melissa Y Yan, Suzanne S Fei, Selma Shepherd, Lucia Carbone, and Shawn L Chavez. Development, Apr 2022
Embryonic aneuploidy is highly complex, often leading to developmental arrest, implantation failure or spontaneous miscarriage in both natural and assisted reproduction. Despite our knowledge of mitotic mis-segregation in somatic cells, the molecular pathways regulating chromosome fidelity during the error-prone cleavage-stage of mammalian embryogenesis remain largely undefined. Using bovine embryos and live-cell fluorescent imaging, we observed frequent micro-/multi-nucleation of mis-segregated chromosomes in initial mitotic divisions that underwent unilateral inheritance, re-fused with the primary nucleus or formed a chromatin bridge with neighboring cells. A correlation between a lack of syngamy, multipolar divisions and asymmetric genome partitioning was also revealed, and single-cell DNA-seq showed propagation of primarily non-reciprocal mitotic errors. Depletion of the mitotic checkpoint protein BUB1B (also known as BUBR1) resulted in similarly abnormal nuclear structures and cell divisions, as well as chaotic aneuploidy and dysregulation of the kinase-substrate network that mediates mitotic progression, all before zygotic genome activation. This demonstrates that embryonic micronuclei sustain multiple fates, provides an explanation for blastomeres with uniparental origins, and substantiates defective checkpoints and likely other maternally derived factors as major contributors to the karyotypic complexity afflicting mammalian preimplantation development.
- [JAMIA] Sepsis prediction, early detection, and identification using clinical text for machine learning: a systematic review. Melissa Y Yan, Lise Tuset Gustad, and Øystein Nytrø. Journal of the American Medical Informatics Association, Jan 2022
OBJECTIVE: To determine the effects of using unstructured clinical text in machine learning (ML) for prediction, early detection, and identification of sepsis. MATERIALS AND METHODS: PubMed, Scopus, ACM DL, dblp, and IEEE Xplore databases were searched. Articles utilizing clinical text for ML or natural language processing (NLP) to detect, identify, recognize, diagnose, or predict the onset, development, progress, or prognosis of systemic inflammatory response syndrome, sepsis, severe sepsis, or septic shock were included. Sepsis definition, dataset, types of data, ML models, NLP techniques, and evaluation metrics were extracted. RESULTS: The clinical text used in models include narrative notes written by nurses, physicians, and specialists in varying situations. This is often combined with common structured data such as demographics, vital signs, laboratory data, and medications. Area under the receiver operating characteristic curve (AUC) comparison of ML methods showed that utilizing both text and structured data predicts sepsis earlier and more accurately than structured data alone. No meta-analysis was performed because of incomparable measurements among the 9 included studies. DISCUSSION: Studies focused on sepsis identification or early detection before onset; no studies used patient histories beyond the current episode of care to predict sepsis. Sepsis definition affects reporting methods, outcomes, and results. Many methods rely on continuous vital sign measurements in intensive care, making them not easily transferable to general ward units. CONCLUSIONS: Approaches were heterogeneous, but studies showed that utilizing both unstructured text and structured data in ML can improve identification and early detection of sepsis.
2021
- [IEEE BIBM] Understanding and Reasoning About Early Signs of Sepsis: From Annotation Guideline to Ontology. Melissa Y Yan, Lise Husby Høvik, Lise Tuset Gustad, and Øystein Nytrø. In IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021, Houston, TX, USA, December 9-12, 2021, Jan 2021
In the clinical domain, patient states such as sepsis due to bloodstream infection (BSI) result in observable symptoms and signs used to determine diagnosis and treatment, all of which often is documented in electronic health records. However, clinical text is brief and implicit, making it challenging to infer patient conditions by reasoning tasks and supervised machine learning. To study sepsis-related BSIs, we developed an ontology from an annotation guideline and annotated corpus that empirically captures BSIs from adverse event notes containing procedural deviations, guideline deviations, and unwanted incidents that can bring harm to patients. The resulting ontology represents (1) the physical patient state, clinical observations, and clinical documentation, and (2) background clinical knowledge for artificial intelligence, reasoning, and machine learning.
- [IEEE BIBM] Preliminary Processing and Analysis of an Adverse Event Dataset for Detecting Sepsis-Related Events. Melissa Y Yan, Lise Husby Høvik, André Pedersen, Lise Tuset Gustad, and Øystein Nytrø. In IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2021, Houston, TX, USA, December 9-12, 2021, Jan 2021
Adverse event (AE) reports contain notes detailing procedural and guideline deviations, and unwanted incidents that can bring harm to patients. Available datasets mainly focus on vigilance or post-market surveillance of adverse drug reactions or medical device failures. The lack of clinical-related AE datasets makes it challenging to study healthcare-related AEs. AEs affect 10% of hospitalized patients, and almost half are preventable. Having an AE dataset can assist in identifying possible patient safety interventions and performing quality surveillance to lower AE rates. The free-text notes can provide insight into the cause of incidents and lead to better patient care. The objective of this study is to introduce a Norwegian AE dataset and present preliminary processing and analysis for sepsis-related events, specifically peripheral intravenous catheter-related bloodstream infections. Therefore, the methods focus on performing a domain analysis to prepare and better understand the data through screening, generating synthetic free-text notes, and annotating notes.
2019
- [Bioinformatics] VariantQC: a visual quality control report for variant evaluation. Melissa Y Yan, B Ferguson, and B Bimber. Bioinformatics, Jul 2019
SUMMARY: Large scale genomic studies produce millions of sequence variants, generating datasets far too massive for manual inspection. To ensure variant and genotype data are consistent and accurate, it is necessary to evaluate variants prior to downstream analysis using quality control (QC) reports. Variant call format (VCF) files are the standard format for representing variant data; however, generating summary statistics from these files is not always straightforward. While tools to summarize variant data exist, they generally produce simple text file tables, which still require additional processing and interpretation. VariantQC fills this gap as a user friendly, interactive visual QC report that generates and concisely summarizes statistics from VCF files. The report aggregates and summarizes variants by dataset, chromosome, sample, and filter type. The VariantQC report is useful for high-level dataset summary, quality control, and helps flag outliers. Furthermore, VariantQC operates on VCF files, so it can be easily integrated into many existing variant pipelines. AVAILABILITY: DISCVRSeq’s VariantQC tool is freely available as a Java program, with the compiled JAR and source code available from https://github.com/BimberLab/DISCVRSeq/. Documentation and example reports are available at https://bimberlab.github.io/DISCVRSeq/.
- [BMC Genom.] mGAP: the macaque genotype and phenotype resource, a framework for accessing and interpreting macaque variant data, and identifying new models of human disease. Benjamin N Bimber, Melissa Y Yan, Samuel M Peterson, and Betsy Ferguson. BMC Genomics, Mar 2019
BACKGROUND: Non-human primates (NHPs), particularly macaques, serve as critical and highly relevant pre-clinical models of human disease. The similarity in human and macaque natural disease susceptibility, along with parallel genetic risk alleles, underscores the value of macaques in the development of effective treatment strategies. Nonetheless, there are limited genomic resources available to support the exploration and discovery of macaque models of inherited disease. Notably, there are few public databases tailored to searching NHP sequence variants, and no other database making use of centralized variant calling, or providing genotype-level data and predicted pathogenic effects for each variant. RESULTS: The macaque Genotype And Phenotype (mGAP) resource is the first public website providing searchable, annotated macaque variant data. The mGAP resource includes a catalog of high confidence variants, derived from whole genome sequence (WGS). The current mGAP release at time of publication (1.7) contains 17,087,212 variants based on the sequence analysis of 293 rhesus macaques. A custom pipeline was developed to enable annotation of the macaque variants, leveraging human data sources that include regulatory elements (ENCODE, RegulomeDB), known disease- or phenotype-associated variants (GRASP), predicted impact (SIFT, PolyPhen2), and sequence conservation (Phylop, PhastCons). Currently mGAP includes 2767 variants that are identical to alleles listed in the human ClinVar database, of which 276 variants, spanning 258 genes, are identified as pathogenic. An additional 12,472 variants are predicted as high impact (SnpEff) and 13,129 are predicted as damaging (PolyPhen2). In total, these variants are predicted to be associated with more than 2000 human disease or phenotype entries reported in OMIM (Online Mendelian Inheritance in Man). Importantly, mGAP also provides genotype-level data for all subjects, allowing identification of specific individuals harboring alleles of interest. CONCLUSIONS: The mGAP resource provides variant and genotype data from hundreds of rhesus macaques, processed in a consistent manner across all subjects (https://mgap.ohsu.edu). Together with the extensive variant annotations, mGAP presents unprecedented opportunity to investigate potential genetic associations with currently characterized disease models, and to uncover new macaque models based on parallels with human risk alleles.
- [Genome Res.] Single-cell sequencing of primate preimplantation embryos reveals chromosome elimination via cellular fragmentation and blastomere exclusion. Brittany L Daughtry, Jimi L Rosenkrantz, Nathan H Lazar, Suzanne S Fei, Nash Redmayne, Kristof A Torkenczy, Andrew Adey, Melissa Yan, and 5 more authors. Genome Research, Mar 2019
Aneuploidy that arises during meiosis and/or mitosis is a major contributor to early embryo loss. We previously showed that human preimplantation embryos encapsulate missegregated chromosomes into micronuclei while undergoing cellular fragmentation and that fragments can contain chromosomal material, but the source of this DNA was unknown. Here, we leveraged the use of a nonhuman primate model and single-cell DNA-sequencing (scDNA-seq) to examine the chromosomal content of 471 individual samples comprising 254 blastomeres, 42 polar bodies, and 175 cellular fragments from a large number (N = 50) of disassembled rhesus cleavage-stage embryos. Our analysis revealed that the aneuploidy and micronucleation frequency is conserved between humans and macaques, and that fragments encapsulate whole and/or partial chromosomes lost from blastomeres. Single-cell/fragment genotyping showed that these chromosome-containing cellular fragments (CCFs) can be maternally or paternally derived and display double-stranded DNA breaks. DNA breakage was further indicated by reciprocal subchromosomal losses/gains between blastomeres and large segmental errors primarily detected at the terminal ends of chromosomes. By combining time-lapse imaging with scDNA-seq, we determined that multipolar divisions at the zygote or two-cell stage were associated with CCFs and generated a random mixture of chromosomally normal and abnormal blastomeres with uniparental or biparental origins. Despite frequent chromosome missegregation at the cleavage-stage, we show that CCFs and nondividing aneuploid blastomeres showing extensive DNA damage are prevented from incorporation into blastocysts. These findings suggest that embryos respond to chromosomal errors by encapsulation into micronuclei, elimination via cellular fragmentation, and selection against highly aneuploid blastomeres to overcome chromosome instability during preimplantation development.