Curation-oriented Recognition and Retrieval from Biomedical Literature
Institute of Computer Science and Information Engineering
Biomedical text mining
Biomedical term synonymity
Curatable biomedical term
With the rapid growth of biomedical literature, there is an increasing need to integrate text mining and machine learning into biological databases. Many databases collect resources on specific topics, such as experimental data and research literature. However, processing unstructured data through text mining is a complex and dynamic area that interests several disciplines (e.g., chemistry, biology, and computer science). To automatically extract knowledge from text and effectively confirm the knowledge recorded in biological databases, biomedical named-entity recognition (NER) and document triage are considered among the more challenging tasks. This dissertation therefore focuses on these two major topics.
Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms. In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures and the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms.
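The mutual reinforcement between lexical patterns and their containers can be illustrated with a minimal sketch. The abstract does not give the update rules of ReLPR, so the code below assumes a HITS-style iteration on a bipartite graph: a pattern scores highly when it occurs in highly scored containers, and a container scores highly when it holds highly scored patterns. The function name, the container identifiers, and the example patterns are hypothetical and are not taken from the dissertation.

```python
import math


def mutual_reinforcement(containers, iterations=50):
    """Rank patterns and containers by HITS-style mutual reinforcement.

    containers: dict mapping a container id (e.g. a snippet) to the set of
    lexical patterns found in it. This update rule is an assumption, not
    the published ReLPR formulation.
    """
    patterns = {p for pats in containers.values() for p in pats}
    pattern_score = {p: 1.0 for p in patterns}
    container_score = {c: 1.0 for c in containers}

    for _ in range(iterations):
        # A pattern is reinforced by the containers it appears in.
        new_pattern = {p: 0.0 for p in patterns}
        for c, pats in containers.items():
            for p in pats:
                new_pattern[p] += container_score[c]
        norm = math.sqrt(sum(v * v for v in new_pattern.values())) or 1.0
        pattern_score = {p: v / norm for p, v in new_pattern.items()}

        # A container is reinforced by the patterns it holds.
        new_container = {
            c: sum(pattern_score[p] for p in pats)
            for c, pats in containers.items()
        }
        norm = math.sqrt(sum(v * v for v in new_container.values())) or 1.0
        container_score = {c: v / norm for c, v in new_container.items()}

    return pattern_score, container_score


# Hypothetical snippets and the patterns extracted from them.
example_containers = {
    "snippet_1": {"X also known as Y", "X ( Y )"},
    "snippet_2": {"X also known as Y"},
    "snippet_3": {"X and Y"},
}
pattern_scores, container_scores = mutual_reinforcement(example_containers)
ranked_patterns = sorted(pattern_scores, key=pattern_scores.get, reverse=True)
```

In this toy input, the pattern shared by two snippets ends up ranked first, which is the behaviour one would expect from a mutually reinforcing scheme; how the resulting scores feed into the final relatedness measure is not specified here.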
NER plays an important role in the development of biological databases. However, existing NER tools produce a wide variety of named entities, which may include both curatable and non-curatable mentions. Classifying curatable named entities offers a straightforward way to accelerate the biocuration workflow. Co-occurrence Interaction Nexus with Named-entity Recognition (CoINNER) is a web-based tool that allows users to identify gene, chemical, disease, and action term mentions for the Comparative Toxicogenomics Database (CTD). CoINNER extends our previous system. Its pre-tagging results are based on the state-of-the-art named-entity recognition tools from BioCreative III. Next, a method based on conditional random fields (CRFs) is proposed to predict chemical and disease mentions in the articles. Finally, action term mentions are collected by latent Dirichlet allocation (LDA). The results of CoINNER were significantly superior to those of previous methods.
In recent years, the number of medical articles has increased rapidly, and the number of articles in PubMed has grown exponentially. The workload of biocurators has grown with it. Under these circumstances, a system that can automatically determine in advance which articles have a higher priority for curation can effectively reduce the workload of biocurators. Finding the articles that biocurators need has therefore become an important task, known as the Article Classification Task (ACT). In the BioCreative 2012 workshop, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system that is applicable to PubMed articles and suitable for gene, chemical, and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database (CTD). The experimental results show that our network-based approach combined with co-occurrence features can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to retrieve the articles related to specific queries without reviewing irrelevant information. In the BioCreative CTD ACT task, CoIN achieved a mean average precision of 0.778 in the triage task, finishing second among all participants.
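The idea of combining co-occurrence features with network centrality can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which centrality measures or scoring function CoIN uses, so the sketch uses plain degree centrality and scores an article by summing the centralities of the entities it mentions. All function names, article identifiers, and entity mentions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(articles):
    """Link two entities whenever they are mentioned in the same sentence.

    articles: dict mapping an article id to a list of sentences, where each
    sentence is given as the set of entity mentions found in it.
    """
    graph = defaultdict(set)
    for sentences in articles.values():
        for entities in sentences:
            for a, b in combinations(sorted(entities), 2):
                graph[a].add(b)
                graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbours) / (n - 1) for node, neighbours in graph.items()}


def article_score(sentences, centrality):
    """Assumed scoring rule: sum the centralities of an article's entities."""
    entities = set().union(*sentences)
    return sum(centrality.get(e, 0.0) for e in entities)


# Hypothetical entity mentions per sentence for three articles.
example_articles = {
    "article_a": [{"arsenic", "TP53"}, {"arsenic", "skin neoplasms"}],
    "article_b": [{"TP53", "skin neoplasms"}, {"arsenic", "GSTM1"}],
    "article_c": [{"caffeine"}],
}
graph = cooccurrence_graph(example_articles)
centrality = degree_centrality(graph)
scores = {aid: article_score(s, centrality) for aid, s in example_articles.items()}
ranked_articles = sorted(scores, key=scores.get, reverse=True)
```

In this toy ranking, an article whose entities sit at the centre of the co-occurrence network is placed ahead of one that mentions only an isolated entity. A real triage system would learn from curated and non-curated training articles rather than rely on a fixed sum.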
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1: INTRODUCTION
1.1 OVERVIEW OF THE DISSERTATION
1.2 INTRODUCTION OF BIOMEDICAL TERM SYNONYMITY
1.3 INTRODUCTION OF CURATABLE BIOMEDICAL TERMS
1.4 INTRODUCTION OF RETRIEVAL SYSTEMS FOR BIOCURATION
CHAPTER 2: ASSESS THE SEMANTIC RELATEDNESS OF BIOMEDICAL TERMS
2.1 INTRODUCTION
2.2 PROBLEM STATEMENT
2.3 SYSTEM FRAMEWORK OF RELPR
2.3.1 Acquisition of Synonym Pairs
2.3.2 Crawl Concept Pairs from Search Engines
2.3.3 Extracting Lexical Patterns from Snippets
2.3.4 ReLPR: Mutually Reinforcing Lexical Pattern Ranking Algorithm
2.3.5 Measuring the Semantic Relatedness
2.4 EVALUATION OF BIOMEDICAL CONCEPT PAIRS
2.5 COMPARISON OF PREVIOUS APPROACHES
2.6 SUMMARY
CHAPTER 3: CURATABLE BIOMEDICAL TERM RECOGNITION
3.1 INTRODUCTION
3.2 COINNER ARCHITECTURE
3.2.1 A Curatable Sentence Classifier
3.2.2 Gene/chemical/disease Named-entity Recognition
3.2.3 Action Term Named-entity Recognition
3.3 EVALUATION OF THE BIOCREATIVE CTD NER TASK
3.4 SUMMARY
CHAPTER 4: DOCUMENT TRIAGE SYSTEM FOR BIOMEDICAL LITERATURE
4.1 INTRODUCTION
4.2 APPROACHES TO INFORMATION RETRIEVAL IN BIOLOGY
4.3 COIN ARCHITECTURE
4.3.1 Curation Workflow
4.3.2 Co-occurrence Models
4.3.3 Network-Based Models
4.4 EVALUATION OF THE BIOCREATIVE CTD ACT TASK
4.5 SUMMARY
CHAPTER 5: CONCLUSIONS