
About

This project is devoted to building a large multilingual semantic network through the application of novel techniques for semantic analysis specifically targeted at the Wikipedia corpus. The driving hypothesis of the project is that the structure of Wikipedia can be effectively used to create a highly structured graph of world knowledge in which nodes correspond to entities and concepts described in Wikipedia, while edges capture ontological relations such as hypernymy and meronymy. Special emphasis is given to exploiting the multilingual information available in Wikipedia in order to improve the performance of each semantic analysis tool. Significant research effort is therefore aimed at developing tools for word sense disambiguation, reference resolution, and the extraction of ontological relations that use multilingual reinforcement and the consistent structure and focused content of Wikipedia to solve these tasks accurately. An additional research challenge is the effective integration of inherently noisy evidence from multiple Wikipedia articles in order to increase the reliability of the overall knowledge encoded in the global Wikipedia graph. Computing probabilistic confidence values for every piece of structural information added to the network is an important step in this integration, and is also meant to increase the utility of the network for downstream applications. The proposed highly structured semantic network complements existing semantic resources and is expected to have a broad impact on a wide range of natural language processing applications in need of large-scale world knowledge.
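
As a concrete illustration of the kind of structure described above, the sketch below shows one possible way such a typed, confidence-weighted graph could be represented in Python. The class names, relation labels, and the example confidence value are assumptions made for illustration only, not the project's actual schema.

    # Minimal sketch of a confidence-weighted semantic network.
    # The names below (Edge, SemanticNetwork, relation labels) are
    # illustrative assumptions, not the project's actual data model.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Edge:
        source: str        # Wikipedia title of the source concept
        target: str        # Wikipedia title of the target concept
        relation: str      # e.g. "hypernymy" or "meronymy"
        confidence: float  # probabilistic confidence in [0, 1]

    @dataclass
    class SemanticNetwork:
        nodes: set = field(default_factory=set)    # entities and concepts
        edges: list = field(default_factory=list)  # typed, weighted relations

        def add_relation(self, source, target, relation, confidence):
            self.nodes.update({source, target})
            self.edges.append(Edge(source, target, relation, confidence))

    # Example: a hypernymy edge extracted with 0.9 confidence.
    net = SemanticNetwork()
    net.add_relation("Jaguar", "Felidae", "hypernymy", 0.9)

In a representation of this kind, downstream applications can threshold or combine the per-edge confidence values when deciding which relations to trust.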

The project is a collaboration between the Language and Information Technologies group at the University of North Texas and the Natural Language Processing group at Ohio University. The project is sponsored by the National Science Foundation under awards #1018613 and #1018590.

People

Data and Code

  • WPCoarse2Fine-Data -- A disambiguation dataset containing Wikipedia links and their contexts for six ambiguous words. Occurrences with coarse sense annotations have been manually annotated with finer senses, in order to evaluate the WSD approaches described in: Shen, Bunescu, and Mihalcea, "Coarse to Fine Grained Sense Disambiguation in Wikipedia", 2013. [download]
  • WPCoarse2Fine-Code -- This package contains the implementation of the semi-supervised learning approaches to WSD in Wikipedia, as described in: Shen, Bunescu, and Mihalcea, "Coarse to Fine Grained Sense Disambiguation in Wikipedia", 2013. [download]
  • WikiSenseClusters -- This package contains several datasets built to evaluate the automatic sense clustering method: two that are generated automatically through a set of heuristics applied on clusters extracted from existing disambiguation pages in English or Spanish, and two that are obtained through manual annotations. Additionally, a dataset was constructed by clustering a set of SemEval word senses. All datasets follow the same format, and consist of pairs of articles annotated as either positive or negative, depending on whether or not the two articles should be grouped under the same sense. A minimal sketch of how such labeled pairs can be used to score a clustering is given after this list. [download]
  • WPInterlingua -- A resource containing 195 pairs of Wikipedia articles, covering four language pairs, manually annotated for translation equivalence. More details can be found in: Dandala, Mihalcea, and Bunescu, "Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia", June 2012. [download]
    The metafile, containing all the candidate interlingual links for ten language pairs, can also be [downloaded].
  • WPSenseReference -- A disambiguation dataset containing links extracted from Wikipedia for four ambiguous words. Occurrences with inconsistent sense annotations have been manually annotated with more specific senses or references. More details can be found in: Shen, Bunescu, and Mihalcea, "Sense and Reference Disambiguation in Wikipedia", August 2012. [download]
  • AdaptiveHAC -- Source code of a Java package that implements the adaptive clustering algorithm used to extend the state-of-the-art Stanford deterministic coreference system with semantic compatibility features. A detailed description is given in: Razvan Bunescu, "An Adaptive Clustering Model that Integrates Expert Rules and N-gram Statistics for Coreference Resolution", ECAI 2012. [download]
  • WPCat -- A Wikipedia taxonomic relation dataset. It contains ten text files, each corresponding to one root category from Wikipedia. Each file contains a directed acyclic graph of categories and titles sampled automatically from the Wikipedia category graph as descendants of the corresponding root category. Node-to-parent and node-to-root pairs have been manually annotated for is-a and instance-of relations. More details can be found in: Mike Chen and Razvan Bunescu, "Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms", Technical Report, June 2011. [download]
  • WPCoref -- A Wikipedia (co)reference dataset. It contains three large Wikipedia articles (John Williams, Barack Obama, and The New York Times) that were manually annotated with coreference and reference information. Coreference relations were annotated for all markable noun phrases, similar to the MUC guidelines. Furthermore, each coreference chain was manually linked to the Wikipedia title that describes the corresponding entity, if such a title exists. The files are in the AIF format recognized by the Callisto annotation interface. More details can be found in: Razvan Bunescu, "(Co)Reference Resolution in Wikipedia", Technical Report, August 2011. [download]
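
The sketch below illustrates how labeled article pairs of the kind distributed in WikiSenseClusters could be used to score a predicted sense clustering with pairwise precision, recall, and F1. The tab-separated file layout, field order, and function names are hypothetical assumptions made for illustration, not the released format.

    # Hypothetical sketch: scoring a sense clustering against a set of
    # article pairs labeled positive/negative. The pair file format below
    # is an assumption, not the actual WikiSenseClusters format.

    def load_pairs(path):
        """Read (article_a, article_b, label) triples, one per line,
        where the label is 'positive' if the two articles should be
        grouped under the same sense and 'negative' otherwise."""
        pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                a, b, label = line.rstrip("\n").split("\t")
                pairs.append((a, b, label == "positive"))
        return pairs

    def pairwise_scores(pairs, clustering):
        """Compare gold pair labels against a predicted clustering,
        given as a dict mapping article title -> cluster id. Articles
        missing from the clustering are treated as unclustered."""
        tp = fp = fn = 0
        for a, b, gold_same in pairs:
            pred_same = a in clustering and b in clustering \
                and clustering[a] == clustering[b]
            if pred_same and gold_same:
                tp += 1
            elif pred_same and not gold_same:
                fp += 1
            elif not pred_same and gold_same:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) \
            if precision + recall else 0.0
        return precision, recall, f1

    # Example usage (hypothetical file name and clustering):
    # pairs = load_pairs("english_heuristic_pairs.tsv")
    # print(pairwise_scores(pairs, {"Jaguar Cars": 0, "Jaguar (band)": 1}))

Scores computed this way reward clusterings that place positively labeled article pairs in the same cluster and negatively labeled pairs in different clusters.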

Publications

  • Hui Shen, Razvan Bunescu, and Rada Mihalcea. Coarse to Fine Grained Sense Disambiguation in Wikipedia. Joint Conference on Lexical and Computational Semantics (*SEM), Atlanta, GA, 2013. [pdf]
  • Bharath Dandala, Chris Hokamp, Rada Mihalcea, and Razvan Bunescu. Sense Clustering Using Wikipedia. Recent Advances in Natural Language Processing (RANLP), Hissar, Bulgaria, 2013.
  • Bharath Dandala, Rada Mihalcea, and Razvan Bunescu. Multilingual Word Sense Disambiguation Using Wikipedia. International Joint Conference on Natural Language Processing (IJCNLP), Nagoya, Japan, 2013.
  • Bharath Dandala. Multilingual Word Sense Disambiguation Using Wikipedia. PhD Dissertation, University of North Texas, 2013.
  • Hui Shen, Razvan Bunescu, and Rada Mihalcea. Sense and Reference Disambiguation in Wikipedia. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India, 2012. [pdf]
  • Bharath Dandala, Rada Mihalcea, and Razvan Bunescu. Word Sense Disambiguation Using Wikipedia. In "The People's Web Meets NLP: Collaboratively Constructed Language Resources", Springer book series "Theory and Applications of Natural Language Processing", editors Iryna Gurevych and Jungi Kim, 2012.
  • Razvan Bunescu. An Adaptive Clustering Model that Integrates Expert Rules and N-gram Statistics for Coreference Resolution. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI), Montpellier, France, August 2012. [pdf]
  • Bharath Dandala, Rada Mihalcea, and Razvan Bunescu. Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, Canada, June 2012. [pdf]
  • Razvan Bunescu. Adaptive Clustering for Coreference Resolution with Deterministic Rules and Web-Based Language Models. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), Montreal, Canada, June 2012. [pdf]
  • Erwin Fernandez-Ordonez, Rada Mihalcea, and Samer Hassan. Unsupervised Word Sense Disambiguation with Multilingual Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, May 2012. [pdf]
  • Carmen Banea and Rada Mihalcea. Word Sense Disambiguation with Multilingual Features. International Conference on Computational Semantics (IWCS), Oxford, UK, January 2011. [pdf]
  • Samer Hassan and Rada Mihalcea. Corpus-based and Knowledge-based Measures of Semantic Relatedness. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2011), San Francisco, CA, August 2011. [pdf]
  • Mike Chen and Razvan Bunescu. Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms. Technical Report, June 2011. [pdf]
  • Razvan Bunescu. (Co)Reference Resolution in Wikipedia. Technical Report, August 2011. [pdf]


Last modified 08/21/2013