
About
This project is devoted to building a large multilingual semantic network by applying novel techniques for semantic analysis specifically targeted at the Wikipedia corpus. The driving hypothesis of the project is that the structure of Wikipedia can be used effectively to create a highly structured graph of world knowledge in which nodes correspond to entities and concepts described in Wikipedia, while edges capture ontological relations such as hypernymy and meronymy. Special emphasis is given to exploiting the multilingual information available in Wikipedia in order to improve the performance of each semantic analysis tool. Significant research effort is therefore aimed at developing tools for word sense disambiguation, reference resolution, and the extraction of ontological relations that exploit multilingual reinforcement and the consistent structure and focused content of Wikipedia to solve these tasks accurately. An additional research challenge is the effective integration of inherently noisy evidence from multiple Wikipedia articles in order to increase the reliability of the overall knowledge encoded in the global Wikipedia graph. Computing probabilistic confidence values for every piece of structural information added to the network is an important step in this integration, and it also increases the utility of the network for downstream applications. The proposed highly structured semantic network complements existing semantic resources and is expected to have a broad impact on a wide range of natural language processing applications in need of large-scale world knowledge.
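As a concrete illustration of the kind of structure described above (not the project's actual storage format), the network can be pictured as a directed graph whose edges carry a relation type and a probabilistic confidence value; the concepts, relation names, and confidence values below are invented for the example.

```python
# Illustrative sketch only: a toy in-memory version of the kind of
# confidence-weighted semantic network described above. Concept names,
# relation types, and confidences are invented for the example.
from collections import defaultdict

class SemanticNetwork:
    def __init__(self):
        # adjacency: source concept -> list of (relation, target, confidence)
        self.edges = defaultdict(list)

    def add_relation(self, source, relation, target, confidence):
        """Record an ontological relation with a probabilistic confidence."""
        self.edges[source].append((relation, target, confidence))

    def relations(self, source, min_confidence=0.0):
        """Return relations from `source` above a confidence threshold,
        the kind of filtering a downstream application might apply."""
        return [(rel, tgt, c) for rel, tgt, c in self.edges[source]
                if c >= min_confidence]

net = SemanticNetwork()
net.add_relation("Ann Arbor", "instance-of", "City", 0.97)
net.add_relation("City", "is-a", "Human settlement", 0.95)   # hypernymy
net.add_relation("Ann Arbor", "part-of", "Michigan", 0.92)   # meronymy
print(net.relations("Ann Arbor", min_confidence=0.95))
```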
The project is a collaboration between the Language and Information Technologies group at the University of Michigan and the Natural Language Processing group at Ohio University. The project is sponsored by the National Science Foundation under awards #1018613 and #1018590.
People
Razvan Bunescu (PI)
Rada Mihalcea (PI)
Mike Chen
Jincheng Chen
Bharath Dandala
Samer Hassan
Kevin Janowiecki
Hui Shen
Data
WPGraphDB — A Neo4j graph database containing taxonomic relations extracted automatically from the Wikipedia category graph. The taxonomic relation extraction system and the database are described in: Shen et al., “Wikipedia Taxonomic Relation Extraction using Wikipedia Distant Supervision”, 2014. [download]
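As a sketch of how such a graph database might be queried with the official Neo4j Python driver: the connection settings, the :ISA relationship type, and the name property below are assumptions made for illustration, not the documented schema of WPGraphDB.

```python
# Sketch only: querying a taxonomic graph in Neo4j with the official
# Python driver. The Bolt URI, credentials, relationship type (:ISA),
# and property name (name) are assumptions about the schema, not
# documented facts about WPGraphDB.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# Find categories/titles up to 3 taxonomic hops below a root category.
query = """
MATCH (c)-[:ISA*1..3]->(root {name: $root})
RETURN DISTINCT c.name AS descendant
"""

with driver.session() as session:
    for record in session.run(query, root="Musicians"):
        print(record["descendant"])

driver.close()
```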
WPCoarse2Fine-Data — A disambiguation dataset containing Wikipedia links and their contexts for 6 ambiguous words. Occurrences with coarse sense annotations have been manually annotated with finer senses in order to evaluate the WSD approaches described in: Shen, Bunescu, and Mihalcea, “Coarse to Fine Word Sense Disambiguation in Wikipedia”, 2013. [download]
WPCoarse2Fine-Code — This package contains the implementation of the semi-supervised learning approaches to WSD in Wikipedia, as described in: Shen, Bunescu, and Mihalcea, “Coarse to Fine Word Sense Disambiguation in Wikipedia”, 2013. [download]
WikiSenseClusters — This package contains several datasets built to evaluate the automatic sense clustering method: two generated automatically through a set of heuristics applied to clusters extracted from existing disambiguation pages in English or Spanish, and two obtained through manual annotation. Additionally, a dataset was constructed by clustering a set of SemEval word senses. All datasets follow the same format and consist of pairs of articles annotated as either positive or negative, depending on whether the two articles should be grouped together under one sense. [download]
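Since the datasets consist of article pairs labeled positive or negative, a sense clustering can be scored with pairwise precision, recall, and F1. Below is a minimal sketch of that evaluation; the in-memory pair representation and the article names are assumptions for illustration, not the dataset's actual format.

```python
# Minimal sketch of pairwise evaluation for sense clustering.
# Input: annotated pairs (article1, article2, label) and a predicted
# clustering mapping each article to a cluster id.

def pairwise_scores(pairs, clustering):
    tp = fp = fn = 0
    for a, b, label in pairs:
        same = (a in clustering and b in clustering
                and clustering[a] == clustering[b])
        if label == "positive" and same:
            tp += 1
        elif label == "positive":          # positive pair split apart
            fn += 1
        elif same:                         # negative pair merged together
            fp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy usage with invented article titles.
pairs = [("Mercury (planet)", "Mercury (astronomy)", "positive"),
         ("Mercury (planet)", "Mercury (element)", "negative")]
clustering = {"Mercury (planet)": 0, "Mercury (astronomy)": 0,
              "Mercury (element)": 1}
print(pairwise_scores(pairs, clustering))
```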
WPInterlingua — A resource containing 195 pairs of Wikipedia articles, covering four language pairs, manually annotated for translation equivalence. More details can be found in: Dandala, Mihalcea, and Bunescu, “Towards Building a Multilingual Semantic Network: Identifying Interlingual Links in Wikipedia”, June 2012. [download] The metafile, containing all the candidate interlingual links for ten language pairs, can also be [downloaded].
WPSenseReference — A disambiguation dataset containing links extracted from Wikipedia for four ambiguous words. Occurrences with inconsistent sense annotations have been manually annotated with more specific senses or references. More details can be found in: Shen, Bunescu, and Mihalcea, “Sense and Reference Disambiguation in Wikipedia”, August 2012. [download]
AdaptiveHAC — Source code of a Java package that implements the adaptive clustering algorithm used to extend the state-of-the-art Stanford deterministic coreference system with semantic compatibility features. A detailed description is given in: Razvan Bunescu, “An Adaptive Clustering Model that Integrates Expert Rules and N-gram Statistics for Coreference Resolution”, ECAI 2012. [download]
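For orientation, the sketch below shows a generic best-first agglomerative clustering loop of the kind such a system builds on: repeatedly merge the most compatible pair of clusters until no pair scores above a threshold. The compatibility function here is a toy stand-in, not the paper's actual combination of expert rules and n-gram statistics.

```python
# Generic sketch of best-first hierarchical agglomerative clustering (HAC)
# for coreference. The `compatibility` function is a placeholder for the
# system's real scoring of cluster pairs.
import itertools

def hac(mentions, compatibility, threshold):
    clusters = [[m] for m in mentions]          # start with singletons
    while len(clusters) > 1:
        # find the highest-scoring pair of current clusters
        best_pair, best_score = None, threshold
        for c1, c2 in itertools.combinations(clusters, 2):
            score = compatibility(c1, c2)
            if score > best_score:
                best_pair, best_score = (c1, c2), score
        if best_pair is None:                   # nothing compatible enough
            break
        c1, c2 = best_pair
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)                # merge the best pair
    return clusters

# Toy usage: merge mention clusters that share a head word.
mentions = ["the composer", "John Williams", "the famous composer"]
same_head = lambda c1, c2: 1.0 if any(
    m1.split()[-1] == m2.split()[-1] for m1 in c1 for m2 in c2) else 0.0
print(hac(mentions, same_head, threshold=0.5))
```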
WPCat — A Wikipedia taxonomic relation dataset. It contains ten text files, each corresponding to one root category from Wikipedia. Each file contains a directed acyclic graph of categories and titles sampled automatically from the Wikipedia category graph as descendants of the corresponding root category. Node-to-parent and node-to-root pairs have been manually annotated for is-a and instance-of relations. More details can be found in: Mike Chen and Razvan Bunescu, “Taxonomic Relation Extraction from Wikipedia: Datasets and Algorithms”, Technical Report, June 2011. [download]
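As an illustration of how node-to-root labels relate to node-to-parent labels, the sketch below propagates relations through a toy DAG using the standard composition rules for taxonomic relations (is-a followed by is-a yields is-a; instance-of followed by is-a yields instance-of). The in-memory representation and category names are assumptions for illustration, not the dataset's actual file format.

```python
# Sketch only: deriving node-to-root relation labels from node-to-parent
# labels in a category DAG, using the standard composition rules:
#   is-a        followed by is-a  =>  is-a
#   instance-of followed by is-a  =>  instance-of

def compose(child_to_parent, parent_to_root):
    if parent_to_root == "is-a" and child_to_parent in ("is-a", "instance-of"):
        return child_to_parent
    return "none"                      # anything else stays unlabeled

def node_to_root_labels(parents, root):
    """parents: dict node -> list of (parent, label). Returns each node's
    relation label with respect to `root`, propagated through the DAG."""
    labels = {}
    def label(node):
        if node == root:
            return "is-a"              # identity element for the rules above
        if node not in labels:
            cands = [compose(l, label(p)) for p, l in parents.get(node, [])]
            labels[node] = next((c for c in cands if c != "none"), "none")
        return labels[node]
    for node in list(parents):
        label(node)
    return labels

# Toy example under an invented "Musicians" root category.
parents = {"Composers": [("Musicians", "is-a")],
           "John Williams": [("Composers", "instance-of")]}
print(node_to_root_labels(parents, "Musicians"))
# {'Composers': 'is-a', 'John Williams': 'instance-of'}
```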
WPCoref — A Wikipedia (co)reference dataset. It contains three large Wikipedia articles (John Williams, Barack Obama, and The New York Times) that were manually annotated with coreference and reference information. Coreference relations were annotated for all markable noun phrases, similar to the MUC guidelines. Furthermore, each coreference chain was manually linked to the Wikipedia title that describes the corresponding entity, if such a title exists. The files are in the AIF format recognized by the Callisto annotation interface. More details can be found in: Razvan Bunescu, “(Co)Reference Resolution in Wikipedia”, Technical Report, August 2011. [download]
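System output over chain-annotated data like this is commonly scored with the MUC link-based metric of Vilain et al. (1995); the sketch below implements that standard metric and is not part of the released dataset or tools.

```python
# Minimal sketch of the MUC link-based coreference metric
# (Vilain et al., 1995). Each argument is a list of chains,
# each chain a set of mention ids.

def muc(key_chains, response_chains):
    def recall(keys, responses):
        num = den = 0
        for chain in keys:
            # partition the key chain by the response chains; mentions
            # not covered by any response chain count as singletons
            parts = {frozenset(chain & r) for r in responses if chain & r}
            covered = set().union(*responses) if responses else set()
            num += len(chain) - (len(parts) + len(chain - covered))
            den += len(chain) - 1
        return num / den if den else 0.0
    r = recall(key_chains, response_chains)
    p = recall(response_chains, key_chains)   # precision is the dual
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

key = [{"m1", "m2", "m3"}]                    # one gold chain of 3 mentions
response = [{"m1", "m2"}, {"m3", "m4"}]       # system split it in two
print(muc(key, response))                     # (0.5, 0.5, 0.5)
```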