Discourse relations such as ‘contrast’, ‘cause’ or ‘evidence’ are often postulated to explain how humans understand the function of one sentence in relation to another. Some relations are signaled rather directly using words such as “because” or “on the other hand”, but often signals are highly ambiguous or remain implicit, and cannot be associated with specific words. This opens up questions regarding how exactly we recognize relations and what kinds of computational models we can build to account for them.
In this talk I will explore models capturing discourse signals in the framework of Rhetorical Structure Theory (Mann & Thompson 1988), using data from the RST Signalling Corpus (Taboada & Das 2013) and a richly annotated corpus called GUM (Zeldes 2017). Using manually annotated data indicating the presence of lexical and implicit signals, I will show that purely text-based models using RNNs and word embeddings inevitably miss important aspects of discourse structure. I will argue that richly annotated data beyond the textual level, including syntactic and semantic information, is required to form a more complete picture of discourse relations in text.
Amir Zeldes is assistant professor of Computational Linguistics at Georgetown University, specializing in Corpus Linguistics. He studied Cognitive Science, Linguistics and Computational Linguistics in Jerusalem, Potsdam, and Berlin, receiving his PhD in Linguistics from Humboldt University in 2012. His interests center on the syntax-semantics interface, where meaning and knowledge about the world are mapped onto language-specific choices. His most recent work focuses on computational discourse models which reflect common ground and communicative intent across sentences. He is also involved in the development of tools for corpus search, annotation and visualization, and has worked on representations of textual data in Linguistics and the Digital Humanities.
February 23, 2018 @ 12:00 pm – 1:15 pm
Hackerman Hall B17
3400 N Charles St
Baltimore, MD 21218
Naturally, the literary scholar is often concerned with more context than can be conveniently displayed in a KWIC concordance, which is why most literarily oriented concordance interfaces offer hyperlinking between concordances and expanded context views of the corpus. The advantage of using both views in conjunction is that potentially interesting results can be reviewed quickly in the plain-text concordance, possibly with helpful highlighting and annotations, while the detailed view reached from this list can contain both more text and representations that are more taxing to interpret, such as aligned facsimiles. A good example of this mode of operation can be found in the Canterbury Tales Project, which also offers special marking for variants in the collation, so that different versions of a search result can be navigated to on the fly. Although these functions have been developed largely with literary computing in mind, they are entirely applicable to corpus linguistics as well. Many linguistic domains require relatively large contexts, and many corpora correspondingly offer not only adjustable context width for concordances but also dedicated text-length context views, which are especially appropriate for studying text-wide dependencies. The rhetorical structure annotated in the above-mentioned Potsdam Commentary Corpus, for example, cannot be adequately interpreted without very large contexts, and often requires reading an entire text. Corpora comprising short news stories or essays can also be studied at the text level, using searches to retrieve texts containing interesting phenomena. This allows researchers, for instance, to study constructions typical of the beginning or end of a text, and their dependence on various features present in or absent from the text as a whole.
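To make the KWIC (keyword-in-context) view concrete, the following is a minimal sketch of plain-text KWIC retrieval over a raw corpus string. It is not the implementation of any of the interfaces mentioned above; the function name `kwic` and its parameters (`width` of left/right context in characters) are illustrative assumptions.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines: each whole-word match of
    `keyword`, padded with up to `width` characters of context
    on either side, aligned around the keyword."""
    results = []
    pattern = r'\b' + re.escape(keyword) + r'\b'
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align left context and left-align right context,
        # so all keywords line up in one column.
        results.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return results

sample = ("The pilgrims told tales on the road to Canterbury. "
          "Each tale reveals its teller, and the tales vary in tone.")
for line in kwic(sample, "tales"):
    print(line)
```

An interface like those described above would then hyperlink each concordance line to an expanded context view (or facsimile) anchored at the match offset, which is why the function keeps character positions rather than just the matched words.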
This means that the same corpus can be exploited by researchers in different fields, or even used to examine interdependencies between different annotation layers (for example, the effect of information structure on syntax). More and more types of annotation, often created by labor-intensive manual methods, are proliferating: for example, verbal argument annotations in PropBank and discourse annotations for connectives like because or although in the Penn Discourse Treebank. New research methods that take advantage of several such annotation layers simultaneously may reveal as yet unknown interactions between different linguistic levels.