Resources

Resources and Tools for Computational Historical Linguistics

The Java code for automatically identifying and producing related words for historical linguistics: link.
The translations used for dictionary-based identification of cognates: link.
The input and output files for experiments on identification and production of related words: link.

We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated methods, our neural text simplification (NTS) systems are able to perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve good grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.
Follow these steps, in order, to generate simplified text:
1. Check out our repository, including the submodules: git clone --recursive https://github.com/senisioi/NeuralTextSimplification.git
2. Download the released pre-trained models NTS and NTS-w2v (this may take a while): python src/download_models.py ./models
3. Run translate.sh from the scripts dir: cd src/scripts && ./translate.sh
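The three steps above can also be chained from a small Python wrapper. This is only a sketch under stated assumptions: it assumes the clone lands in a directory named NeuralTextSimplification, that step 2 is run from the repository root, and that git and python are on the PATH. The helper names build_steps and run_steps are hypothetical and not part of the repository.

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/senisioi/NeuralTextSimplification.git"


def build_steps(workdir="."):
    """Return (command, working directory) pairs mirroring the three listed steps."""
    # Assumption: git names the clone directory after the repository.
    repo = Path(workdir) / "NeuralTextSimplification"
    return [
        # Step 1: clone the repository together with its submodules.
        (["git", "clone", "--recursive", REPO_URL], Path(workdir)),
        # Step 2: download the pre-trained NTS and NTS-w2v models.
        (["python", "src/download_models.py", "./models"], repo),
        # Step 3: run translate.sh from the scripts directory.
        (["./translate.sh"], repo / "src" / "scripts"),
    ]


def run_steps(steps, dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order."""
    for command, cwd in steps:
        print(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            # check=True stops the sequence as soon as one step fails.
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_steps(build_steps(), dry_run=True)
```

Calling run_steps(build_steps(), dry_run=False) would execute the commands for real; the dry run is the default so that the commands can be inspected first.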

A complete description of this resource is available here: A Corpus of Native, Non-native and Translated Texts, LREC, 2016, PDF
For the raw corpus, please check the dataset available here
For the experiments presented in the ACL 2016 paper, please check the dataset available here
For the experiments presented in the LREC 2016 paper, please check the dataset available here

This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament. The translated texts from different source languages represent a subset of the Haifa Corpus of Translationese. We preserved the same annotation style and included an ID and the EU state that each member of the European Parliament represents.
We hope this dataset will facilitate a unified comparative study of translations and language produced by highly fluent non-native speakers, two closely-related phenomena that have only been studied in isolation so far.

This work is the result of our collaboration with Anca Bucur, Ph.D. candidate, from the Center of Excellence in Image Study.
We compile a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
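To make the idea of a similarity network between propositions concrete, here is a minimal sketch. The description above does not say which similarity measure is used, so cosine similarity over word counts is a stand-in assumption, and the function names and the threshold value are hypothetical. Pairs whose score reaches the threshold become weighted edges of the network.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similarity_edges(propositions, threshold=0.3):
    """Return (id_a, id_b, score) for every pair scoring at least threshold.

    Assumption: the threshold is an illustrative value, not one taken
    from the original work.
    """
    vectors = {pid: Counter(tokenize(text)) for pid, text in propositions.items()}
    edges = []
    for left, right in combinations(sorted(vectors), 2):
        score = cosine(vectors[left], vectors[right])
        if score >= threshold:
            edges.append((left, right, score))
    return edges


# Toy input: three short propositions from the opening of the English text.
propositions = {
    "1": "The world is all that is the case.",
    "1.1": "The world is the totality of facts, not of things.",
    "1.2": "The world divides into facts.",
}

for left, right, score in similarity_edges(propositions):
    print(f"{left} -- {right}: {score:.3f}")
```

The resulting edge list can be passed to any graph drawing tool. Running the same procedure on each language version of the text would give one network per language.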

We provide a comparison of speech and text classification of native and non-native English using a subset of the International Corpus Network of Asian Learners of English (ICNALE).
The analysis is reported in the paper Nisioi, S., Comparing Speech and Text Classification on ICNALE, LREC 2016

The first version of the Romanian Determiners Lexicon (RoDetLexicon 1.1) specifies the relevant features for the determiners studied so far in the research project “The structure and interpretation of Romanian Determiner Phrase in Discourse Representation Theory: the determiners”. The importance of determiners comes from both syntax and semantics. From the point of view of syntactic theory, specifying a determiner’s relevant features naturally leads to determining the parameters of syntactic variation in the Determiner Phrase domain. From the discourse perspective, determiners play a fundamental role, being the most important constituents when it comes to establishing the logical structure of the sentence or of the discourse.
The feature matrix of each determiner contains morpho-syntactic and semantic features, as they emerged from the studies developed during the project, such as: syntactic category, selectional features, phi-features (person, number, gender), definiteness, quantificational features, cardinality, focus, topic, deixis, proximity, contrastive, location, anaphoric, cataphoric or classifier.
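One way to picture such a feature matrix is as a record with optional fields. The sketch below is purely illustrative: the class name, the field names, and the value encodings are assumptions, and the sample entry for the demonstrative "acest" is not taken from RoDetLexicon itself. Only a subset of the features listed above is modeled.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class DeterminerFeatures:
    """Hypothetical feature matrix for one determiner form.

    Fields left as None are unspecified for that form.
    """

    form: str
    syntactic_category: Optional[str] = None
    person: Optional[int] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    definiteness: Optional[bool] = None
    deixis: Optional[bool] = None
    proximity: Optional[str] = None

    def specified(self):
        """Return only the features that carry a value."""
        return {name: value for name, value in asdict(self).items() if value is not None}


# Illustrative entry (assumed values, not copied from the lexicon):
# "acest" is a masculine singular proximal demonstrative.
acest = DeterminerFeatures(
    form="acest",
    number="singular",
    gender="masculine",
    deixis=True,
    proximity="proximal",
)

print(acest.specified())
```

Keeping unspecified features as None, rather than omitting them, makes it straightforward to compare two determiners feature by feature.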
More details are available in this paper.

More details are available in Ciobanu, A.M. and Dinu, L.P., An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian, EMNLP 2014 PDF

Experiments on named entity translation using word embeddings are described in Şulea, O. M., Nisioi, S., and Dinu, L. P., Using Word Embeddings to Translate Named Entities, LREC 2016.
This resource is an annotated parallel corpus of named entities, currently a work in progress.

More details about this resource can be found in Dinu, L. P., Iordache, I. B., Uban, A. S., Zampieri, M.: A Computational Exploration of Pejorative Language in Social Media, Findings of the Association for Computational Linguistics: EMNLP 2021.
The dataset can be downloaded via this link.

More details about this resource can be found in Uban, Ana-Sabina, Berta Chulvi, and Paolo Rosso. “Explainability of depression detection on social media: From deep learning models to psychological interpretations and multimodality.” Early Detection of Mental Health Disorders by Social Media Monitoring: The First Five Years of the eRisk Project. Cham: Springer International Publishing, 2022. 289-320.
Please email Ana Uban for access to the dataset.