Resources

Resources and Tools for Computational Historical Linguistics
Text Simplification Models
Europarl Corpus of Native, Non-native and Translated Texts - ENNTT
A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
Romanian Determiners Lexicon - RoDetLexicon 1.1
Comparing Speech and Text Classification of Native and Non-native English
Degrees of Similarity Between Romanian and Related Languages
Cross-lingual Named Entity Recognition
A Computational Exploration of Pejorative Language in Social Media
multiRedditDep: a Multimodal Dataset for Depression Detection from Social Media

Resources and Tools for Computational Historical Linguistics

The Java code for automatically identifying and producing related words for historical linguistics: link.
The translations used for dictionary-based identification of cognates: link.
The input and output files for experiments on identification and production of related words: link.

RALS: Resources and Baselines for Romanian Automatic Lexical Simplification

This repository accompanies the paper “RALS: Resources and Baselines for Romanian Automatic Lexical Simplification”, which introduces the first set of resources and baseline systems for Lexical Complexity Prediction (LCP) and Lexical Simplification (LS) in Romanian. The project provides new datasets, new evaluation protocols, and several baseline systems.

Neural Text Simplification Models

We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated methods, our neural text simplification (NTS) systems are able to perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve good grammaticality and meaning preservation of output sentences and a higher level of simplification than the state-of-the-art automated TS systems.
Follow these steps, in order, to generate simplified text:
1. Check out our repository, including the submodules: git clone --recursive https://github.com/senisioi/NeuralTextSimplification.git
2. Download the pre-trained released models NTS and NTS-w2v (this may take a while): python src/download_models.py ./models
3. Run translate.sh from the scripts dir: cd src/scripts && ./translate.sh

Europarl Corpus of Native, Non-native and Translated Texts - ENNTT

A complete description of this resource is available here: A Corpus of Native, Non-native and Translated Texts, LREC, 2016, PDF
For the raw corpus, please check the dataset available here
For the experiments presented in the ACL 2016 paper, please check the dataset available here
For the experiments presented in the LREC 2016 paper, please check the dataset available here

Short description:

This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament. The translated texts from different source languages represent a subset of the Haifa Corpus of Translationese. We preserved the same annotation style and included an ID and the EU state that each member of the European Parliament represents.
We hope this dataset will facilitate a unified comparative study of translations and language produced by highly fluent non-native speakers, two closely-related phenomena that have only been studied in isolation so far.

A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus

This work is the result of our collaboration with Anca Bucur, Ph.D. candidate, from the Center of Excellence in Image Study.
We compile a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
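The idea of linking propositions by similarity can be illustrated with a minimal sketch. The page does not say which similarity measure the project uses, so the character trigram Jaccard score below is only an assumption chosen for simplicity, and the function names and the threshold value are hypothetical. The output is a list of weighted edges that a graph-drawing tool could render as a network.

```python
from itertools import combinations


def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_edges(propositions, threshold=0.2):
    """Build weighted edges between propositions.

    propositions: dict mapping a proposition id to its text (one language).
    threshold: assumed cut-off; pairs scoring below it get no edge.
    Returns (id_a, id_b, score) tuples sorted by descending score.
    """
    grams = {pid: char_ngrams(text) for pid, text in propositions.items()}
    edges = []
    for p, q in combinations(sorted(grams), 2):
        score = jaccard(grams[p], grams[q])
        if score >= threshold:
            edges.append((p, q, score))
    return sorted(edges, key=lambda edge: -edge[2])


if __name__ == "__main__":
    # A few English propositions used purely as sample input.
    sample = {
        "1": "The world is all that is the case.",
        "1.1": "The world is the totality of facts, not of things.",
        "7": "Whereof one cannot speak, thereof one must be silent.",
    }
    for a, b, score in similarity_edges(sample, threshold=0.0):
        print(f"{a} -- {b}: {score:.2f}")
```

Running the same function separately on each language version of the corpus would give one edge list per language, which is one way the networks for different languages could be compared.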

Comparing Speech and Text Classification of Native and Non-native English

We provide a comparison of speech and text classification of native and non-native English using a subset of the International Corpus Network of Asian Learners of English (ICNALE).
The analysis is reported in the paper Nisioi, S., Comparing Speech and Text Classification on ICNALE, LREC 2016.

Romanian Determiners Lexicon - RoDetLexicon 1.1

The first version of the Romanian Determiners Lexicon (RoDetLexicon 1.1) specifies the relevant features for determiners studied so far during the research project “The structure and interpretation of Romanian Determiner Phrase in Discourse Representation Theory: the determiners”. The importance of determiners comes from both syntax and semantics. From the point of view of syntactic theory, specifying the determiner’s relevant features naturally leads to the determination of the parameters of syntactic variation in the Determiner Phrase domain. From the discursive perspective, determiners have a fundamental role, being the most important constituents when it comes to establishing the logical structure of the sentence or of the discourse.
The feature matrix of each determiner contains morpho-syntactic and semantic features, as they emerged from the studies developed during the project, such as: syntactic category, selectional features, phi-features (person, number, gender), definiteness, quantificational features, cardinality, focus, topic, deixis, proximity, contrastive, location, anaphoric, cataphoric or classifier.
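A feature matrix of this kind can be pictured as one record per determiner. The sketch below is only an illustration: the class name, field names, value encodings, and the two sample entries are hypothetical and are not taken from RoDetLexicon itself, whose actual file format is described in the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeterminerEntry:
    """Hypothetical record for one determiner; not the lexicon's real schema."""
    lemma: str
    syntactic_category: str
    person: Optional[int] = None      # phi-feature
    number: Optional[str] = None      # phi-feature, e.g. "sg" or "pl"
    gender: Optional[str] = None      # phi-feature, e.g. "m" or "f"
    # Binary features that hold for this determiner, e.g. "deixis", "proximity".
    flags: frozenset = frozenset()


def with_flag(entries, flag):
    """Return the entries for which the given binary feature holds."""
    return [entry for entry in entries if flag in entry.flags]


if __name__ == "__main__":
    # Illustrative entries only: "acest" (this) and "acel" (that).
    entries = [
        DeterminerEntry("acest", "demonstrative", number="sg", gender="m",
                        flags=frozenset({"deixis", "proximity"})),
        DeterminerEntry("acel", "demonstrative", number="sg", gender="m",
                        flags=frozenset({"deixis"})),
    ]
    print([entry.lemma for entry in with_flag(entries, "proximity")])
```

Keeping the binary features in a set rather than as separate fields makes it easy to add features from the list above (focus, topic, anaphoric, and so on) without changing the record definition.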
More details are available in this paper.

Degrees of Similarity Between Romanian and Related Languages

More details are available in Ciobanu, A.M. and Dinu, L.P., An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian, EMNLP 2014 PDF

Cross-lingual Named Entity Recognition

Experiments on named entity translation using word embeddings are described in Şulea, O. M., Nisioi, S., and Dinu, L. P., Using Word Embeddings to Translate Named Entities, LREC 2016.
This resource is an annotated parallel corpus of named entities and is currently a work in progress.

A Computational Exploration of Pejorative Language in Social Media

More details about this resource can be found in Dinu, L. P., Iordache, I. B., Uban, A. S., Zampieri, M.: A Computational Exploration of Pejorative Language in Social Media, Findings of the Association for Computational Linguistics: EMNLP 2021.
The dataset can be downloaded via this link.

multiRedditDep: a Multimodal Dataset for Depression Detection on Social Media

More details about this resource can be found in Uban, Ana-Sabina, Berta Chulvi, and Paolo Rosso. “Explainability of depression detection on social media: From deep learning models to psychological interpretations and multimodality.” Early Detection of Mental Health Disorders by Social Media Monitoring: The First Five Years of the eRisk Project. Cham: Springer International Publishing, 2022. 289-320.
Please email Ana Uban for access to the dataset.