InstRead: Research Instruments for Text Complexity, Simplification and Readability Assessment
Project PN-IV-P2-2.1-TE-2023-2007, funded by the Romanian National Authority for Scientific Research and Innovation, UEFISCDI: “Research Instruments for Text Complexity, Simplification and Readability Assessment”.
Scientific Project Report 1
Abstract
We develop the first set of instruments for assessing lexical complexity and readability in Romanian and for creating simplified texts. Our goals are to narrow the gap between Romanian and other languages in this research field and to propose new methods for these tasks, inspired by recent advances in Large Language Models (LLMs). Our project aims to:
- build and collect a corpus of lexical complexity assessments provided by young adult native Romanian speakers
- offer a statistical analysis of the annotations, comparing different text genres and linguistic features
- train and evaluate deep learning algorithms by leveraging LLMs and compare them with traditional methods
- develop a set of tools on the project’s website that can be used to evaluate lexical complexity, assess readability, or simplify new documents.
The main scientific contributions of this project consist of releasing modern readability resources for Romanian to the general public, thereby initiating the development of this field in the local context, reducing the research gap with other well-studied languages, and enabling new interdisciplinary collaborations for future research on text complexity.
Team
Core Team
- Sergiu Nisioi, Principal Investigator, PhD
- Claudiu Creangă, PhD candidate at the Interdisciplinary School of Doctoral Studies, University of Bucharest
- Ana Sabina Uban, PhD, Assoc. Prof. at the Faculty of Mathematics and Computer Science, University of Bucharest
- Adina Camelia Bleotu, PhD, Assistant Prof. at the Faculty of Foreign Languages and Literatures
- Mihai Dascălu, PhD, Full Prof. at the National University of Science and Technology POLITEHNICA
- Bogdan Mustață, PhD, CINETic Laboratory, Assoc. Prof. at the I. L. Caragiale National University of Theatre and Film
Partners: Psychological Research and Professional Training Laboratory
- Adrian Luca, PhD, Faculty of Psychology and Educational Studies
- Filip Popovici, PhD, Faculty of Psychology and Educational Studies
- Constantin Vasile, PhD, Faculty of Psychology and Educational Studies
Students and Research Assistants
- Oleksandra Kuvshynova, MSc
- Mircea Marin, MSc
- Rareș Roșcan, BSc
- Rareș Cocoșilă, BSc
- Anamaria Hodivoianu, MSc
- Rareș Păpușoi, MSc
- Petru Theodor Cristea, MSc
- Cristina Popescu, BSc
- Fabian Anghel, MSc
- Mihai Grigore, MSc
- Anastasia Ștefănescu, MSc
- Teodora Ioana Nae, BSc
- Teodor-Filip Leahu, BSc
Resources
Project Objectives for 2025 (Stage 1)
| Type | Name | Details |
|---|---|---|
| Stage 1 | Creation of Annotated Corpus; Eye-tracking Data Recording; Lexical Complexity Prediction | |
| Act. 1.1 | Creation of Corpus with Explicit Annotations | |
| Part. 1.1.1: | Text selection, participant selection, corpus acquisition, dissemination | Data annotation: selecting participants for the study; collecting and preparing texts to be annotated; preparing the guide and legal consent; setting up the annotation platform |
| Act. 1.2 | Creation of Corpus with Implicit Annotations from Eye Tracking | |
| Part. 1.2.1: | Text selection, participant selection, corpus acquisition, dissemination | Laboratory data annotation: preparing texts in the appropriate format for eye tracking; device calibration and internal tests with members of our team; recurrent meetings with participants to record data |
| Act. 1.3 | Post-processing and data analysis | |
| Part. 1.3.1: | Data validation, post-processing and unification of data extracted from eye tracking and explicit annotations, dissemination | Data collection, analysis, and reliability testing |
| Act. 1.4 | Lexical Complexity Prediction (LCP) | |
| Part. 1.4.1: | Creation of machine-learning models for LCP based on collected data, dissemination | Training ML models for LCP: developing and implementing algorithms for automatic LCP based on traditional machine learning approaches; quantitative and qualitative evaluation; comparison with datasets in all available languages (Romanian, Spanish, French, English). Deep learning and LLMs for LCP: creating annotations using LLMs; verifying self-attention, neural activations, and information flows in LLMs for LCP; qualitative and quantitative evaluation across available languages (Romanian, Spanish, French, English) |
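
As a rough illustration of what a traditional machine learning baseline for LCP can look like, the sketch below fits a linear regression on two surface features, word length and log frequency. It is not the project's model: the feature choice, the helper names (`features`, `fit_least_squares`, `predict`), and the toy words with their frequencies and complexity scores are all invented for this example.

```python
import math

def features(word, freq_per_million):
    # Bias term, length in characters, and log10 of (frequency + 1).
    return [1.0, float(len(word)), math.log10(freq_per_million + 1.0)]

def fit_least_squares(X, y):
    # Ordinary least squares via the normal equations (X^T X) w = X^T y,
    # solved with Gaussian elimination and partial pivoting.
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / A[i][i]
    return w

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data, invented for illustration only:
# (word, frequency per million, complexity score in [0, 1]).
toy = [
    ("casă", 900.0, 0.05),
    ("apă", 800.0, 0.05),
    ("carte", 500.0, 0.10),
    ("paradigmă", 8.0, 0.45),
    ("incontestabil", 5.0, 0.50),
    ("epistemologie", 1.0, 0.70),
]

X = [features(word, freq) for word, freq, _ in toy]
y = [score for _, _, score in toy]
w = fit_least_squares(X, y)

for word, freq, score in toy:
    estimate = predict(w, features(word, freq))
    print(f"{word}: annotated={score:.2f} predicted={estimate:.2f}")
```

A real baseline would replace the toy list with the annotated corpus, hold out a test split, and report correlation and error metrics against the human scores.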
Publications
- Anghel, Fabian, Cristea, Petru-Theodor, Creangă, Claudiu, & Nisioi, Sergiu. “RALS: Resources and Baselines for Romanian Automatic Lexical Simplification.” In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 31469–31480. EMNLP, 2025. (SAC Highlights Award)
- Krakowczyk, Daniel G., Reich, David R., Säuberli, Andreas, Škrjanec, Iza, Cretton, Isabelle CR, Jakobi, Deborah N., Nisioi, Sergiu, Prasse, Paul, & Jäger, Lena A. “The More the Merrier: Boost Your Dataset Visibility and Discover Eye-Tracking Datasets with pymovements.” In Proceedings of the 2025 Symposium on Eye Tracking Research and Applications, pp. 1–3. 2025. https://pymovements.readthedocs.io/en/stable/
- Bleotu, Adina Camelia, Foucault, Deborah, Roeper, Tom, & Lakshmanan, Usha. “The role of Universal Grammar and crosslinguistic influence in the interpretation of recursive set-subset adjectives in adult Romanian L1 English–L2 bilinguals.” Frontiers in Human Neuroscience, 19, 1537488. 2025. https://doi.org/10.3389/fnhum.2025.1537488
- Jerpelea, Alexandru-Iulius, Rădoi, Alina, & Nisioi, Sergiu. “Dialectal and low resource machine translation for Aromanian.” In Proceedings of the 31st International Conference on Computational Linguistics, pp. 7209–7228. 2025.
- Roșcan, Rareș-Alexandru, & Nisioi, Sergiu. “Archaeology at TSAR 2025 Shared Task: Teaching Small Models to do CEFR Simplifications.” In Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), pp. 251–260. 2025, workshop at EMNLP 2025.
- Păpușoi, Rareș, & Nisioi, Sergiu. “A Comparison of Elementary Baselines for BabyLM.” In Proceedings of the First BabyLM Workshop, pp. 218–225. 2025, workshop at EMNLP 2025.
- Crivoi, Carla, & Uban, Ana Sabina. “SciBERT Meets Contrastive Learning: A Solution for Scientific Hallucination Detection.” In Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025), pp. 336–343. 2025, part of ACL 2025.
- Hodivoianu, Anamaria, Kuvshynova, Oleksandra, Popovici, Filip, Luca, Adrian, & Nisioi, Sergiu. “Predicting Total Reading Time Using Romanian Eye-Tracking Data.” Gaze4NLP: The First International Workshop on Gaze Data and Natural Language Processing, 2025, part of RANLP 2025.
- Popescu, Cristina Maria, & Nisioi, Sergiu. “Exploring Mouse Tracking for Reading on Romanian Data.” Paper presented at The 3rd Workshop on Eye Movements and the Assessment of Reading Comprehension, Stuttgart, 2025, part of RANLP 2025.
- Creangă, Claudiu, Marchitan, Teodor-George, & Dinu, Liviu P. “Team Unibuc-NLP at SemEval-2025 Task 11: Few-shot text-based emotion detection.” In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pp. 468–475. 2025, co-located with ACL, Vienna, Austria.
- Ghetoiu, Laurențiu G., & Nisioi, Sergiu. “Graph-based RAG for Low-Resource Aromanian–Romanian Translation.” In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing – Natural Language Processing in the Generative AI Era, pp. 388–394. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, September 2025.
- Hirica, Ioan Alexandru, Tabusca, Stefana Arina, & Nisioi, Sergiu. “Arabic to Romanian Machine Translation: A Case Study on Distant Language Pairs.” In Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing – Natural Language Processing in the Generative AI Era, pp. 423–432. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, September 2025.
Presentations
- Hodivoianu, Anamaria, Kuvshynova, Oleksandra, Marin, Mircea, & Nisioi, Sergiu. “Using Eye Tracking Data for Lexical Simplification.” The 3rd Workshop on Eye Movements and the Assessment of Reading Comprehension, Stuttgart, 2025.
- 4 December 2025: Knall, Anna, Foucault, Deborah, & Bleotu, Adina Camelia. “Deictic and Anaphoric NP Reconstruction in Child Romanian. The Role of Gender in Visual Contexts.” Talk to be presented at the Workshop on Gender in Romance Languages, Going Romance 2025, Ca’ Foscari University, Venice.
- 21 November 2025: Bleotu, Adina Camelia. “Visual priming and the internal structure of denominals.” Talk presented at the Workshop on Lexical Representations and Phonological Representations, International Conference of the Faculty of Foreign Languages and Literatures, University of Bucharest.
- 6–9 November 2025: Bleotu, Adina Camelia, Benz, Anton, Foucault, Deborah, Tieu, Lyn, & Roeper, Tom. “Acquiring conditional disjunction: Romanian five-year-olds’ struggle with implicit ‘if not’.” Poster presented at the 50th Boston University Conference on Language Development (BUCLD 50), Boston University.
- 17–19 September 2025: Bleotu, Adina Camelia, Nicolae, Andreea, Benz, Anton, & Tieu, Lyn. “Conjunction as a basic meaning of disjunction: Evidence from Romanian 3-year-olds.” Talk presented at the 11th Experimental Pragmatics Conference (XPRAG 2025), University of Cambridge.
- 17–19 September 2025: Bleotu, Adina Camelia, Foucault, Deborah, Roeper, Thomas, Tieu, Lyn, & Benz, Anton. “Insights into the acquisition of conditional disjunction.” Poster (alternate oral presentation) presented at the 11th Experimental Pragmatics Conference (XPRAG 2025), University of Cambridge.
- 11 November 2025: Bleotu, Adina Camelia. “Go to bed early or you’ll wake up tired! How do 5-year-olds handle conditional disjunction?” (based on joint work with Foucault, Deborah; Benz, Anton; Tieu, Lyn; & Roeper, Tom). Invited talk at the Language Acquisition Lab, UMass Amherst.
- 11 September 2025: Bleotu, Adina Camelia. “Conjunction as a default meaning of disjunction.” Invited talk at the Workshop on Implicatures, part of the 15th International Tbilisi Symposium on Language, Logic and Computation (TbILLC 2025), co-organized by Milica Denić, Sarah Zobel, & Maria Aloni. https://www.marialoni.org//ImplicaturesWorkshop25
Human Language Technologies Research Center