CoToHiLi: Computational Tools for Historical Linguistics

Project PN-III-P4-ID-PCE-2020-1544, funded by the Romanian National Authority for Scientific Research and Innovation, UEFISCDI: “Dezvoltarea de sisteme automate suport pentru lingvistica istorică”.

Scientific Project Report

Abstract

This project develops a computational framework for historical linguistics (“Computational Tools for Historical Linguistics” – CoToHiLi). The general purpose of the CoToHiLi project is to integrate expert knowledge and computational power to address the following topics: cognate identification, cognate-borrowing discrimination, Latin protoword reconstruction and semantic divergence. The goal of the project is twofold: 1) to automate certain parts of the traditional workflow of the comparative method (such as the collection and selection of valid data, the initial pre-processing, or the automatic alignment based on predefined or inferred rules), and 2) to bring new insights or avenues of investigation that might not be easily accessible otherwise (for example, the automatic identification of patterns and regularities in large amounts of data). The project focuses on the Romance languages and will provide tools for the main Romance kernel group (Romanian, Italian, French, Spanish and Portuguese), together with their parent language, Latin. Nonetheless, we envision that the methodologies and computational tools proposed by the CoToHiLi project will also serve as a basis for further development for other comparable language families, including less-studied languages with scarce available resources.

Principal investigator

Members

Project objective for 2021: Related word analysis

To achieve this goal, the following activities were planned and executed:

Activity 1.1: Analysis and inspection of existing cognate resources in Romance languages (Ro, It, Es, Fr, Pt)

Activity 1.2: Design and construction of the database of cognate pairs for Romance languages

Activity 1.3: Analysis, design and development of computer-assisted tools for detecting cognate pairs

Activity 1.4: Analysis and inspection of borrowed word resources and their harmonization

Project objective for 2022: Borrowing detection and proto-word reconstruction

To achieve this goal, the following activities were planned and executed:

Activity 2.1: Designing and building a database of borrowings for Romance languages

Activity 2.2: Analysis, design and development of computer-assisted tools for detecting borrowings: analysis and design of appropriate methods; identifying optimal parameters for the models; testing, evaluation, result improvement and refinement; dissemination

Activity 2.3: Analysis of existing resources and designing new adequate resources for Latin proto-word reconstruction

Activity 2.4: Analysis, design and development of computer-assisted tools for detecting Latin proto-words

Project objective for 2023: Analysis of semantic divergence and model evaluation

To achieve this goal, the following activities were planned and executed:

Activity 3.1: Analyzing and enriching relevant resources and reviewing existing methods and problems in semantic divergence detection

Activity 3.2: Computer-assisted analysis of semantic change of cognate words in the considered languages

Activity 3.3: Identifying and analyzing statistical and linguistic patterns present in words that have undergone semantic change in the considered languages

Activity 3.4: Finalizing models, analyzing models and evaluating results, dissemination of results and identifying new research directions

Articles

  1. Simona Georgescu, 2023. “Organigrama semántico de una familia etimológica: el esp. maca ‘señal que queda en la fruta por algún daño recibido’ y sus cognados románicos”, Cuadernos del Instituto Historia de la Lengua, 16/2023.
  2. Simona Georgescu, Alina Maria Cristea, Anca Dinu, Bogdan Iordache, Ana Sabina Uban, Laurențiu Zoicaș, 2023. “Resurse digitale pentru analiza lexicului de origine latină în limbile romanice”, in Studii și cercetări lingvistice, p 65-76, LXXIV (1).
  3. Liviu P Dinu, Alina Maria Cristea, Anca Dinu, Simona Georgescu, Bogdan Iordache, Ana Sabina Uban, Laurențiu Zoicaș. “Computational Approaches for Romance Related Words Discrimination”. The 26th International Conference on Historical Linguistics (ICHL 2023), Heidelberg, Germany, September 4-8 2023 (abstract, oral presentation)
  4. Liviu P. Dinu, Ana Uban, Alina Maria Cristea, Anca Dinu, Bogdan Iordache, Simona Georgescu, Laurențiu Zoicaș, 2023. “RoBoCoP: A Comprehensive ROmance BOrrowing COgnate Package and Benchmark for Multilingual Cognate Identification.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023).
  5. Liviu P Dinu, Ioan-Bogdan Iordache, and Ana Sabina Uban. “CoToHiLi at SIGTYP 2023: Ensemble Models for Cognate and Derivative Words Detection.” Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (co-located with EACL 2023), Dubrovnik, Croatia. Association for Computational Linguistics. 2023.
  6. Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Ana Sabina Uban, Laurențiu Zoicaș, 2022. CoToHiLi at LSCDiscovery: the Role of Linguistic Features in Predicting Semantic Change. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change (LChange @ ACL 2022), pages 187-192, May 26-27, 2022, Dublin, Ireland. [PDF]
  7. Alina Maria Cristea, Anca Dinu, Liviu P Dinu, Simona Georgescu, Ana Uban, Laurențiu Zoicaș, 2022. CoToHiLi: Computational Tools for Historical Linguistics. In Proceedings of the 38th Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2022 @ SEPLN 2022), pages 31-34, September 21-23, 2022, A Coruña, Spain. [PDF]
  8. Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Ana Sabina Uban, Laurențiu Zoicaș, 2022. A semantic change time-lapse for Romance languages and English. The 25th International Conference on Historical Linguistics (ICHL25), August 1-5, 2022, Oxford, UK (abstract, oral presentation).
  9. Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Ana Sabina Uban, Laurențiu Zoicaș, 2022. Computational approaches for protoword reconstruction. The 25th International Conference on Historical Linguistics (ICHL25), August 1-5, 2022, Oxford, UK (abstract, oral presentation).
  10. Anca Dinu, Dan Ioan Dobre, Andreea-Codrina Moldovan and Elena-Daniela Nicolescu, Computational Analysis and Author Detection for Political Discourses of Romanian Presidents, in Anca Dinu, Madălina Chitez, Mihnea Dobre and Liviu P. Dinu (eds.) Recent Advances in Digital Humanities: Romance Language Applications, Peter Lang, pp 194-214, 2022.
  11. Sergiu Nisioi, Ana Sabina Uban and Liviu P. Dinu, 2022. Identifying Source-language Dialects in Translation. Mathematics 2022, Special Issue on Natural Language Processing (NLP) and Machine Learning (ML) - Theory and Applications.
  12. Alina Maria Cristea, Liviu P. Dinu, Simona Georgescu, Mihnea-Lucian Mihai, Ana Sabina Uban, 2021. Automatic Discrimination between Inherited and Borrowed Latin Words in Romance Languages. In Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings 2021), pages 2845–2855, Dominican Republic. [PDF]
  13. Simona Georgescu, Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Ana Sabina Uban, Laurențiu Zoicaș, 2021. Herramientas computacionales para el análisis del léxico de origen latino en inglés y en las lenguas románicas. In Proceedings Congreso Internacional “Ciencia, Tecnología y Lenguajes”, Universidad Complutense de Madrid, July 1-2, 2021.
  14. Liviu P. Dinu, Ioan-Bogdan Iordache, Ana Sabina Uban, Marcos Zampieri, 2021. A Computational Exploration of Pejorative Language in Social Media. In Proceedings of 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings 2021), Dominican Republic. [PDF]
  15. Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Ana Sabina Uban, Laurențiu Zoicaș, 2021. Towards an Etymological Map of Romanian. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2021), pages 315-324, September 1–3, 2021. [PDF]
  16. Anca Dinu, Andreea-Codrina Moldovan, 2021. Automatic Detection and Classification of Mental Illnesses from General Social Media Texts. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2021), pages 358–366, September 1–3, 2021.
  17. Ana Sabina Uban, Alina Maria Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Laurențiu Zoicaș, 2021. Tracking Semantic Change in Cognate Sets for English and Romance Languages. In Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change (LChange @ ACL-IJCNLP 2021), pages 64–74, Bangkok, Thailand (online). [PDF]
  18. Ana Sabina Uban, Cornelia Caragea, Liviu Dinu, 2021. Studying the Evolution of Scientific Topics and their Relationships. In Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Findings (ACL-IJCNLP Findings 2021), Bangkok, Thailand (online). [PDF]
  19. Ana Uban, Liviu P Dinu, 2020. Automatically Building a Multilingual Lexicon of False Friends With No Supervision.* In Proceedings of LREC 2020. [PDF]
  20. Alina Maria Ciobanu, Liviu P. Dinu, Laurențiu Zoicaș, 2020. Automatic Reconstruction of Missing Romanian Cognates and Unattested Latin Words.* In Proceedings of LREC 2020. [PDF]
  21. Alina Maria Ciobanu, Liviu P. Dinu, 2019. Automatic Identification and Production of Related Words for Historical Linguistics.* In Computational Linguistics, 45(4), 667–704.
  22. Ana Uban, Alina Maria Ciobanu, Liviu P. Dinu, 2019. Studying Laws of Semantic Divergence across Languages using Cognate Sets.* In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change (LChange @ ACL 2019). [PDF]
  23. Ana Uban, Alina Maria Ciobanu, Liviu P. Dinu, 2019. A Computational Approach to Measuring the Semantic Divergence of Cognates.* In Proceedings of CICLING 2019.
  24. Alina Maria Ciobanu, Liviu P. Dinu, 2018. Ab Initio: Automatic Latin Proto-word Reconstruction.* In Proceedings of COLING 2018, 1604-1614. [PDF]
  25. Alina Maria Ciobanu, Liviu P. Dinu, 2015. Automatic Discrimination between Cognates and Borrowings.* In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL (2) 2015), pages 431-437, July 26-31, 2015, Beijing, China. [PDF]
  26. Alina Maria Ciobanu, Liviu P. Dinu, 2014. An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian.* In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), pages 1047-1058, October 25–29, 2014, Doha, Qatar. [PDF]
  27. Alina Maria Ciobanu, Liviu P. Dinu, 2014. Automatic Detection of Cognates Using Orthographic Alignment.* In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL (2) 2014), pages 99-105, June 22-27, 2014, Baltimore, MD, USA. [PDF]
  28. Alina Maria Ciobanu, Liviu P. Dinu, 2014. Building a Dataset of Multilingual Cognates for the Romanian Lexicon.* In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pages 1038-1043, May 26-31 2014, Reykjavik, Iceland. [PDF]
  29. Alina Maria Ciobanu, Liviu P. Dinu, 2013. A Dictionary-Based Approach for Evaluating Orthographic Methods in Cognates Identification.* In Proceedings of Recent Advances in Natural Language Processing (RANLP 2013), pages 141–147, September 7-13, 2013, Hissar, Bulgaria. [PDF]

*Published before the beginning of the project

Chapters in Books

  1. Liviu P Dinu, Alina Cristea, Anca Dinu, Simona Georgescu, Andrea Sgarro, Ana Uban and Laurențiu Zoicaș, On the asymmetric intelligibility between Italian and Romanian. In G. Alboiu, D. Isac, A. Nicolae, M. Tănase-Dogaru, A. Tigău (eds.) A LIFE IN LINGUISTICS. A Festschrift for Alexandra Cornilescu on her 75th birthday, p. 207-216, Bucharest University Press, 2022 (ISBN 978-606-16-1355-7).
  2. Simona Georgescu, Alina Maria Cristea, Anca Dinu, Liviu P Dinu, Ana Uban, Laurențiu Zoicaș, 2022. Herramientas computacionales para el análisis del léxico de origen latino en inglés y en las lenguas románicas / Computing Tools for the Analysis of the Lexicon of Latin Origin in English and in the Romance Languages. In Universalidad y multiversalidad en literatura, lengua y traducción, p 456-467, Comares, Granada, Spain.
  3. Ana Uban, Alina Maria Ciobanu, Liviu P Dinu, 2021. Cross-lingual laws of semantic change. In Nina Tahmasebi, Lars Borin, Adam Jatowt, Yang Xu, Simon Hengchen, editors, Computational Approaches to Semantic Change. Berlin: Language Science Press, pages 219-260, 2021.
  4. Alina Cristea, Anca Dinu, Liviu P. Dinu, Simona Georgescu, Ana Uban. Computer-assisted methods in historical linguistics. In Mihai Dascălu, Bogdan Șandric (coordinators). Heritage in the digital era. Cases and Best Practices from Romania (ISBN 978-606-26-1486-7), Bucharest: Editura Pro Universitaria, p. 41-57, 2021.

Books

  1. Recent Advances in Digital Humanities: Romance Language Applications, Anca Dinu, Madalina Chitez, Liviu Dinu and Mihnea Dobre (eds.), Peter Lang, (252 p), 2022.
  2. Simona Georgescu, 2021. La regularidad en el cambio semántico. Las onomatopeyas en cuanto centros de expansión en las lenguas románicas. Éditions de linguistique et de philologie, Strasbourg, 2021.

Talks

  1. Simona Georgescu. “Rom. încă, it. anche, etc.: a new etymological insight”, Romance Linguistics Seminar, University of Cambridge, Cambridge, UK, January 4, 2023.
  2. Liviu P Dinu. Computational Approaches for Romance Related Words Discrimination. 26th International Conference on Historical Linguistics, Heidelberg, Germany, September 6, 2023.
  3. Liviu P Dinu. Computational approaches to natural languages similarities. Universita degli studi di Modena e Reggio Emilia, March 31, 2023.
  4. Anca Dinu, Digital Humanities: una nuova rivoluzione tecnologica, Biblioteca Statale Stelio Crise di Trieste, Trieste, Italy, 9 May 2023.
  5. Anca Dinu, Measuring semantic change for Romance languages and English, Universita degli studi di Modena e Reggio Emilia, Reggio Emilia, 31 March 2023.
  6. Liviu P Dinu. On the Romanian evolution via computational approaches. Smart Diaspora 2023, Universitatea de Vest, Timisoara, April 10, 2023.
  7. Liviu P Dinu. Sunt abordările computaționale soluții viabile pentru lingvistica istorică? Seminarul de cercetare al Facultatii de Limbi și Literaturi Straine, Universitatea din Bucuresti, April 25, 2023.
  8. Liviu P Dinu. On the Romanian evolution via computational approaches. Smart Diaspora 2023, Universitatea de Vest, Timisoara, April 11, 2023.
  9. Liviu P Dinu. Marcus și etimologiile limbii romane. Universitatea Apollonia, Iași, March 3, 2023.
  10. Liviu P Dinu. RoBoCoP@UniBuc (ROmance BOrrowings COgnates Package @ University of Bucharest). Conferinţa Anuală de Comunicare a Rezultatelor Cercetării la Universitatea din București, 2nd edition, November 16, 2023.
  11. Liviu P Dinu. Computational tools and resources for Romance historical linguistics. Recent Advances in Digital Humanities (2nd edition), Universitatea de Vest, Timisoara, November 17, 2023.
  12. Liviu P Dinu. Computational Approaches in Historical Linguistics (keynote speaker). The 18th International Conference on Linguistic Resources and Tools for Natural Language Processing (ConsILR-2023), Universitatea Transilvania, Brașov, December 13, 2023.
  13. Liviu P. Dinu, 2023. Computer-assisted tools for generating and discriminating related words in historical linguistics. Oberseminar, University of Tubingen, Department of Linguistics, January 23, 2023
  14. Anca Dinu, 2022. A semantic change time-lapse for Romance languages and English. 25th International Conference on Historical Linguistics (ICHL25), August 2, 2022, Oxford, UK.
  15. Liviu P. Dinu. Computational approaches for protoword reconstruction. 25th International Conference on Historical Linguistics (ICHL25), August 2, 2022, Oxford, UK.
  16. Liviu P Dinu, 2022. NLP-based methods in deception detection. Corpul de control al guvernului, Guvernul României, sala Transilvania, November 3, 2022.
  17. Liviu P. Dinu, 2022. Computational Tools in Historical Linguistics for cognate detection, borrowing discrimination and protoword reconstruction. Cardamom Seminar, National University of Ireland Galway, October 31, 2022.
  18. Liviu P. Dinu, 2022. On the Romance languages similarity: a syllabic-based approach. Programa de Doctorado de Sistemas Inteligentes, UNED, Madrid, Spain, June 13, 2022.
  19. Liviu P. Dinu, 2022. An old-fashion investigator. Interdisciplinary School of Doctoral Studies, University of Bucharest, March 17, 2022.
  20. Marcus și schimbarea stilistică. Universitatea Apollonia, Iași, March 1, 2022.
  21. Liviu P. Dinu, 2021. Are computational approaches viable solutions for borrowing and semantic change problems? Invited talk, Working group Language variation, interaction, pragmatics, Language In The Human-Machine Era, Online, October 20, 2021.
  22. Simona Georgescu, 2021. Ce pot învăța lingviștii de la computere și computerele de la lingviști? Colocviul Internațional Discurs critic și variație lingvistică, “Abordări inter- și transdisciplinare ale trecutului și prezentului”, Universitatea din Suceava, July 8-9, 2021.
  23. Simona Georgescu, 2021. Herramientas computacionales para la lingüística histórica. Congreso Internacional “Ciencia, Tecnología y Lenguajes”, Universidad Complutense de Madrid, July 1-2, 2021.
  24. Liviu P. Dinu, 2021. Marcus și timpurile sale. Facultatea de sociologie, Universitatea din Bucuresti, seria “Conceptualizări ale timpului în practica cercetării științifice. Dialoguri interdisciplinare”, June 10, 2021.
  25. Liviu P. Dinu, 2021. EthicAI, Goethe-Institut Bulgaria, EthicAI Linguistics workshop, June 8, 2021.
  26. Liviu P. Dinu, 2021. Etica si lingvistica computationala. Comisia Naționala a României pentru UNESCO, June 3, 2021.
  27. Liviu P. Dinu, 2021. Timpul și cuvintele. University of Bucharest, seria “Conceptualizări ale timpului în practica cercetării științifice. Dialoguri interdisciplinare”, May 13, 2021.
  28. Liviu P. Dinu, 2021. Cu un kil de carne de vacă nu mori de foame, cu un litru de vin nu mori de sete. Interdisciplinary School of Doctoral Studies, University of Bucharest, March 4, 2021.
  29. Liviu P. Dinu, 2021. From Classical to Computational Approaches in Historical Linguistics. Universitatea Apollonia, Iași, March 1, 2021.
  30. Liviu P. Dinu, 2021. O analiză computațională a discursului politic în Parlamentul European. Universitatea Apollonia, Iași, March 1, 2021.