Learning language(s)

How humans do and machines might.

Version: 🧮.8
Updated: Jan ’26
Author: Bastian Bunzeck
License: MIT

Blog

I am thinking about starting a blog. Content (probably) coming soon!

About me

Hi! My name is Bastian Bunzeck and I am a third-year PhD student at Bielefeld University. I work in the Computational Linguistics group (CLAUSE) under the supervision of Prof. Sina Zarrieß. I am also a member of the Collaborative Research Center (CRC) 1646 – Linguistic Creativity in Communication in Bielefeld. Before my PhD, I studied English/American Studies and Computer Science at Friedrich Schiller University Jena in Germany and Katholieke Universiteit Leuven in Belgium. In Jena, I helped develop the corpus annotation tool Hexatomic and also worked at the English department.

I am interested in the relationship between linguistics (especially usage-based and cognitive approaches) on the one hand and natural language processing on the other. The neural turn in NLP has realized many ideas proposed much earlier in the literature on connectionist modelling. Yet it remains unclear how well state-of-the-art models and cognitive/linguistic reality actually map onto one another. In my research, I explore how linguistic knowledge emerges in human language learners and neural language models, mostly from a usage-based and constructionist perspective. Currently, I focus on very small language models trained on small amounts of data and their comparability to child language development, lately also from a multilingual perspective!

If you are looking for ways to contact me, check out my page in the Bielefeld University staff directory or send me an email (firstname.lastname@uni-bielefeld.de).

Publications

For up-to-date overviews, also check Google Scholar, PUB – Publications at Bielefeld University, and my ORCID page.

Preprints

Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, Francois Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, and Leshem Choshen. 2025. BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data. https://arxiv.org/abs/2510.10159

Conference/workshop papers

Francesca Padovani, Bastian Bunzeck, Manar Ali, Omar Momen, Arianna Bisazza, Hendrik Buschmeier, and Sina Zarrieß. 2025. Dialogue is not enough to make a communicative BabyLM (but neither is developmentally inspired reinforcement learning). In Proceedings of the First BabyLM Workshop, pages 421–435, Suzhou, China. Association for Computational Linguistics. https://aclanthology.org/2025.babylm-main.29/
Bastian Bunzeck, Daniel Duran, and Sina Zarrieß. 2025. Do Construction Distributions Shape Formal Language Learning In German BabyLMs? In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 169–186, Vienna, Austria. Association for Computational Linguistics. https://aclanthology.org/2025.conll-1.12/
Bastian Bunzeck and Sina Zarrieß. 2025. Subword models struggle with word learning, but surprisal hides it. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 286–300, Vienna, Austria. Association for Computational Linguistics. https://aclanthology.org/2025.acl-short.24/
Bastian Bunzeck, Daniel Duran, Leonie Schade, and Sina Zarrieß. 2025. Small language models also work with small vocabularies: Probing the linguistic abilities of grapheme- and phoneme-based baby llamas. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6039–6048, Abu Dhabi, UAE. Association for Computational Linguistics. https://aclanthology.org/2025.coling-main.404/
Bastian Bunzeck, Daniel Duran, Leonie Schade, and Sina Zarrieß. 2024. Graphemes vs. phonemes: battling it out in character-based language models. In The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning, pages 54–64, Miami, FL, USA. Association for Computational Linguistics. https://aclanthology.org/2024.conll-babylm.5/
Bastian Bunzeck and Sina Zarrieß. 2024. The SlayQA benchmark of social reasoning: Testing gender-inclusive generalization with neopronouns. In Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP, pages 42–53, Miami, Florida, USA. Association for Computational Linguistics. https://aclanthology.org/2024.genbench-1.3/
Bastian Bunzeck and Sina Zarrieß. 2024. Fifty shapes of BLiMP: Syntactic learning curves in language models are not uniform, but sometimes unruly. In Proceedings of the 2024 CLASP Conference on Multimodality and Interaction in Language Learning, pages 39–55, Gothenburg, Sweden. Association for Computational Linguistics. https://aclanthology.org/2024.clasp-1.7/
Bastian Bunzeck and Sina Zarrieß. 2023. GPT-wee: How small can a small language model really get? In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 7–18, Singapore. Association for Computational Linguistics. https://aclanthology.org/2023.conll-babylm.2/
Bastian Bunzeck and Sina Zarrieß. 2023. Entrenchment matters: Investigating positional and constructional sensitivity in small and large language models. In Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD), pages 25–37, Gothenburg, Sweden. Association for Computational Linguistics. https://aclanthology.org/2023.clasp-1.3

Journal papers

Bastian Bunzeck and Holger Diessel. 2025. The richness of the stimulus: Constructional variation and development in child-directed speech. First Language, 45(2):152–176. https://doi.org/10.1177/01427237241303225
Paula Wojcik, Bastian Bunzeck, and Sina Zarrieß. 2023. The Wikipedia Republic of Literary Characters. Journal of Cultural Analytics, 8(2). https://doi.org/10.22148/001c.70251
Stephan Druskat, Thomas Krause, Clara Lachenmaier, and Bastian Bunzeck. 2023. Hexatomic: An extensible, OS-independent platform for deep multi-layer linguistic annotation of corpora. Journal of Open Source Software, 8(86):4825. https://doi.org/10.21105/joss.04825

Miscellaneous

Bastian Bunzeck and Stefan Hartmann. 2025. Thomas Herbst and Thomas Hoffmann, A Construction Grammar of the English language: CASA – a constructionist approach to syntactic analysis (Cognitive Linguistics in Practice 5). Amsterdam and Philadelphia: John Benjamins, 2024. Pp. xvi + 315. ISBN 9789027214980. English Language and Linguistics, pages 1–5. https://doi.org/10.1017/S1360674325100506

Talks and presentations

2025

Do Construction Distributions Shape Formal Language Learning In German BabyLMs?, (non-archival poster presentation), The Second International Workshop on Construction Grammars and NLP (CxGs+NLP 2025), Düsseldorf (Germany)
Developmentally plausible pretraining, now also auf Deutsch: a BabyLM Dataset for German, (non-archival poster presentation), KONVENS 2025, University of Hildesheim (Germany)
Child-directed speech is fine-tuned to children’s developmental needs, (peer-reviewed poster presentation), Bialogue 2025 – The 29th Workshop on the Semantics and Pragmatics of Dialogue, Bielefeld University (Germany)
What LLMs can do for linguistics…and what linguistics can do for LLMs, (invited guest lecture, seminar on empirical linguistics), Heinrich Heine Universität Düsseldorf (Germany)
Word learning in LMs: A trilogy in four parts, (oral presentation), 1st RTG SFB 1646 & Friends Symposium, Bielefeld University (Germany)
Word learning in (all kinds of) German and English BabyLMs, (poster presentation), HumanCLAIM Workshop, University of Göttingen (Germany)

2024

Fifty shapes of BLiMP: syntactic learning curves in language models are not uniform, but sometimes unruly, (non-archival poster presentation), BlackboxNLP 2024 at EMNLP 2024, Miami/Florida (US)
Constructions in child-directed speech (with Holger Diessel), (peer-reviewed oral presentation), 10th International Conference of the German Cognitive Linguistics Association, Osnabrück University (Germany)
Generating authentic child speech from little data, (poster presentation), NLG in the Lowlands 2024, Bielefeld University (Germany)

2023

GPT-wee: Experiments in downscaling and curriculum learning, (poster presentation), SAIL Workshop on Fundamental Limits of Large Language Models, Bielefeld University (Germany)
From Byte to Babel: Large Language Models and the Tower of Linguistic Knowledge, (peer-reviewed oral presentation), META-LING 2023 - Methodological Exploration and Technological Advances in Linguistics, University of Bamberg (Germany)
Where and How Do Literary Characters Figure in Wikipedia? (with Sina Zarrieß), (invited presentation), International Workshop | Wikipedia, Wikidata and Wikibase: Usage Scenarios for Literary Studies, Free University of Berlin (Germany)

Teaching

Winter term 2025/2026

Introduction to computational linguistics (Einführung in die Computerlinguistik) – practical sessions, accompanying lectures by Sina Zarrieß
Methods of applied computational linguistics (Methoden der angewandten Computerlinguistik) – lectures and practical sessions, taught jointly with Sina Zarrieß

Summer term 2025

Neural nets in language technology – seminar (taught in English)

Winter term 2024/2025

Introduction to computational linguistics (Einführung in die Computerlinguistik) – practical sessions, accompanying lectures by Sina Zarrieß
Methods of applied computational linguistics (Methoden der angewandten Computerlinguistik) – lectures and practical sessions

Summer term 2024

Methods of applied computational linguistics (Methoden der angewandten Computerlinguistik) – lectures and practical sessions
Neural nets in language technology (Neuronale Netze in der Sprachverarbeitung) – practical sessions, accompanying lectures by Sina Zarrieß

Winter term 2023/2024

Introduction to computational linguistics (Einführung in die Computerlinguistik) – practical sessions, accompanying lectures by Sina Zarrieß
Project seminar: Modeling and analysis of dialogue (Projektseminar: Modellierung und Analyse von sprachlichen Dialogen), taught jointly with Simeon Schüz

Summer term 2023

Methods of applied computational linguistics (Methoden der angewandten Computerlinguistik) – practical sessions, accompanying lectures by Sina Zarrieß

Design adapted from Oskar Wickström’s The Monospace Web