Duolingo Research

Science powers our mission to make language education free and accessible to everyone.

About Us

With more than 300 million learners, Duolingo has the world's largest collection of language-learning data at its fingertips. This allows us to build unique systems, uncover new insights about the nature of language and learning, and apply existing theories at scales never before seen. We are also committed to sharing our data and findings with the broader research community.

Publications

  • NAACL 2018 • Duolingo Shared Task

    Second Language Acquisition Modeling

    We present the task of second language acquisition (SLA) modeling. Given a history of errors made by learners of a second language, the task is to predict errors that they are likely to make at arbitrary points in the future. We describe a large corpus of more than 7M words produced by more than 6k learners of English, Spanish, and French using Duolingo, a popular online language-learning app. Then we report on the results of a shared task challenge aimed at studying the SLA task via this corpus, which attracted 15 teams and synthesized work from various fields including cognitive science, linguistics, and machine learning.

    B. Settles, C. Brust, E. Gustafson, M. Hagiwara and N. Madnani
  • ACL 2016

    A Trainable Spaced Repetition Model for Language Learning

    We present half-life regression (HLR), a novel model for spaced repetition practice with applications to second language acquisition. HLR combines psycholinguistic theory with modern machine learning techniques, indirectly estimating the "half-life" of a word or concept in a student’s long-term memory. We use data from Duolingo — a popular online language learning application — to fit HLR models, reducing error by 45%+ compared to several baselines at predicting student recall rates. HLR model weights also shed light on which linguistic concepts are systematically challenging for second language learners. Finally, HLR was able to improve Duolingo daily student engagement by 12% in an operational user study. (A sketch of the HLR formulation appears after this publication list.)

    B. Settles and B. Meeder
  • Cognitive Science 2016

    Self-directed Learning Favors Local, Rather Than Global, Uncertainty

    Collecting (or "sampling") information that one expects to be useful is a powerful way to facilitate learning. However, relatively little is known about how people decide which information is worth sampling over the course of learning. We describe several alternative models of how people might decide to collect a piece of information inspired by "active learning" research in machine learning. We additionally provide a theoretical analysis demonstrating the situations under which these models are empirically distinguishable, and we report a novel empirical study that exploits these insights. Our model-based analysis of participants’ information gathering decisions reveals that people prefer to select items which resolve uncertainty between two possibilities at a time rather than items that have high uncertainty across all relevant possibilities simultaneously. Rather than adhering to strictly normative or confirmatory conceptions of information search, people appear to prefer a "local" sampling strategy, which may reflect cognitive constraints on the process of information gathering.

    D.B. Markant, B. Settles, and T.M. Gureckis
  • EDM 2015 • Best Paper Award

    Mixture Modeling of Individual Learning Curves

    We show that student learning can be accurately modeled using a mixture of learning curves, each of which specifies error probability as a function of time. This approach generalizes Knowledge Tracing, which can be viewed as a mixture model in which the learning curves are step functions. We show that this generality yields order-of-magnitude improvements in prediction accuracy on real data. Furthermore, examination of the learning curves provides actionable insights into how different segments of the student population are learning. To make our mixture model more expressive, we allow the learning curves to be defined by generalized linear models with arbitrary features. This approach generalizes Additive Factor Models and Performance Factors Analysis, and outperforms them on a large, real world dataset.

    M. Streeter
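
To make the half-life regression (HLR) idea above more concrete, here is a minimal sketch of its two core formulas: a predicted recall probability p = 2^(-Δ/h), where Δ is the time since a word was last practiced, and an estimated half-life h = 2^(θ·x) computed from a weight vector θ and features x of the learner's practice history. The feature names, weights, and clipping bounds below are illustrative assumptions, not Duolingo's production code.

```python
def predict_half_life(theta, features):
    """Estimated half-life in days: h = 2^(theta · x)."""
    dot = sum(theta.get(name, 0.0) * value for name, value in features.items())
    # Clip the exponent to keep the half-life in a sane range (assumed bounds).
    return 2.0 ** max(min(dot, 12.0), -12.0)

def predict_recall(theta, features, lag_days):
    """Predicted recall probability: p = 2^(-lag / h)."""
    return 2.0 ** (-lag_days / predict_half_life(theta, features))

# Illustrative (user, word) practice-history features; the paper uses counts of
# prior exposures plus lexeme tags, but these particular names are assumptions.
features = {"bias": 1.0, "times_seen": 9.0, "times_correct": 7.0, "times_wrong": 2.0}
theta = {"bias": 0.5, "times_seen": 0.05, "times_correct": 0.25, "times_wrong": -0.3}

print(round(predict_recall(theta, features, lag_days=3.0), 3))
```

In the paper, the weights θ are fit from practice logs by minimizing the error between predicted and observed recall rates (with regularization); that training loop is omitted here.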

Tools & Data

  • CEFR Checker

    A public version of a tool used inside Duolingo to develop content that is appropriate for different learner levels (beginner, intermediate, etc.). It is aligned to the CEFR framework and uses multilingual domain adaptation to transfer what it learns from English CEFR-labeled vocabulary to other languages.

  • SLAM Shared Task

    Data for the 2018 Shared Task on Second Language Acquisition Modeling (SLAM). This corpus contains 7 million words produced by learners of English, Spanish, and French. It includes user demographics, morpho-syntactic metadata, response times, and longitudinal errors for 6k+ users over 30 days.

  • Spaced Repetition

    Data used to develop our half-life regression (HLR) spaced repetition algorithm. This is a collection of 13 million user-word pairs for learners of several languages with a variety of language backgrounds. It includes practice recall rates, lag times between practices, and other morpho-lexical metadata.
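
As a small example of working with recall rates and lag times like those in this dataset, the HLR relationship p = 2^(-lag/h) can be inverted to recover an empirical half-life, h = -lag / log2(p). This is a sketch only; the column names below are hypothetical, so check the released files for the actual schema.

```python
import math

def empirical_half_life(recall_rate, lag_days, eps=1e-3):
    """Invert p = 2^(-lag/h) to get h = -lag / log2(p).
    Recall rates are clipped away from 0 and 1 so the log stays finite."""
    p = min(max(recall_rate, eps), 1.0 - eps)
    return -lag_days / math.log2(p)

# Hypothetical row; the real files may use different field names.
row = {"p_recall": 0.75, "delta_days": 2.0}
print(round(empirical_half_life(row["p_recall"], row["delta_days"]), 2))
```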

Our Team

We are a diverse team of experts in AI and machine learning, data science, learning sciences, UX research, linguistics, and psychometrics. We work closely with product teams to build innovative features based on world-class research. We are growing, so check out our job openings below!

  • Burr Settles AI + Machine Learning
  • André Horie AI + Machine Learning
  • Bożena Pająk Learning Science
  • Erin Gustafson Data Science
  • Chris Brust AI + Machine Learning
  • Cindy Berger Learning Science
  • Angela DiCostanzo Curriculum Design
  • Cindy Blanco Learning Science
  • Lisa Bromberg Curriculum Design
  • Jenna Lake AI + Machine Learning
  • Bill McDowell AI + Machine Learning
  • Lowell Reade UX Research
  • Klinton Bicknell AI + Machine Learning
  • Will Monroe AI + Machine Learning
  • Geoff LaFlair Testing + Psychometrics
  • Hope Wilson Learning Science
  • Kevin Yancey AI + Machine Learning
  • Xiangying Jiang Learning Science
  • Jessica Becker Curriculum Design
  • Graham Arthur Data Science
  • Stephen Mayhew AI + Machine Learning
  • Meredith McDermott UX Research
  • Andrew Runge AI + Machine Learning
  • Connor Brem AI + Machine Learning