Duolingo Research

Science powers our mission to make language education free and accessible to everyone.

About Us

Our Team

  • Burr Settles AI + Machine Learning
  • André Horie AI + Machine Learning
  • Bożena Pająk Learning + Curriculum
  • Erin Gustafson Data Science
  • Chris Brust AI + Machine Learning
  • Cindy Berger Learning + Curriculum
  • Angela DiCostanzo Learning + Curriculum
  • Cindy Blanco Learning + Curriculum
  • Lisa Bromberg Learning + Curriculum
  • Jenna Lake AI + Machine Learning
  • Bill McDowell AI + Machine Learning
  • Lowell Reade UX Research
  • Klinton Bicknell AI + Machine Learning
  • Will Monroe AI + Machine Learning
  • Geoff LaFlair Assessment + Psychometrics
  • Hope Wilson Learning + Curriculum
  • Kevin Yancey AI + Machine Learning
  • Xiangying Jiang Learning + Curriculum
  • Joseph Rollinson Engineering
  • Jessica Becker Learning + Curriculum
  • Graham Arthur Data Science
  • Stephen Mayhew AI + Machine Learning
  • Meredith McDermott UX Research
  • Andrew Runge AI + Machine Learning
  • Connor Brem AI + Machine Learning
  • Anna Savage UX Research
  • Emily Moline Learning + Curriculum

Data & Tools

  • 2020 STAPLE Shared Task Data

    Data for the 2020 Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). This corpus contains more than 3 million pairs of English sentences with multiple possible translations into Portuguese, Hungarian, Japanese, Korean, and Vietnamese.

  • 2018 SLAM Shared Task Data

    Data for the 2018 Shared Task on Second Language Acquisition Modeling (SLAM). This corpus contains 7 million words produced by learners of English, Spanish, and French. It includes user demographics, morpho-syntactic metadata, response times, and longitudinal errors for more than 6,000 users over 30 days.

  • CEFR Checker

    Public version of a tool used inside Duolingo to develop content that is appropriate for different learner levels (beginner, intermediate, etc.). It is aligned to the CEFR framework and uses multilingual domain adaptation to transfer what it learns from English CEFR-labeled vocabulary to other languages.

  • Spaced Repetition Data

    Data used to develop our half-life regression (HLR) spaced repetition algorithm. This is a collection of 13 million user-word pairs for learners of several languages with a variety of language backgrounds. It includes practice recall rates, lag times between practices, and other morpho-lexical metadata.
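
As a rough illustration of how the recall rates and lag times in the Spaced Repetition Data relate to half-life regression, here is a minimal sketch. It assumes the published form of the model: predicted recall decays as 2^(-lag / half-life), and the half-life is 2 raised to a weighted sum of features. The feature names and weights below are hypothetical placeholders, not values or column names taken from the dataset.

```python
def predicted_recall(lag_days: float, half_life_days: float) -> float:
    """Probability of recalling a word after lag_days, given its half-life."""
    # Recall halves every time one half-life elapses.
    return 2.0 ** (-lag_days / half_life_days)


def estimated_half_life(weights: dict, features: dict) -> float:
    """Half-life in days as 2 raised to the weighted sum of features."""
    # Features missing from the input count as zero.
    score = sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 2.0 ** score


# Hypothetical weights and features, for illustration only.
weights = {"bias": 1.0, "times_correct": 0.5, "times_wrong": -0.5}
features = {"bias": 1.0, "times_correct": 4.0, "times_wrong": 2.0}

half_life = estimated_half_life(weights, features)  # 2 ** (1 + 2 - 1) = 4 days
recall_now = predicted_recall(0.0, half_life)       # no lag: recall is 1.0
recall_later = predicted_recall(half_life, half_life)  # one half-life: 0.5
```

In practice the weights would be fit to observed recall rates and lag times; the sketch only shows the prediction step.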

Publications