With more than 300 million learners, Duolingo has the world's largest collection of language-learning data at its fingertips. This allows us to build unique systems, uncover new insights about the nature of language and learning, and apply existing theories at scales never before seen. We are also committed to sharing data and publications with the broader research community.
Data for the 2020 Shared Task on Simultaneous Translation And Paraphrase for Language Education (STAPLE). This corpus contains more than 3 million sentence pairs, in which English sentences are matched with multiple possible translations into Portuguese, Hungarian, Japanese, Korean, and Vietnamese.
Data for the 2018 Shared Task on Second Language Acquisition Modeling (SLAM). This corpus contains 7 million words produced by learners of English, Spanish, and French. It includes user demographics, morpho-syntactic metadata, response times, and longitudinal errors for more than 6,000 users over 30 days.
Public version of a tool used inside Duolingo to develop content that is appropriate for different learner levels (beginner, intermediate, etc.). It is aligned to the CEFR framework and uses multilingual domain adaptation to transfer what it learns from English CEFR-labeled vocabulary to other languages.
Data used to develop our half-life regression (HLR) spaced repetition algorithm. This is a collection of 13 million user-word pairs for learners of several languages with a variety of language backgrounds. It includes practice recall rates, lag times between practices, and morpho-lexical metadata.
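For readers unfamiliar with HLR, here is a minimal Python sketch of the core idea as described in the published model: recall probability decays exponentially with the lag since the last practice, and the half-life is estimated as 2 raised to a weighted sum of features. The function names, feature choices, and weight values below are illustrative assumptions, not the production implementation or the dataset's actual schema.

```python
def predicted_recall(lag_days: float, half_life_days: float) -> float:
    """Recall probability p = 2 ** (-lag / half_life)."""
    return 2.0 ** (-lag_days / half_life_days)


def estimated_half_life(weights: list[float], features: list[float]) -> float:
    """Half-life h = 2 ** (weights . features), in days."""
    dot = sum(w * x for w, x in zip(weights, features))
    return 2.0 ** dot


# Hypothetical feature vector: [bias, times seen, times recalled correctly].
# The weights are made-up values for illustration only.
weights = [1.0, 0.1, 0.3]
features = [1.0, 4.0, 3.0]

half_life = estimated_half_life(weights, features)
recall_after_one_week = predicted_recall(7.0, half_life)
```

When the lag equals the half-life, the predicted recall is exactly 0.5, which is what gives the half-life its name.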
We are a diverse team of experts in AI and machine learning, data science, learning sciences, UX research, linguistics, and psychometrics. We work closely with product teams to build innovative features based on world-class research. We are growing, so check out our job openings below!