Data Skeptic is a data science podcast exploring machine learning, statistics, artificial intelligence, and other data topics through short tutorials and interviews with domain experts.
Very Large Corpora and Zipf's Law
The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to face a problem that was highly dimensional. Naive Bayes algorithm was celebrated in NLP applications because of its ability to efficiently process highly dimensional data. Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses to different NLP problems, an early paper titledScaling to Very Very Large Corpora for Natural Language Disambiguationpopularized one somewhat surprising idea. For many NLP tasks, simply providing a large corpus of examples not only improved accuracy, but it also showed that asymptotically, some algorithms yielded more improvement from working on very, very large corpora. Although not explicitly in about NLP, the noteworthy paperThe Unreasonable Effectiveness of Dataemphasizes this point further while paying homage to the classic treatiseThe Unreasonable Effectiveness of Mathematics in the Natural Sciences. In this episode, Kyle shares a few thoughts along these lines with Linh Da. The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be proportional to its rank in the frequency table.
Semantic search at Github
Github is many things besides source control. It's a social network, even though not everyone realizes it. It's a vast repository of code. It's a ticketing and project management system. And of course, it has search as well. In this episode, Kyle interviews Hamel Husain about his research into semantic code search.
Let's Talk About Natural Language Processing
This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of the classic problems are, and just a bit on approaches. Finishing out the show is an interview with Lucy Park about her work on theKoNLPy library for Korean NLP in Python. If you want to share your NLP project, please join our Slack channel. We're eager to see what listeners are working on! http://konlpy.org/en/latest/
See More Episodes on Himalaya