Towards Data Science

124 episodes
Researchers and business leaders at the forefront of the field unpack the most pressing questions around data science and AI.

There’s a website called thispersondoesnotexist.com. When you visit it, you’re confronted by a high-resolution, photorealistic AI-generated picture of a human face. As the website’s name suggests, there’s no human being on the face of the earth who looks quite like the person staring back at you on the page. Each of those generated pictures is a piece of data that captures so much of the essence of what it means to look like a human being. And yet it does so without telling you anything whatsoever about any particular person. In that sense, it’s fully anonymous human face data. That’s impressive enough, and it speaks to how far generative image models have come over the last decade. But what if we could do the same for any kind of data? What if I could generate an anonymized set of medical records or financial transaction data that captures all of the latent relationships buried in a private dataset, without the risk of leaking sensitive information about real people? That’s...

Two ML researchers with world-class pedigrees who decided to build a company that puts AI on the blockchain. Now to most people (myself included), “AI on the blockchain” sounds like a winning entry in some kind of startup buzzword bingo. But what I discovered talking to Jacob and Ala was that they actually have good reasons to combine those two ingredients. At a high level, doing AI on a blockchain allows you to decentralize AI research and reward labs for building better models, and not for publishing papers in flashy journals with often biased reviewers.

And that’s not all: as we’ll see, Ala and Jacob are taking on some of the thorniest current problems in AI with their decentralized approach to machine learning. Everything from the problem of designing robust benchmarks to rewarding good AI research and even the centralization of power in the hands of a few large companies building powerful AI systems: these problems are all in their sights as they build out Bittensor, their AI-on-the-blockchain startup. Ala and Jacob joined me to talk about all those things and more on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:40 Ala and Jacob’s backgrounds
- 4:00 The basics of AI on the blockchain
- 11:30 Generating human value
- 17:00 Who sees the benefit?
- 22:00 Use of GPUs
- 28:00 Models learning from each other
- 37:30 The size of the network
- 45:30 The alignment of these systems
- 51:00 Buying into a system
- 54:00 Wrap-up

As you might know if you follow the podcast, we usually talk about the world of cutting-edge AI capabilities, and some of the emerging safety risks and other challenges that the future of AI might bring. But I thought that for today’s episode, it would be fun to change things up a bit and talk about the applied side of data science, and how the field has evolved over the last year or two.

And I found the perfect guest to do that with: her name is Sadie St. Lawrence, and among other things, she’s the founder of Women in Data, a community that helps women enter the field of data and advance throughout their careers. She’s also the host of the Data Bytes podcast, a seasoned data scientist and a community builder extraordinaire. Sadie joined me to talk about her founder’s journey, what data science looks like today, and even the possibilities that blockchains introduce for data science on this episode of the Towards Data Science podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:00 Founding Women in Data
- 6:30 Having gendered conversations
- 11:00 The cultural aspect
- 16:45 Opportunities in blockchain
- 22:00 The blockchain database
- 32:30 Data science education
- 37:00 GPT-3 and unstructured data
- 39:30 Data science as a career
- 42:50 Wrap-up

If the name data2vec sounds familiar, that’s probably because it made quite a splash on social and even traditional media when it came out, about two months ago. It’s an important entry in what is now a growing list of strategies that are focused on creating individual machine learning architectures that handle many different data types, like text, image and speech.

Most self-supervised learning techniques involve getting a model to take some input data (say, an image or a piece of text) and mask out certain components of those inputs (say, by blacking out pixels or words) in order to get the model to predict those masked-out components. That “filling in the blanks” task is hard enough to force AIs to learn facts about their data that generalize well, but it also means training models to perform tasks that are very different depending on the input data type. Filling in blacked-out pixels is quite different from filling in blanks in a sentence, for example.

So what if there was a way to come up with one task that we could use to train machine learning models on any kind of data? That’s where data2vec comes in.

For this episode of the podcast, I’m joined by Alexei Baevski, a researcher at Meta AI and one of the creators of data2vec. In addition to data2vec, Alexei has been involved in quite a bit of pioneering work on text and speech models, including wav2vec, Facebook’s widely publicized unsupervised speech model. Alexei joined me to talk about how data2vec works and what’s next for that research direction, as well as the future of multi-modal learning.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:00 Alexei’s background
- 10:00 Software engineering knowledge
- 14:10 Role of data2vec in progression
- 30:00 Delta between student and teacher
- 38:30 Losing interpreting ability
- 41:45 Influence of greater abilities
- 49:15 Wrap-up
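To make the “filling in the blanks” setup from the data2vec episode summary concrete, here is a minimal sketch of how masked training examples might be built for text. It illustrates only the generic masked-prediction task described in the summary, not data2vec itself; the `mask_tokens` helper, the `[MASK]` placeholder and the `mask_prob` parameter are assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden.

    Returns the masked sequence plus a dict mapping each masked position
    to the original token, which is what a model would be asked to predict.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # prediction target for position i
        else:
            masked.append(tok)
    return masked, targets


if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    masked, targets = mask_tokens(sentence, mask_prob=0.5, seed=1)
    print(masked)
    print(targets)
```

The same idea applies to images by hiding patches of pixels instead of words, which is why the prediction target normally differs from one data type to the next.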

AI scaling has really taken off. Ever since GPT-3 came out, it’s become clear that one of the things we’ll need to do to move beyond narrow AI and towards more generally intelligent systems is to massively scale up the size of our models, the amount of processing power they consume and the amount of data they’re trained on, all at the same time. That’s led to a huge wave of highly scaled models that are incredibly expensive to train, largely because of their enormous compute budgets.

But what if there was a more flexible way to scale AI, one that allowed us to decouple model size from compute budgets, so that we can track a more compute-efficient course to scale?

That’s the promise of so-called mixture of experts models, or MoEs. Unlike more traditional transformers, MoEs don’t update all of their parameters on every training pass. Instead, they route inputs intelligently to sub-models called experts, which can each specialize in different tasks. On a given training pass, only the experts that an input was routed to have their parameters updated. The result is a sparse model, a more compute-efficient training process, and a new potential path to scale.

Google has been pushing the frontier of research on MoEs, and my two guests today in particular have been involved in pioneering work on that strategy (among many others!). Liam Fedus and Barrett Zoph are research scientists at Google Brain, and they joined me to talk about AI scaling, sparsity and the present and future of MoE models on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:15 Guests’ backgrounds
- 8:00 Understanding specialization
- 13:45 Speculations for the future
- 21:45 Switch transformer versus dense net
- 27:30 More interpretable models
- 33:30 Assumptions and biology
- 39:15 Wrap-up
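The routing idea in the mixture-of-experts summary can be illustrated with a toy example. This is a minimal sketch under heavy simplifying assumptions, not the architecture discussed in the episode: inputs are single numbers, each expert is one scalar weight, routing is top-1, and the router is kept frozen. The `TinyMoE` class and all of its method names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyMoE:
    """Toy top-1 mixture-of-experts layer on scalar inputs (illustrative only)."""

    def __init__(self, num_experts, seed=0):
        rng = random.Random(seed)
        # Each expert is a single scalar weight: y = w * x.
        self.expert_w = [rng.uniform(-1, 1) for _ in range(num_experts)]
        # Router: one (weight, bias) pair per expert, giving one score per expert.
        # Assumption: the router stays frozen in this sketch.
        self.router = [(rng.uniform(-1, 1), rng.uniform(-1, 1))
                       for _ in range(num_experts)]

    def route(self, x):
        """Return the index of the highest-scoring expert and its gate value."""
        probs = softmax([a * x + b for a, b in self.router])
        k = max(range(len(probs)), key=probs.__getitem__)
        return k, probs[k]

    def forward(self, x):
        k, gate = self.route(x)
        return gate * self.expert_w[k] * x, k

    def train_step(self, x, target, lr=0.1):
        """One SGD step on squared error; only the routed expert is updated."""
        k, gate = self.route(x)
        pred = gate * self.expert_w[k] * x
        grad = 2 * (pred - target) * gate * x  # d(loss)/d(expert_w[k])
        self.expert_w[k] -= lr * grad
        return k


if __name__ == "__main__":
    moe = TinyMoE(num_experts=4)
    before = list(moe.expert_w)
    chosen = moe.train_step(x=1.5, target=2.0)
    print("routed to expert", chosen)
    print("weights changed:", [w != b for w, b in zip(moe.expert_w, before)])
```

Because only one expert takes part in each step, the compute per input stays roughly constant as more experts are added, which is the decoupling of model size from compute that the summary describes.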

There’s an idea in machine learning that most of the progress we see in AI doesn’t come from new algorithms or model architectures. Instead, some argue, progress almost entirely comes from scaling up compute power, datasets and model sizes, and besides those three ingredients, nothing else really matters.

Through that lens, the history of AI becomes the history of processing power and compute budgets. And if that turns out to be true, then we might be able to do a decent job of predicting AI progress by studying trends in compute power and their impact on AI development.

And that’s why I wanted to talk to Jaime Sevilla, an independent researcher and AI forecaster, and affiliate researcher at Cambridge University’s Centre for the Study of Existential Risk, where he works on technological forecasting and understanding trends in AI in particular. His work’s been cited in a lot of cool places, including Our World In Data, who used his team’s data to put together an exposé on trends in compute. Jaime joined me to talk about compute trends and AI forecasting on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:00 Trends in compute
- 4:30 Transformative AI
- 13:00 Industrial applications
- 19:00 GPT-3 and scaling
- 25:00 The two papers
- 33:00 Biological anchors
- 39:00 Timing of projects
- 43:00 The trade-off
- 47:45 Wrap-up

Generating well-referenced and accurate Wikipedia articles has always been an important problem: Wikipedia has essentially become the Internet's encyclopedia of record, and hundreds of millions of people use it to understand the world.

But over the last decade Wikipedia has also become a critical source of training data for data-hungry text generation models. As a result, any shortcomings in Wikipedia’s content are at risk of being amplified by the text generation tools of the future. If one type of topic or person is chronically under-represented in Wikipedia’s corpus, we can expect generative text models to mirror, or even amplify, that under-representation in their outputs.

Through that lens, the project of Wikipedia article generation is about much more than it seems: it’s quite literally about setting the scene for the language generation systems of the future, and empowering humans to guide those systems in more robust ways.

That’s why I wanted to talk to Meta AI researcher Angela Fan, whose latest project is focused on generating reliable, accurate, and structured Wikipedia articles. She joined me to talk about her work, the implications of high-quality long-form text generation, and the future of human/AI collaboration on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 1:45 Journey into Meta AI
- 5:45 Transition to Wikipedia
- 11:30 How articles are generated
- 18:00 Quality of text
- 21:30 Accuracy metrics
- 25:30 Risk of hallucinated facts
- 30:45 Keeping up with changes
- 36:15 UI/UX problems
- 45:00 Technical cause of gender imbalance
- 51:00 Wrap-up

Trustworthy AI is one of today’s most popular buzzwords. But although everyone seems to agree that we want AI to be trustworthy, definitions of trustworthiness are often fuzzy or inadequate. Maybe that shouldn’t be surprising: it’s hard to come up with a single set of standards that add up to “trustworthiness”, and that apply just as well to a Netflix movie recommendation as to a self-driving car.

So maybe trustworthy AI needs to be thought of in a more nuanced way, one that reflects the intricacies of individual AI use cases. If that’s true, then new questions come up: who gets to define trustworthiness, and who bears responsibility when a lack of trustworthiness leads to harms like AI accidents, or undesired biases?

Through that lens, trustworthiness becomes a problem not just for algorithms, but for organizations. And that’s exactly the case that Beena Ammanath makes in her upcoming book, Trustworthy AI, which explores AI trustworthiness from a practical perspective, looking at what concrete steps companies can take to make their in-house AI work safer, better and more reliable. Beena joined me to talk about defining trustworthiness, explainability and robustness in AI, as well as the future of AI regulation and self-regulation on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 1:55 Background and trustworthy AI
- 7:30 Incentives to work on capabilities
- 13:40 Regulation at the level of application domain
- 16:45 Bridging the gap
- 23:30 Level of cognition offloaded to the AI
- 25:45 What is trustworthy AI?
- 34:00 Examples of robustness failures
- 36:45 Team diversity
- 40:15 Smaller companies
- 43:00 Application of best practices
- 46:30 Wrap-up

Until recently, very few people were paying attention to the potential malicious applications of AI. And that made some sense: in an era where AIs were narrow and had to be purpose-built for every application, you’d need an entire research team to develop AI tools for malicious applications. Since it’s more profitable (and safer) for that kind of talent to work in the legal economy, AI didn’t offer much low-hanging fruit for malicious actors.

But today, that’s all changing. As AI becomes more flexible and general, the link between the purpose for which an AI was built and its potential downstream applications has all but disappeared. Large language models can be trained to perform valuable tasks, like supporting writers, translating between languages, or writing better code. But a system that can write an essay can also write a fake news article, or power an army of humanlike text-generating bots.

More than any other moment in the history of AI, the move to scaled, general-purpose foundation models has shown how AI can be a double-edged sword. And now that these models exist, we have to come to terms with them, and figure out how to build societies that remain stable in the face of compelling AI-generated content, and increasingly accessible AI-powered tools with malicious use potential.

That’s why I wanted to speak with Katya Sedova, a former Congressional Fellow and Microsoft alumna who now works at Georgetown University’s Center for Security and Emerging Technology, where she recently co-authored some fascinating work exploring current and likely future malicious uses of AI. If you like this conversation I’d really recommend checking out her team’s latest report, “AI and the Future of Disinformation Campaigns”. Katya joined me to talk about malicious AI-powered chatbots, fake news generation and the future of AI-augmented influence campaigns on this episode of the TDS podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:40 Malicious uses of AI
- 4:30 Last 10 years in the field
- 7:50 Low-hanging fruit of automation
- 14:30 Other analytics functions
- 25:30 Authentic bots
- 30:00 Influences of service businesses
- 36:00 Race to the bottom
- 42:30 Automation of systems
- 50:00 Manufacturing norms
- 52:30 Interdisciplinary conversations
- 54:00 Wrap-up

Imagine, for example, an AI that’s trained to identify cows in images. Ideally, we’d want it to learn to detect cows based on their shape and colour. But what if the cow pictures we put in the training dataset always show cows standing on grass?

In that case, we have a spurious correlation between grass and cows, and if we’re not careful, our AI might learn to become a grass detector rather than a cow detector. Even worse, we might only realize that’s happened once we’ve deployed it in the real world and it runs into a cow that isn’t standing on grass for the first time.

So how do you build AI systems that can learn robust, general concepts that remain valid outside the context of their training data?

That’s the problem of out-of-distribution generalization, and it’s a central part of the research agenda of Irina Rish, a core member of Mila, the Quebec AI Research Institute, and the Canada Excellence Research Chair in Autonomous AI. Irina’s research explores many different strategies that aim to overcome the out-of-distribution problem, from empirical AI scaling efforts to more theoretical work, and she joined me to talk about just that on this episode of the podcast.

Intro music:
- Artist: Ron Gelinas
- Track Title: Daybreak Chill Blend (original mix)
- Link to Track: https://youtu.be/d8Y2sKIgFWc

Chapters:
- 2:00 Research, safety, and generalization
- 8:20 Invariant risk minimization
- 15:00 Importance of scaling
- 21:35 Role of language
- 27:40 AGI and scaling
- 32:30 GPT versus ResNet 50
- 37:00 Potential revolutions in architecture
- 42:30 Inductive bias aspect
- 46:00 New risks
- 49:30 Wrap-up
