Source: Why Most Machine Learning Projects Fail to Reach Production - InfoQ

【Original Summary】
Why Most Machine Learning Projects Fail to Reach Production
By Wenjie Zi, reviewed by Arthur Casals
This article summarizes a 2024 QCon San Francisco talk, drawing on the author’s decade of experience in ML across social media, fintech, and productivity tools. It explores why many ML projects never make it to production, focusing on avoidable "bad failures"—projects that drag on without clear goals, perform well offline but never deploy, or are unused post-deployment.
The Scope of the Problem
Industry data shows high ML project failure rates: a 2023 Rexer Analytics survey of over 300 ML practitioners found only 32% of projects reach production, with older studies citing rates as high as 85%. Failure rates vary by sector, with big tech leading in AI adoption and traditional enterprises/startups still navigating implementation. Not all failures are negative—quickly pivoting or scrapping a project after preliminary testing aligns with the "fail fast" innovation principle. This article focuses on preventable, costly failures.
The ML Project Lifecycle
A typical ML project follows an iterative six-step lifecycle:
- Identify a business goal to optimize with ML
- Frame the goal as an ML problem
- Explore and process data for model training
- Train and select top-performing models
- Deploy models and monitor performance
- Use monitoring feedback to refine the system
This cycle differs from traditional software projects due to ML’s inherent uncertainty and data dependence, making clear goal-setting and cross-team alignment critical from the start.
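The iterative nature of the six steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `run_lifecycle`, the `fake_metric` helper, the numeric target, and the iteration cap are all hypothetical and do not come from the talk. The point is that the loop needs a concrete target from step 1 before it can decide when to stop.

```python
# Stage names paraphrase the six-step list above. Everything else here
# (function names, the target value, the made-up metric) is hypothetical.
STAGES = [
    "identify business goal",
    "frame as ML problem",
    "explore and process data",
    "train and select models",
    "deploy and monitor",
    "refine from monitoring feedback",
]


def run_lifecycle(target, measure, max_iterations=5):
    """Repeat the cycle until the monitored metric meets the target.

    Returns (iterations_used, trace). iterations_used is None when the
    target is never met within max_iterations.
    """
    trace = []
    for iteration in range(1, max_iterations + 1):
        # Steps 1-5: goal, framing, data, training, deployment and monitoring.
        trace.extend((iteration, stage) for stage in STAGES[:5])
        if measure(iteration) >= target:
            return iteration, trace
        # Step 6: monitoring feedback triggers another pass through the cycle.
        trace.append((iteration, STAGES[5]))
    return None, trace


def fake_metric(iteration):
    """Made-up monitored metric (in percent) that gains 10 points per pass."""
    return 50 + 10 * iteration


used, trace = run_lifecycle(target=75, measure=fake_metric)
print(f"target met after {used} iteration(s), {len(trace)} stage runs")
```

Without an agreed `target`, the loop has no exit condition other than the iteration cap, which mirrors the projects that drag on without clear goals.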
Five Common Pitfalls (and Solutions)
1. Optimizing the Wrong Problem
Rexer’s survey found only 29% of ML projects have clearly defined objectives at initiation, with 26% of respondents reporting that objectives are rarely clear. Vague business goals lead to costly rework, because late changes require adjusting data, objective functions, and pipelines.
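A small sketch can show why a late change of goal forces rework on the objective function. The data, the two candidate models, and the cost figures below are invented for illustration and are not from the survey or the talk. The same predictions are scored under two objectives: plain accuracy, and a hypothetical business cost in which a missed positive is far more expensive than a false alarm.

```python
def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def business_cost(labels, preds, miss_cost=100, false_alarm_cost=5):
    """Hypothetical cost: missed positives are much pricier than false alarms."""
    cost = 0
    for l, p in zip(labels, preds):
        if l == 1 and p == 0:
            cost += miss_cost
        elif l == 0 and p == 1:
            cost += false_alarm_cost
    return cost


# Invented toy data: 2 positives among 10 cases.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
candidates = {
    "model_a": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never flags anything
    "model_b": [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],  # catches both, 3 false alarms
}

best_by_accuracy = max(candidates, key=lambda n: accuracy(labels, candidates[n]))
best_by_cost = min(candidates, key=lambda n: business_cost(labels, candidates[n]))

print("best by accuracy:", best_by_accuracy)
print("best by business cost:", best_by_cost)
```

Under these assumptions the two objectives select different models, so a team that optimized for accuracy and later learns the real goal is cost reduction has to revisit model selection, and possibly the training data and pipeline behind it.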
【Chinese Translation】
为何多数机器学习项目无法落地投产
作者:Wenjie Zi,审稿人:Arthur Casals
本文总结了2024年旧金山QCon大会的一场演讲内容,参考了作者在社交媒体、金融科技及生产力工具领域积累的十年机器学习从业经验。文章探究了众多机器学习项目无法落地投产的原因,重点聚焦于可避免的“恶性失败”——即那些目标模糊、拖沓无期,离线测试表现优异却始终无法部署,或是部署后无人问津的项目。
问题的波及范围
行业数据显示机器学习项目的失败率居高不下:2023年Rexer Analytics对300余名机器学习从业者开展的调查显示,仅有32%的项目能够落地投产,更早的研究则指出失败率最高可达85%。不同行业的失败率存在差异,大型科技公司在人工智能应用方面处于领先地位,而传统企业和初创公司仍在摸索落地路径。并非所有失败都是负面的——在初步测试后迅速调整方向或终止项目,符合“快速试错”的创新原则。本文重点关注可预防、成本高昂的失败案例。
机器学习项目生命周期
一个典型的机器学习项目遵循迭代式的六步生命周期:
- 确定可通过机器学习优化的业务目标
- 将该目标转化为机器学习问题
- 探索并处理用于模型训练的数据
- 训练并筛选出表现最优的模型
- 部署模型并监控其性能
- 利用监控反馈优化系统
由于机器学习本身存在不确定性且依赖数据,这一周期与传统软件项目有所不同,因此从项目启动阶段就明确目标并实现跨团队协同至关重要。
五大常见陷阱及解决方案
1. 优化错误的问题
Rexer的调查发现,仅有29%的机器学习项目在启动阶段就有清晰明确的目标,另有26%的受访者表示目标很少清晰。模糊的业务目标会导致高昂的返工成本,因为后期变更需要调整数据、目标函数和工作流程。
【Vocabulary】
- productivity /ˌprɒdʌkˈtɪvəti/ 生产力,生产效率
- deploy /dɪˈplɔɪ/ 部署,调度
- post-deployment /ˌpəʊst dɪˈplɔɪmənt/ 部署后
- practitioner /prækˈtɪʃənə(r)/ 从业者,执业者
- cite /saɪt/ 引用,援引
- sector /ˈsektə(r)/ 行业,部门
- adoption /əˈdɒpʃn/ 采用,接纳
- implementation /ˌɪmplɪmenˈteɪʃn/ 实施,执行
- pivot /ˈpɪvət/ (使)转向,调整方向
- scrap /skræp/ 放弃,抛弃
- preliminary /prɪˈlɪmɪnəri/ 初步的,预备的
- principle /ˈprɪnsəpl/ 原则,准则
- iterative /ˈɪtərətɪv/ 迭代的,重复的
- inherent /ɪnˈherənt/ 固有的,内在的
- uncertainty /ʌnˈsɜːtnti/ 不确定性,不可靠性
- dependence /dɪˈpendəns/ 依赖,依靠
- alignment /əˈlaɪnmənt/ 协同,一致
- critical /ˈkrɪtɪkl/ 至关重要的,关键的
- pitfall /ˈpɪtfɔːl/ 陷阱,隐患
- initiation /ɪˌnɪʃiˈeɪʃn/ 启动,开端
- vague /veɪɡ/ 模糊的,不明确的
- rework /ˌriːˈwɜːk/ 返工,重做
- objective /əbˈdʒektɪv/ 目标,目的
- pipeline /ˈpaɪplaɪn/ (计算机)工作流程,管线
【Sentence Translations】
- This article summarizes a 2024 QCon San Francisco talk, drawing on the author’s decade of experience in ML across social media, fintech, and productivity tools. 本文总结了2024年旧金山QCon大会的一场演讲内容,参考了作者在社交媒体、金融科技及生产力工具领域积累的十年机器学习从业经验。
- It explores why many ML projects never make it to production, focusing on avoidable "bad failures"—projects that drag on without clear goals, perform well offline but never deploy, or are unused post-deployment. 文章探究了众多机器学习项目无法落地投产的原因,重点聚焦于可避免的“恶性失败”——即那些目标模糊、拖沓无期,离线测试表现优异却始终无法部署,或是部署后无人问津的项目。
- Industry data shows high ML project failure rates: a 2023 Rexer Analytics survey of over 300 ML practitioners found only 32% of projects reach production, with older studies citing rates as high as 85%. 行业数据显示机器学习项目的失败率居高不下:2023年Rexer Analytics对300余名机器学习从业者开展的调查显示,仅有32%的项目能够落地投产,更早的研究则指出失败率最高可达85%。
- Failure rates vary by sector, with big tech leading in AI adoption and traditional enterprises/startups still navigating implementation. 不同行业的失败率存在差异,大型科技公司在人工智能应用方面处于领先地位,而传统企业和初创公司仍在摸索落地路径。
- Not all failures are negative—quickly pivoting or scrapping a project after preliminary testing aligns with the "fail fast" innovation principle. 并非所有失败都是负面的——在初步测试后迅速调整方向或终止项目,符合“快速试错”的创新原则。
- A typical ML project follows an iterative six-step lifecycle: 一个典型的机器学习项目遵循迭代式的六步生命周期:
- This cycle differs from traditional software projects due to ML’s inherent uncertainty and data dependence, making clear goal-setting and cross-team alignment critical from the start. 由于机器学习本身存在不确定性且依赖数据,这一周期与传统软件项目有所不同,因此从项目启动阶段就明确目标并实现跨团队协同至关重要。
- Rexer’s survey found only 29% of ML projects have clearly defined objectives at initiation, with 26% of respondents reporting that objectives are rarely clear. Rexer的调查发现,仅有29%的机器学习项目在启动阶段就有清晰明确的目标,另有26%的受访者表示目标很少清晰。
- Vague business goals lead to costly rework, because late changes require adjusting data, objective functions, and pipelines. 模糊的业务目标会导致高昂的返工成本,因为后期变更需要调整数据、目标函数和工作流程。