+++ The 9 components in a real-world Machine Learning system +++ Data Science: Reality Doesn’t Meet Expectations +++ Gelman’s Bayesian Data Analysis in 3rd Edition - For Free! +++
Breakthrough — Or So They Say
The 9 components in a real-world Machine Learning system. On LinkedIn, Louis Dorard has written an interesting two-article mini-series on ML platforms. His first article, An overview of ML development platforms, gives a fairly complete overview of platforms such as AWS SageMaker, Databricks, and H2O, along with several others I hadn’t heard of before. The second article, which is even more interesting, describes a nine-element architecture for ML systems with the following components: Ground-truth Collector, Data Labeller, Evaluator, Performance Monitor, Featurizer, Orchestrator, Model Builder, Model Server, and Front-end.
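To make the component list easier to use as a checklist, here is a minimal Python sketch. It only encodes the nine names as listed above. The `Component` enum and the `missing_components` helper are hypothetical names introduced for illustration; they are not taken from Dorard's articles, and the sketch says nothing about how the components interact.

```python
from enum import Enum


class Component(Enum):
    """The nine components, in the order the list above gives them."""
    GROUND_TRUTH_COLLECTOR = "Ground-truth Collector"
    DATA_LABELLER = "Data Labeller"
    EVALUATOR = "Evaluator"
    PERFORMANCE_MONITOR = "Performance Monitor"
    FEATURIZER = "Featurizer"
    ORCHESTRATOR = "Orchestrator"
    MODEL_BUILDER = "Model Builder"
    MODEL_SERVER = "Model Server"
    FRONT_END = "Front-end"


def missing_components(implemented):
    """Return the components not yet covered, preserving list order.

    Hypothetical helper: `implemented` is any iterable of Component members.
    """
    done = set(implemented)
    return [c for c in Component if c not in done]


if __name__ == "__main__":
    # Example: a system that so far only trains and serves a model.
    covered = [Component.MODEL_BUILDER, Component.MODEL_SERVER]
    for component in missing_components(covered):
        print("not yet covered:", component.value)
```

A checklist like this might be handy when comparing the platforms from the first article against the architecture from the second, assuming you decide yourself which components each platform covers.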
Tools and Frameworks
Business News and Applications
Data Science: Reality Doesn’t Meet Expectations. For everyone working in an industry data science job, the title of this article is probably not much of a surprise. However, I still liked its spot-on description of some of the key issues:
(1) Data science can’t always be built to specs. In an exploratory analysis, you or someone else may have questions about the data, and you simply want to answer them. However, the data needed to answer those questions may not exist! If the data does exist, it’s likely “dirty”: undocumented, tough to find, or factually inaccurate. It’ll be tough to work with! You could spend hours or days attempting to answer a single question, only to discover that you can’t sufficiently answer it for a stakeholder. In machine learning, you may be asked to optimize some process or experience for consumers. However, there’s uncertainty about how much, if at all, the experience can be improved! Since many ML projects may not be built to the expectations of teams, most projects likely don’t make it into production.
(2) Your impact is tough to measure: data doesn’t always translate to value. Most organizations make the majority of their decisions on intuition that stems from past readings and personal experiences, not from a Data Scientist’s analyses. As a Data Scientist in the org, are you essential to the business? Probably not. Most people don’t put a price or value on analyses. The business could go on for a while and survive without you. Sales will still be made, features will still get built, customer support will handle customer concerns, etc. Did you save the company 10 million dollars through your analysis for the sales team, which led them to avoid a huge and costly new workflow, or did the sales team save the 10 million dollars? Truthfully, will anyone even value the 10 million dollars if it was simply “saved” and never spent?
(3) Data & infrastructure have serious quality problems.
With regard to the quality of data on the job, I’d often compare it to a garbage bag that has ripped and spewed its contents all over the ground, after which your partner asks you to find a beautiful earring that was accidentally inside. Cleaning data will likely become the majority of your work. In 2016, a survey distributed to experienced Data Scientists by the popular ML-focused company CrowdFlower claimed “3 out of every 5 data scientists we surveyed actually spend the most time cleaning and organizing data”. The Data Scientist, in many cases, should be called the Data Janitor. In addition to “dirty data”, another major challenge is handling data with poor infrastructure. Imagine a busy highway with lots of potholes, toll booths, and traffic. Your job is to somehow navigate these treacherous conditions to supply data insights at the end. You may be faced with a database that isn’t optimized for your queries, or be unable to identify the source of truth in the data through its data lineage. You may wait days or weeks to get access to a database, or be stuck with poor infrastructure because people are afraid to change it for fear of breaking everything! Sound familiar?
Publications
Gelman’s Bayesian Data Analysis in 3rd Edition - For Free! The latest edition of Bayesian Data Analysis is now free to download for non-commercial purposes. Along with a download link, the landing page includes links to related course materials, demos, notes, and software, including extensive code examples in R and Python (Jupyter notebooks). This highly acclaimed text won the 2016 DeGroot Prize from the International Society for Bayesian Analysis.