AI weekly (10/2020)

My selection of news on AI/ML and Data Science

+++ Kubeflow 1.0 Brings a Production-Ready Machine Learning Toolset to Kubernetes +++ SciPy 1.0: fundamental algorithms for scientific computing in Python +++ GitLab Handbook +++ Largest gift in Berkeley’s history will create a ‘hub’ for advancing data science +++ Machine Learning with Python +++ Pattern Recognition & Machine Learning algorithms +++

Tools and Frameworks

Kubeflow 1.0 Brings a Production-Ready Machine Learning Toolset to Kubernetes. Kubeflow’s goal is to make it easy for machine learning engineers and data scientists to leverage cloud assets (public or on-premises) for ML workloads; it runs on any Kubernetes-conformant cluster. Kubeflow was open-sourced at KubeCon USA in December 2017, and over the last two years the project has grown beyond expectations: there are currently hundreds of contributors from over 30 participating organizations, including Google, Cisco, IBM, Microsoft, Red Hat, Amazon Web Services, and Alibaba. Now, on behalf of the entire community, the first major release, Kubeflow 1.0, has been announced.

Kubeflow 1.0 lets users develop models in Jupyter, build containers and create Kubernetes resources to train those models with a tool like Fairing (Kubeflow’s Python SDK), and finally create and deploy a server for inference with KFServing (built on top of Knative). Kubeflow 1.0 ships with a number of “graduated” applications: the Kubeflow UI, the Jupyter notebook controller and web app, a TensorFlow operator (TFJob) and a PyTorch operator for distributed training, kfctl for deployment and upgrades, and a profile controller and UI for multi-user management. Several more applications are currently in beta and are planned to graduate in a future release: Pipelines for defining complex ML workflows, Metadata for tracking datasets, jobs, and models, Katib for hyperparameter tuning, and additional distributed operators for other frameworks such as XGBoost. Getting started with Kubeflow on a cloud platform takes a single command, with pre-built manifests for GCP, AWS, IBM Cloud, and more.
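
To make the serving step concrete, here is a minimal sketch (my own illustration, not taken from the Kubeflow documentation) of creating a KFServing InferenceService through the official Kubernetes Python client; the namespace, service name, and storageUri are illustrative placeholders:

```python
# Deploy a model server by creating a KFServing InferenceService custom
# resource. Assumes kubectl access to a cluster with Kubeflow installed;
# the name, namespace, and model URI below are placeholders.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "kubeflow"},
    "spec": {
        "default": {
            "predictor": {
                # KFServing pulls the saved model from object storage.
                "sklearn": {"storageUri": "gs://kfserving-samples/models/sklearn/iris"}
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org",
    version="v1alpha2",
    namespace="kubeflow",
    plural="inferenceservices",
    body=inference_service,
)
```

Because KFServing sits on Knative, the resulting model server autoscales with traffic, including scale-to-zero when idle.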

SciPy 1.0: fundamental algorithms for scientific computing in Python. SciPy is an open-source scientific computing library for the Python programming language. It is built on top of NumPy, which provides array data structures and fast numerical routines, and is itself the foundation upon which higher-level scientific libraries such as scikit-learn and scikit-image are built. Since its initial release in 2001, SciPy has become a de facto standard for leveraging scientific algorithms in Python, with over 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. SciPy recently released version 1.0, a milestone that traditionally signals a library’s API being mature enough to be trusted in production pipelines. This version-numbering convention, however, belies the history of a project that has become the standard others follow and has seen extensive adoption in research and industry. This article in Nature Methods reviews the history, architecture, and structure of the package and of the community behind it.
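
As a small illustration of the kind of “fundamental algorithm” the paper refers to, here is a self-contained example (my own, not from the article) that minimizes the classic Rosenbrock test function with scipy.optimize, taking NumPy arrays as input:

```python
# Minimize the Rosenbrock function with BFGS, using SciPy's built-in
# test function and its analytic gradient.
import numpy as np
from scipy import optimize

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])  # starting point as a NumPy array
result = optimize.minimize(optimize.rosen, x0, method="BFGS",
                           jac=optimize.rosen_der)

print(result.x)  # converges to the known minimum at [1, 1, 1, 1, 1]
```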

Business News and Applications

GitLab Handbook. Quote: “The GitLab team handbook is the central repository for how we run the company. Printed, it consists of over 3,000 pages of text. As part of our value of being transparent the handbook is open to the world, and we welcome feedback. Please make a merge request to suggest improvements or add clarifications. Please use issues to ask questions.” A great read; it is hard to believe that all of this is published to the world. Interesting sections include those on communication, engineering, the engineering workflow, and career development (exemplified here for backend engineers).

Largest gift in Berkeley’s history will create a ‘hub’ for advancing data science. UC Berkeley’s Division of Computing, Data Science, and Society will soon have a brick-and-mortar home, thanks to an anonymous $252 million gift to seed the construction of a new “Data Hub”. The donation, which represents the single largest gift in Berkeley’s history, will provide core funding for a sustainable, visually striking facility that will serve as a hub for the diverse array of students and faculty engaged in computing and data science research and teaching, and will anchor Berkeley’s fastest-growing areas of study. Each year, approximately 6,000 undergraduates take a data science course at Berkeley, and the new data science major is the fastest-growing major on campus, approaching the size of computer science itself.

Publications

Machine Learning with Python. The open access journal Information is publishing a special issue “Machine Learning with Python”. One of the articles describes an ML workflow with a focus on explainable AI (XAI). Some highlights from the abstract: “This manuscript outlines a viable approach for training and evaluating machine learning systems for high-stakes, human-centered, or regulated applications using common Python programming tools. The accuracy and intrinsic interpretability of two types of constrained models, monotonic gradient boosting machines and explainable neural networks, a deep learning architecture well-suited for structured data, are assessed on simulated data and publicly available mortgage data. For maximum transparency and the potential generation of personalized adverse action notices, the constrained models are analyzed using post-hoc explanation techniques including plots of partial dependence and individual conditional expectation and with global and local Shapley feature importance. The constrained model predictions are also tested for disparate impact and other types of discrimination using measures with long-standing legal precedents, adverse impact ratio, marginal effect, and standardized mean difference, along with straightforward group fairness measures. By combining interpretable models, post-hoc explanations, and discrimination testing with accessible software tools, this text aims to provide a template workflow for machine learning applications that require high accuracy and interpretability and that mitigate risks of discrimination.”
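
To illustrate the first two ingredients, here is a hedged sketch (my own, not the paper’s code) of a monotonic gradient boosting machine in XGBoost combined with local Shapley feature importance from the shap package; the data and the constraint pattern are invented for the example:

```python
# Train a gradient boosting machine constrained to be monotonic in chosen
# features, then compute per-row Shapley values as a post-hoc explanation.
import numpy as np
import xgboost as xgb
import shap

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# By construction, the outcome rises with feature 0 and falls with feature 1.
y = (2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Monotonically increasing in feature 0, decreasing in feature 1,
# unconstrained in feature 2.
model = xgb.XGBClassifier(monotone_constraints="(1,-1,0)", n_estimators=200)
model.fit(X, y)

# Local Shapley feature importance for every prediction, the kind of output
# that could feed a personalized adverse action notice.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
```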

Tutorials

Pattern Recognition & Machine Learning algorithms. A great collection of Jupyter notebooks with clean, easy-to-follow Python examples for each chapter of Christopher Bishop’s text Pattern Recognition and Machine Learning can be found in this GitHub repo. Alternative implementations in MATLAB can be found here.
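
For a flavor of what the notebooks implement, here is a minimal NumPy sketch (my own, in the spirit of the repo rather than copied from it) of Bayesian linear regression from Chapter 3 of the book, with the hyperparameters α and β assumed known for simplicity:

```python
# Bayesian linear regression with a polynomial basis (PRML, Chapter 3):
# posterior precision S_N^{-1} = alpha*I + beta*Phi^T Phi,
# posterior mean      m_N      = beta * S_N @ Phi^T @ t.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Design matrix with basis functions 1, x, ..., x^M.
M = 5
Phi = np.vander(x, M + 1, increasing=True)

alpha, beta = 2.0, 25.0  # prior precision and noise precision (assumed known)

S_N = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Predictive mean of the fitted curve at new inputs.
x_new = np.linspace(0, 1, 100)
y_pred = np.vander(x_new, M + 1, increasing=True) @ m_N
```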

See also