AI weekly (43/2019)

My selection of news on AI/ML and Data Science

+++ Quantum Supremacy Using a Programmable Superconducting Processor +++ Understanding searches better than ever before +++ The State of Machine Learning Frameworks in 2019 +++ Detectron2: A PyTorch-based modular object detection library +++ Welcome to Streamlit +++ bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond) +++ Deep Q-Network for Angry Birds +++ Machines Beat Humans on a Reading Test. But Do They Understand? +++ Kepler.GL & Jupyter Notebooks: Geospatial Data Visualization with Uber’s opensource Kepler.GL +++ SQLZoo – A Great Interactive SQL Tutorial +++ An overview of time series forecasting models +++ Python Data Preprocessing Using Pandas DataFrame, Spark DataFrame, and Koalas DataFrame +++

Breakthrough — Or So They Say

Quantum Supremacy Using a Programmable Superconducting Processor. Google’s paper has finally been officially published in Nature after having been leaked earlier this year. Expository articles have been published by both Google and Nature. The full Nature article can be freely downloaded from the journal’s website. Perspectives on the result are provided, for example, by IEEE Spectrum, by John Preskill in Quanta Magazine, and by Scott Aaronson.

Understanding searches better than ever before. Google doesn’t write about its work on the actual Search product that often. Now Pandu Nayak, Google Fellow and Vice President of Search, has published a blog post announcing one of the biggest updates to its search algorithm in recent years. By using BERT (“Bidirectional Encoder Representations from Transformers”) to better understand the intentions behind queries, Google says it can now offer more relevant results for about one in 10 searches in the U.S. in English (with support for other languages and locales coming later). In the world of search updates, where algorithm changes are often far more subtle, an update that affects 10% of searches is a big deal. Google notes that this update will work best for longer, more conversational queries because it’s easier to interpret a full sentence than a sequence of keywords.

Tools and Frameworks

The State of Machine Learning Frameworks in 2019. An article by Horace He in The Gradient analyzes the relative popularity of machine learning (more precisely: deep learning) frameworks. In 2019, the war for DL frameworks has two remaining main contenders: PyTorch and TensorFlow. The analysis suggests that researchers are abandoning TensorFlow and flocking to PyTorch in droves, with research papers using PyTorch at least twice as often as TensorFlow. Likely reasons are PyTorch’s ease of use, its pythonic style, and its good integration with the Python ecosystem. Meanwhile in industry, TensorFlow is currently the platform of choice, but that may not be true for long. Potential advantages of TensorFlow mentioned are its graph format, the fact that the execution engine natively has no need for Python, and the fact that TensorFlow Lite and TensorFlow Serving address mobile and serving considerations respectively.

Detectron2: A PyTorch-based modular object detection library. Since its release in 2018, the Detectron object detection platform has become one of Facebook AI Research (FAIR)’s most widely adopted open source projects. To build on and advance this project, they are now sharing the second generation of the library. The platform is now no longer implemented in Caffe2, but in PyTorch. With a new, more modular design, Detectron2 is flexible and extensible, and able to provide fast training on single or multiple GPU servers. Detectron2 includes high-quality implementations of state-of-the-art object detection algorithms, including DensePose, panoptic feature pyramid networks, and numerous variants of the pioneering Mask R-CNN model family also developed by Facebook AI Research.

Welcome to Streamlit. Streamlit is an open-source Python library that makes it easy to build beautiful apps for machine learning. Install Streamlit, import it, write some code, and run your script. Streamlit watches for changes on each save and updates automatically, visualizing your output while you’re coding. Code runs from top to bottom, always from a clean state, and with no need for callbacks. It’s a simple and powerful app model that lets you build rich UIs incredibly quickly.

Business News and Applications

Publications

bamlss: A Lego Toolbox for Flexible Bayesian Regression (and Beyond) (arXiv:1909.11784). The R package bamlss is introduced for Bayesian additive models for location, scale, and shape (and beyond). At the core of the package are algorithms for highly-efficient Bayesian estimation and inference that can be applied to generalized additive models (GAMs) or generalized additive models for location, scale, and shape (GAMLSS), also known as distributional regression. However, its building blocks are designed as “Lego bricks” encompassing various distributions (exponential family, Cox, joint models, …), regression terms (linear, splines, random effects, tensor products, spatial fields, …), and estimators (MCMC, backfitting, gradient boosting, lasso, …). It is demonstrated how these can be easily recombined to make classical models more flexible or create new custom models for specific modeling challenges.

Deep Q-Network for Angry Birds (arXiv:1910.01806). Researchers at Charles University in Prague detail an AI system—DQ-Birds—trained using Deep Q-learning, a technique pioneered by Alphabet’s DeepMind that instructs an agent which action to take under what circumstances using a random sample of prior actions. Quote: “Angry Birds is a popular video game in which the player is provided with a sequence of birds to shoot from a slingshot. The task of the game is to destroy all green pigs with maximum possible score. Angry Birds appears to be a difficult task to solve for artificially intelligent agents due to the sequential decision-making, non-deterministic game environment, enormous state and action spaces and requirement to differentiate between multiple birds, their abilities and optimum tapping times. We describe the application of Deep Reinforcement learning by implementing Double Dueling Deep Q-network to play Angry Birds game. One of our main goals was to build an agent that is able to compete with previous participants and humans on the first 21 levels. In order to do so, we have collected a dataset of game frames that we used to train our agent on. We present different approaches and settings for DQN agent. We evaluate our agent using results of the previous participants of AIBirds competition, results of volunteer human players and present the results of AIBirds 2018 competition.”
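
The three ingredients named above (sampling past experience at random, the dueling architecture, and the double Q-learning target) can be sketched in a few lines of plain Python. This is only an illustration of the general techniques, not the DQ-Birds implementation: the function and class names are hypothetical, and the Q-values are passed in as plain lists instead of coming from a neural network.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done); states are opaque placeholders here
Transition = Tuple[object, int, float, object, bool]


class ReplayBuffer:
    """Stores past transitions and hands back uniformly random mini-batches."""

    def __init__(self, capacity: int) -> None:
        self.memory: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.memory.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        return random.sample(list(self.memory), batch_size)


def dueling_q_values(state_value: float, advantages: List[float]) -> List[float]:
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def double_dqn_target(
    reward: float,
    done: bool,
    gamma: float,
    next_q_online: List[float],
    next_q_target: List[float],
) -> float:
    """Double DQN: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

Separating action selection (online network) from action evaluation (target network) is what distinguishes the double variant from plain DQN, which would take the maximum of the target network's own estimates.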

Benchmarking simple machine learning models with feature extraction against modern black-box methods. A blog post by Martin Dittgen based on his MSc thesis at the University of Oxford. Quote: “Our results show that, for classification problems, the simple methods perform almost as well as the more complex methods. This is true for both binary classification, regarding the metric area under the ROC curve, and for multiclass classification with regards to the weighted F1 Score metric. In our regression problems, the modern methods outperform the simpler methods and seem superior to the more interpretable methods in terms of their respective R-squared or explained variance.”
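
For readers unfamiliar with the weighted F1 score mentioned in the quote, here is a minimal sketch of how it is commonly defined: the per-class F1 scores are averaged with weights proportional to each class's share of the true labels. The function name and the toy labels are made up for illustration and are not taken from the thesis.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Toy example with three classes (labels are arbitrary)
print(weighted_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))
```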

Machines Beat Humans on a Reading Test. But Do They Understand? In the fall of 2017, Sam Bowman, a computational linguist at New York University, figured that computers still weren’t very good at understanding the written word. In an April 2018 paper coauthored with collaborators from the University of Washington and DeepMind, Bowman introduced a battery of nine reading-comprehension tasks for computers called GLUE (General Language Understanding Evaluation). The machines bombed. Even state-of-the-art neural networks scored no higher than 69 out of 100 across all nine tasks: a D-plus, in letter grade terms. But then, in October of 2018, Google introduced a new method nicknamed BERT (Bidirectional Encoder Representations from Transformers). It produced a GLUE score of 80.5. On this brand-new benchmark designed to measure machines’ real understanding of natural language (or to expose their lack thereof), the machines had jumped from a D-plus to a B-minus in just six months. But is AI actually starting to understand our language, or is it just getting better at gaming our systems? Bowman points out that it’s hard to know how we would ever be fully convinced that a neural network achieves anything like real understanding. More background can be found in an article by John Pavlus in Quanta Magazine.

Tutorials

Kepler.GL & Jupyter Notebooks: Geospatial Data Visualization with Uber’s opensource Kepler.GL. kepler.gl is a web-based visualisation tool for large geospatial datasets built on top of deck.gl. Uber open-sourced it last year, and its functionality is impressive. Now, thanks to Uber’s visualization team, kepler.gl can be integrated into Jupyter as a widget. It loads kepler.gl inside a notebook cell, allowing users to quickly plot maps with simple Python commands and interact with the UI to customize the visualization (Figure 1). It provides a seamless analysis workflow, combining data querying, transformation, analysis, and visualization, all inside Jupyter Notebook. This blog post on Towards Data Science provides a nice tutorial; more details can be found in the user guide.

SQLZoo — A Great Interactive SQL Tutorial. SQLZoo is a free, “learning-by-doing” SQL course that has you punching out queries and wracking your brain for the solutions from the very beginning. It’s fun (if you like puzzles), but definitely also teaches business-relevant SQL skills.
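
To give a flavour of the kind of query such exercises build up to, here is a small self-contained sketch using Python's built-in sqlite3 module. The table name, column names, and rows are made-up placeholders for illustration and are not taken from SQLZoo.

```python
import sqlite3
from typing import List, Tuple


def population_by_continent() -> List[Tuple[str, int, int]]:
    """Build a toy in-memory table and aggregate it per continent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE country (name TEXT, continent TEXT, population_m INTEGER)"
    )
    # Placeholder rows; population_m is an invented figure in millions
    conn.executemany(
        "INSERT INTO country VALUES (?, ?, ?)",
        [
            ("Country A", "Europe", 80),
            ("Country B", "Europe", 70),
            ("Country C", "Asia", 120),
        ],
    )
    rows = conn.execute(
        """
        SELECT continent, COUNT(*) AS countries, SUM(population_m) AS total_m
        FROM country
        GROUP BY continent
        ORDER BY continent
        """
    ).fetchall()
    conn.close()
    return rows


print(population_by_continent())
```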

An overview of time series forecasting models. An article by David Burbas on Towards Data Science gives a compact overview of time series forecasting, describing 10 models and how to apply them to predict the evolution of an industrial production index.
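
The ten models are not listed here, so the following is only an assumption about the kind of baseline such overviews typically start from: a naive forecast and simple exponential smoothing, written in plain Python. The function names and the toy series are hypothetical and are not taken from the article.

```python
from typing import Sequence


def naive_forecast(series: Sequence[float]) -> float:
    """Naive baseline: the next value is predicted to equal the last observation."""
    return series[-1]


def simple_exponential_smoothing(series: Sequence[float], alpha: float) -> float:
    """One-step-ahead forecast from simple exponential smoothing.

    The level starts at the first observation and is updated as
    level = alpha * y + (1 - alpha) * level for each later observation y.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


toy_series = [10.0, 20.0, 30.0]  # invented values, not a real production index
print(naive_forecast(toy_series))
print(simple_exponential_smoothing(toy_series, alpha=0.5))
```

A smaller alpha gives more weight to older observations and produces a smoother forecast; alpha equal to 1 reduces to the naive forecast.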

bamlss: A Lego Toolbox for Flexible Bayesian Regression. A tutorial on R-bloggers covering the preprint arXiv:1909.11784 (see above). bamlss provides modular R tools for Bayesian regression, from classic MCMC-based GLMs and GAMs to distributional models using the lasso or gradient boosting.

Python Data Preprocessing Using Pandas DataFrame, Spark DataFrame, and Koalas DataFrame. One of the major limitations of Pandas is that it was designed for small datasets that can be handled on a single machine, and thus it does not scale well to big data. In contrast, Apache Spark was designed for big data, but it has a very different API and lacks much of the easy-to-use functionality that Pandas offers for data wrangling and visualization. Recently, a new open source framework, Koalas, was announced that bridges the gap between the Pandas DataFrame and the Spark DataFrame by augmenting PySpark’s DataFrame API to make it compatible with the pandas DataFrame API. In this post, a public dataset is used to evaluate and compare the basic functionality of Pandas, Spark, and Koalas DataFrames in typical data preprocessing steps for machine learning.
