+++ Improved protein structure prediction using potentials from deep learning +++ Microsoft open sources BERT optimizations in ONNX runtime +++ Introducing the Hopsworks Feature Store – the Data Warehouse for Machine Learning +++ Why Google thinks we need to regulate AI +++ A Look Back on Baidu’s AI Innovations in 2019 +++ Kernel-Based Ensemble Learning in Python +++ An introduction to audio data analysis using Python +++ Spark in Docker in Kubernetes: A practical approach for scalable NLP +++
Breakthrough — Or So They Say
Improved protein structure prediction using potentials from deep learning. Proteins are large, complex molecules essential to all of life. Nearly every function that our body performs—contracting muscles, sensing light, or turning food into energy—relies on proteins and how they move and change. What any given protein can do depends on its unique 3D structure. Protein structure prediction aims to determine the 3D shape of a protein from its amino acid sequence. This problem is of fundamental importance, as the structure of a protein largely determines its function; however, protein structures can be difficult to determine experimentally. A study by DeepMind, published in Nature and featured on the DeepMind blog, now introduces AlphaFold, which generates 3D models of proteins that are far more accurate than any that have come before—marking significant progress on one of the core challenges in biology.
Tools and Frameworks
Microsoft open sources BERT optimizations in ONNX runtime. Microsoft announced in a blog post this week that it has integrated an optimized implementation of BERT (Bidirectional Encoder Representations from Transformers) with the open source ONNX Runtime. Developers can take advantage of this implementation for scalable inferencing of BERT at an affordable cost. In 2018, Google open sourced BERT as a new technique for pre-training NLP models, revolutionizing the conversational user experience domain by empowering developers to train their own state-of-the-art question answering systems and a variety of other natural language-based models. Recently, Microsoft shared how Bing improved BERT inference on GPU for its real-time service needs, serving more than one million BERT inferences per second within Bing's latency limits. An enhanced version, working on both GPU and CPU, has now been open sourced into the ONNX Runtime on GitHub. The Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models, making it possible to exchange models between frameworks such as TensorFlow, PyTorch or MXNet. It lives on GitHub and comes with tutorials and a model zoo.
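For a sense of what serving such a model looks like, here is a minimal Python sketch of BERT inference through ONNX Runtime; the model file name, input names, and shapes are illustrative assumptions rather than details from Microsoft's announcement:

    # Minimal ONNX Runtime inference sketch. "bert.onnx" and the input
    # names are assumptions for illustration; models exported from
    # TensorFlow or PyTorch may use different names.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("bert.onnx")  # load the exported BERT model

    # Dummy batch: one sequence of length 128 (real token ids would come
    # from a BERT tokenizer)
    inputs = {
        "input_ids":      np.zeros((1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
        "token_type_ids": np.zeros((1, 128), dtype=np.int64),
    }
    outputs = session.run(None, inputs)  # None = return all model outputs
    print([o.shape for o in outputs])

The same script runs unchanged on CPU or GPU, depending on which ONNX Runtime execution provider is installed.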
Introducing the Hopsworks Feature Store – the Data Warehouse for Machine Learning. Logical Clocks AB are announcing the release of a Feature Store as part of Hopsworks version 0.8.0. The Feature Store is a central vault for documented, curated, and access-controlled features. In-house Feature Stores are already successfully in production at companies such as Uber, LinkedIn, Airbnb, and Comcast. Now, for the first time, a Feature Store is available as open source in an enterprise data platform, Hopsworks. The Feature Store solves the data access and feature management problem for Data Science by removing the need for Data Scientists to constantly re-implement feature pipelines for collecting and transforming data to feed their machine learning models. Instead, Data Scientists can select features from the Feature Store to generate clean training data that can then be consumed directly by machine learning models. Hopsworks' Feature Store builds on Apache Spark and Apache Hive to enable it to scale to massive data volumes.
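The access pattern this enables looks roughly like the following Python sketch using the hops library inside a Hopsworks job; the feature names are made up, and the exact calls may differ between Hopsworks versions:

    # Illustrative sketch of the Feature Store access pattern described
    # above. Feature names are hypothetical; the API may vary by version.
    from hops import featurestore

    # Select curated features instead of re-implementing the pipelines
    # that compute them.
    train_df = featurestore.get_features(
        ["customer_age", "avg_basket_value", "days_since_last_order"]
    )

    # train_df is a Spark DataFrame and can feed a model directly
    train_df.show(5)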
Business News and Applications
Why Google thinks we need to regulate AI. Google and Alphabet CEO Sundar Pichai has laid out his case for greater regulation of AI. In an op-ed for the Financial Times, he highlighted the perils of “nefarious uses of facial recognition” as well as deepfakes. His message to policymakers: “Sensible regulation must also take a proportionate approach” to artificial intelligence, “balancing potential harms with social opportunities. There is no question in my mind that artificial intelligence needs to be regulated. […] The question is how best to approach this.” Currently, US and EU plans for AI regulation seem to be diverging. While the White House is advocating for light-touch regulation that avoids “overreach” in order to encourage innovation, the EU is considering more direct intervention, such as a five-year ban on facial recognition. Pichai called on lawmakers to get on the same page. “International alignment will be critical to making global standards work,” he wrote. “To get there, we need agreement on core values.” At the same time, the pitch downplays any negatives that might cloud the greater good that Pichai implies AI will unlock. And, not surprisingly, there is no hint of the concentration of monopoly power that AI appears to be very good at supporting.
A Look Back on Baidu’s AI Innovations in 2019. Baidu highlights some of their key achievements in 2019 in a post on their research blog. Baidu ranked No. 1 in the number of AI-related patent applications in China for the second consecutive year, filing a total of 5,712 patents as of October 2019. Deep learning (1,429), natural language processing (938), and speech recognition (933) account for almost 60 percent of those patent applications. Among the technologies and applications they highlight are their NLP framework Ernie, hardware accelerators for AI workloads, their presence in quantum computing, visual perception modules for autonomous vehicles, and their open-sourced deep learning platform PaddlePaddle.
Publications
Kernel-Based Ensemble Learning in Python. The open access journal Information is publishing a special issue “Machine Learning with Python”. One of the articles features a new kernel method. Here’s an extract from the abstract: “We propose a new supervised learning algorithm for classification and regression problems where two or more preliminary predictors are available. We introduce KernelCobra, a non-linear learning strategy for combining an arbitrary number of initial predictors. KernelCobra builds on the COBRA algorithm, which combined estimators based on a notion of proximity of predictions on the training data. While the COBRA algorithm used a binary threshold to declare which training data were close and to be used, we generalise this idea by using a kernel to better encapsulate the proximity information. Such a smoothing kernel provides more representative weights to each of the training points which are used to build the aggregate and final predictor, and KernelCobra systematically outperforms the COBRA algorithm. While COBRA is intended for regression, KernelCobra deals with classification and regression. Numerical experiments were undertaken to assess the performance (in terms of pure prediction and computational complexity) of KernelCobra on real-life and synthetic datasets.”
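To make the idea concrete, here is a small NumPy sketch of kernel-weighted aggregation in the spirit of KernelCobra (illustrative only, not the authors' implementation): each training point receives a weight from a Gaussian kernel applied to the distance between the preliminary predictors' outputs at the query point and their outputs at that training point, and the final prediction is the weighted average of the training responses.

    # Kernel-weighted aggregation in the spirit of KernelCobra.
    # Illustrative sketch, not the authors' implementation.
    import numpy as np

    def kernel_cobra_predict(preds_train, y_train, preds_query, bandwidth=1.0):
        """preds_train: (n_train, m) outputs of m preliminary predictors on
        the training points; preds_query: (m,) their outputs at the query
        point; y_train: (n_train,) training responses."""
        # Proximity of predictions, as in COBRA, but smoothed by a kernel
        # instead of a binary threshold.
        dists = np.linalg.norm(preds_train - preds_query, axis=1)
        weights = np.exp(-(dists ** 2) / (2 * bandwidth ** 2))
        return weights @ y_train / weights.sum()

    # Toy example: two preliminary predictors, five training points
    preds_train = np.array([[1.0, 1.2], [2.0, 1.9], [3.1, 3.0],
                            [0.9, 1.1], [2.2, 2.1]])
    y_train = np.array([1.0, 2.0, 3.0, 1.0, 2.0])
    print(kernel_cobra_predict(preds_train, y_train, np.array([2.0, 2.0])))

Setting the bandwidth very small recovers behaviour close to COBRA's hard threshold, since only training points with nearly identical predictions retain non-negligible weight.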
Tutorials
An introduction to audio data analysis using Python. This article on Towards Data Science provides a step-by-step guide to getting started with audio data processing, or sound analysis, along with Python code. Topics covered are reading audio files, Fourier and Fast Fourier transforms, spectrograms and, finally, performing speech recognition using spectrogram features.
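As a taste of that workflow, the snippet below reads a WAV file and computes a spectrum and a spectrogram with NumPy and SciPy; the file name is a placeholder, and the article itself may use different libraries:

    # Sketch of the basic audio pipeline: load a waveform, inspect its
    # spectrum, and compute a spectrogram. "speech.wav" is a placeholder.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("speech.wav")  # sampling rate (Hz) and samples
    if samples.ndim > 1:                        # mix stereo down to mono
        samples = samples.mean(axis=1)

    # Fast Fourier transform: which frequencies are present overall
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)

    # Spectrogram: how the frequency content evolves over time
    f, t, Sxx = spectrogram(samples, fs=rate)
    print(freqs[np.argmax(spectrum)], Sxx.shape)  # dominant frequency, (freq, time) bins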
Spark in Docker in Kubernetes: A practical approach for scalable NLP. The processing of written language can be very complex and can take a long time without a scalable architecture. Instead of going for ever faster processors, it is better to choose an architecture which can distribute the computing load over several machines. This article on Towards Data Science, as the title says, leverages Spark in Docker on Kubernetes, deployed on the Google Cloud Platform. However, with slight alterations the code should also run on a local machine or a cluster of machines.
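The core distribution idea can be sketched in a few lines of PySpark; the master URL and input path below are placeholders, and the article's actual setup adds Docker images and a Kubernetes deployment on top:

    # Minimal PySpark sketch of distributing a text-processing step across
    # a cluster. The master URL and input path are placeholders; on
    # Kubernetes the master would be of the form k8s://https://<api-server>.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer

    spark = (SparkSession.builder
             .appName("scalable-nlp")
             .master("local[*]")  # swap for a k8s:// master on a real cluster
             .getOrCreate())

    docs = spark.read.text("texts/*.txt").withColumnRenamed("value", "text")

    # Tokenization runs in parallel on the executors, not on the driver
    tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
    tokens.select("words").show(3, truncate=60)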