+++ The Surge of Sensationalist COVID-19 AI Research +++ Peer Reviewing Data Science Projects +++ Model-Based Machine Learning +++ ‘Super Duper NLP Repo’ and ‘Big Bad NLP Database’ +++
Breakthrough — Or So They Say
The Surge of Sensationalist COVID-19 AI Research. In this article, Professor Hamid R. Tizhoosh, who leads the University of Waterloo’s KIMIA Lab (Laboratory for Knowledge Inference in Medical Image Analysis), criticizes the flood of hastily published articles on questionable “AI solutions” for COVID-19. Medical imaging is a highly sensitive domain in which curating and accessing a set of labeled images generally requires a long process. Needless to say, the curation has to happen within the walls of a hospital, not just because the experts are there, but also because images must be de-identified to comply with privacy regulations. Tizhoosh highlights that more and more collections of x-ray and, to a lesser extent, CT images scraped from the Internet are appearing and evolving as their creators continue to add images. Given the availability of such datasets on one side and the ubiquity of basic AI knowledge and tools on the other, many AI enthusiasts and start-ups have impulsively begun to develop COVID-19 detection solutions for x-ray images. Papers based on these datasets report no validation, and no radiologist has guided the experiments. Many of these works were hurriedly made public before the dataset creators could even provide a sufficient explanation of their collection process. To compensate for the small data size, AI enthusiasts and start-ups mix the few COVID-19 images with other public datasets, e.g., pneumonia datasets. Tizhoosh urges the AI community to wait for real data from hospitals, obtain ethics clearance, de-identify the images, and work with radiologists to develop solutions for the chest diseases of the future.
Tools and Frameworks
Business News and Applications
Peer Reviewing Data Science Projects. Data science consultant Shay Palachy has a nice post on KDnuggets that follows up on an earlier post on data science project flows, which he structures into four phases: (1) scoping, (2) research, (3) development, and (4) deployment. Shay proposes two different peer review processes for the two phases he feels most require peer scrutiny: the research phase and the model development phase. For Shay, the “research phase” is the part of the process in which one iterates back and forth between data exploration and a review of previous or potentially applicable approaches, in order to map out the space of methodological options for framing and solving the problem at hand. It also involves a scope and KPI validity check, done with the product owner. The “model development phase”, as its name suggests, covers building and validating the model itself. While Shay doesn’t go very deep into how peer review is actually conducted in practice, he supplies detailed checklists for the two phases he focuses on; the model development checklist is based on this original list by Philip Tannor.
Publications
Model-Based Machine Learning. Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. Christopher M. Bishop, researcher at Microsoft and author of a leading textbook on machine learning, describes an alternative methodology, in which a bespoke solution is formulated for each new application, in this article and a book draft (also available as a PDF). The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This framework emerged from the convergence of three key ideas: (1) the adoption of a Bayesian viewpoint, (2) the use of probabilistic graphical models, and (3) the application of fast, deterministic, efficient and approximate inference algorithms. The core idea is that all assumptions about the problem domain are made explicit in the form of a model, understood here as a set of assumptions about the world expressed in a probabilistic graphical format, with all parameters and variables represented as random components.
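To make the idea a bit more concrete, here is a minimal sketch (ours, not taken from Bishop’s article or his modelling language) of the model-based mindset: the assumptions are written down as a small generative model with an explicit prior and likelihood, and inference then follows from that model rather than from an off-the-shelf algorithm. The Beta–Bernoulli example and all names in it are illustrative only.

```python
import numpy as np

# Model assumptions, stated explicitly as a generative story:
#   theta ~ Beta(alpha, beta)    (prior belief about a coin's bias)
#   x_i   ~ Bernoulli(theta)     (each observation is a coin flip)
# Because this toy model is conjugate, the posterior is again a Beta
# distribution and "inference" reduces to updating two counts.

def posterior(observations, alpha=1.0, beta=1.0):
    """Return posterior Beta parameters given a sequence of 0/1 observations."""
    heads = int(np.sum(observations))
    tails = len(observations) - heads
    return alpha + heads, beta + tails

# Observed data: 8 heads, 2 tails.
data = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
a_post, b_post = posterior(data)

# Posterior mean of theta; with a uniform prior this is (8 + 1) / (10 + 2).
print("posterior mean:", a_post / (a_post + b_post))
```

In a real model-based workflow the model would be richer and non-conjugate, which is exactly where the fast, deterministic approximate inference algorithms mentioned above come in.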
Tutorials
‘Super Duper NLP Repo’ and ‘Big Bad NLP Database’. A machine learning project of any kind has two major components: the data, and the algorithms that learn something from that data. The folks at Quantum Stat, a NY-based NLP / AI startup, have recently released two resources that can help anyone interested in NLP pick the right components for their interests and tasks. (1) The Big Bad NLP Database is a collection of nearly 300 well-organized NLP datasets, ready to go for common NLP tasks and needs such as document classification, question answering, automated image captioning, dialog, clustering, intent classification, language modeling, machine translation, text corpora, and more. (2) The Super Duper NLP Repo is a collection of Colab notebooks covering a wide array of NLP task implementations, each available to launch in Google Colab with a single click. Each notebook entry in the repo includes a general description, the notebook’s creator, the task being considered (text classification, text generation, question answering), and the type of model being employed (BERT, GPT-2, convolutional neural network, etc.). These notebooks come from across the web, contributed by both individual experts and organizations.
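For readers who want a feel for what a typical notebook in the repo boils down to, here is a minimal sketch of a text classification task with a pretrained transformer. It is not taken from any specific Quantum Stat notebook; it simply assumes the Hugging Face transformers library (plus PyTorch or TensorFlow) is installed.

```python
# Minimal text-classification example in the spirit of the repo's notebooks;
# requires `pip install transformers` and a backend such as PyTorch.
from transformers import pipeline

# Downloads a default pretrained sentiment model on first use.
classifier = pipeline("sentiment-analysis")

texts = [
    "The new NLP dataset collection is incredibly useful.",
    "The notebook crashed halfway through training.",
]
for text, result in zip(texts, classifier(texts)):
    print(f"{result['label']:>8}  {result['score']:.3f}  {text}")
```

The Colab notebooks in the repo go further, covering fine-tuning and other tasks such as text generation and question answering, but most follow this same pretrained-model-plus-task pattern.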