+++ LinkedIn is open sourcing its metadata search and discovery platform DataHub +++ Google Brain and DeepMind researchers attack reinforcement learning efficiency +++ The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence +++ Scaling Knowledge at Airbnb +++
Breakthrough — Or So They Say
Tools and Frameworks
LinkedIn is open sourcing its metadata search and discovery platform DataHub. The trend towards building ML platforms and amassing ever larger data assets naturally raises the question: what is the best method for internal discovery of ML features, models, metrics, datasets, etc.? LinkedIn has developed its own platform, DataHub, for that purpose and open sourced it this week on GitHub. DataHub consists of two distinct stacks: a Modular UI frontend and a Generalized Metadata Architecture backend. At LinkedIn, DataHub currently stores and indexes tens of millions of metadata records that encompass 19 different entities, including datasets, metrics, jobs, charts, AI features, people, and groups. LinkedIn also plans to onboard metadata for machine learning models and labels, experiments, dashboards, microservice APIs, and code in the near future. DataHub provides two forms of metadata ingestion: direct API calls or a Kafka stream.
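To make the two ingestion paths concrete, here is a minimal Python sketch. It does not use DataHub's actual API or schema: `MetadataRecord`, `MetadataStore`, `ingest`, `consume`, and `search` are hypothetical names, the sample records are made up, and a standard-library `queue.Queue` stands in for a Kafka topic.

```python
from dataclasses import dataclass, field
from queue import Queue, Empty


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical metadata record; not DataHub's real schema."""
    entity_type: str   # e.g. "dataset", "metric", "job"
    name: str
    owner: str


@dataclass
class MetadataStore:
    """Toy in-memory index keyed by (entity_type, name)."""
    records: dict = field(default_factory=dict)

    def ingest(self, record: MetadataRecord) -> None:
        # Path 1: a direct, synchronous call (stand-in for an API request).
        self.records[(record.entity_type, record.name)] = record

    def consume(self, stream: Queue) -> int:
        # Path 2: drain an event stream (stand-in for a Kafka consumer loop).
        count = 0
        while True:
            try:
                record = stream.get_nowait()
            except Empty:
                return count
            self.ingest(record)
            count += 1

    def search(self, entity_type: str) -> list:
        # Return the names of all records of one entity type, sorted.
        return sorted(n for (t, n) in self.records if t == entity_type)


store = MetadataStore()
store.ingest(MetadataRecord("dataset", "page_views", "alice"))

events = Queue()
events.put(MetadataRecord("dataset", "clicks", "bob"))
events.put(MetadataRecord("metric", "ctr", "bob"))
consumed = store.consume(events)

print(consumed, store.search("dataset"))
```

Both paths end up in the same index, which is the point of the sketch: producers can push metadata synchronously or publish events and let the store catch up on its own schedule.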
Business News and Applications
Publications
Google Brain and DeepMind researchers attack reinforcement learning efficiency. Reinforcement learning is notoriously data- and resource-intensive. An article in VentureBeat features two recent preprints by Google Brain and DeepMind that prototype more efficient approaches to it.
The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence (arXiv:2002.06177). The latest opinionated publication by Gary Marcus. Quote: “Recent research in artificial intelligence and machine learning has largely emphasized general-purpose learning and ever-larger training sets and more and more compute. In contrast, I propose a hybrid, knowledge-driven, reasoning-based approach, centered around cognitive models, that could provide the substrate for a richer, more robust AI than is currently possible.”
Tutorials
Scaling Knowledge at Airbnb. This article is already a few years old, but it still shows nicely how Airbnb Engineering scales knowledge internally through a ‘Knowledge Repo’. At the core there is a Git repository, to which work is committed. Posts are written in Jupyter notebooks, Rmarkdown files, or plain Markdown, but all files (including query files and other scripts) are committed. GitHub’s pull request system is used for the review process. Finally, there is a Flask web-app that renders the Repo’s contents as an internal blog, organized by time, topic, or contents. On top of these tools, there is a process that combines the code review of engineering with the peer review of academia, wrapped in tools to make it all go at startup speed. As in code reviews, they check for code correctness and for the use of best practices and tools. As in peer reviews, they check for methodological improvements, connections with preexisting work, and precision in expository claims. Posts typically don’t cover every corner of an investigation, but instead provide quick iterations that are correct and transparent about their limitations. The bulk of the work consists of deep-dives aimed at answering non-trivial questions, but other posts are written purely to broaden the base of people who do data analysis, including writeups of a new methodology, examples of a tool or package, and tutorials on SQL and Spark.
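As a rough illustration of the "organized by time or topic" idea, the Python sketch below builds both views from post metadata. It is not the Knowledge Repo's actual code: the `Post` type, the `by_time` and `by_topic` helpers, and the sample titles, dates, and topics are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Post:
    """Hypothetical post metadata, e.g. parsed from a file header."""
    title: str
    published: date
    topics: tuple


def by_time(posts):
    # Newest first, as a blog front page would list them.
    return sorted(posts, key=lambda p: p.published, reverse=True)


def by_topic(posts):
    # Map each topic to its post titles, newest first within a topic.
    index = defaultdict(list)
    for post in by_time(posts):
        for topic in post.topics:
            index[topic].append(post.title)
    return dict(index)


# Made-up sample data.
posts = [
    Post("Intro to Spark", date(2016, 3, 1), ("tutorial", "spark")),
    Post("Pricing deep-dive", date(2016, 5, 9), ("analysis",)),
    Post("SQL window functions", date(2016, 4, 2), ("tutorial", "sql")),
]

print([p.title for p in by_time(posts)])
print(by_topic(posts)["tutorial"])
```

A web front end such as the Flask app described above would then only need to render these two views; the indexing itself is independent of how the posts are stored or reviewed.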