+++ Five Interesting Data Engineering Projects +++ A new model and dataset for long-range memory +++ Building a Backtesting Service to Measure Model Performance at Uber-scale +++
Tools and Frameworks
Five Interesting Data Engineering Projects. A nice post on Medium that highlights five projects every data engineer should know about: (1) dbt, a development environment that leverages SQL and YAML to create data transformations that are modular, documented, and version-controlled. (2) Prefect is an up-and-coming challenger to Airflow that arguably has a cleaner and more intuitive API. (3) Dask is a flexible library for parallel computing in Python, providing advanced parallelism for analytics and enabling performance at scale for tools such as NumPy, Pandas, and scikit-learn (see the short sketch after this paragraph). (4) DVC provides version control for ML models and datasets: all workflow versions are tracked, along with the associated data artifacts, models, and metrics. (5) Great Expectations is a really nice Python library that lets you declare rules to which you expect certain datasets to conform, and validate them whenever you produce or consume those datasets.
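To make the Dask item concrete, here is a minimal sketch of how it mirrors the Pandas API while parallelising the work; the file pattern and column names are hypothetical:

    import dask.dataframe as dd

    # A glob of hypothetical daily CSV files becomes one lazy, partitioned dataframe
    # instead of being loaded into memory all at once.
    events = dd.read_csv("events-2020-*.csv")

    # Familiar Pandas-style operations only build a task graph...
    mean_duration = events.groupby("city")["trip_duration"].mean()

    # ...and .compute() executes that graph in parallel across the partitions.
    print(mean_duration.compute())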
Publications
A new model and dataset for long-range memory. In their latest blog post, DeepMind introduces a new long-range memory model, the Compressive Transformer (arXiv:1911.05507), alongside a new benchmark for book-level language modelling, PG19. The post gives a brief history of memory in deep learning, remarks on modelling natural language and the relevant benchmark datasets, and finally introduces the Compressive Transformer.
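The core idea is that activations which would fall out of a Transformer-XL-style memory are compressed into a second, coarser memory rather than discarded. The sketch below is not DeepMind's implementation; the layer size, memory lengths, and the strided-convolution compression function (one of the options discussed in the paper) are illustrative assumptions:

    import torch
    import torch.nn as nn

    d_model, mem_len, cmem_len, c = 64, 128, 64, 3  # sizes chosen for illustration only
    compress = nn.Conv1d(d_model, d_model, kernel_size=c, stride=c)

    def update_memories(mem, cmem, new_hidden):
        # mem, cmem, new_hidden: (batch, seq_len, d_model) activation tensors.
        mem = torch.cat([mem, new_hidden], dim=1)                 # append newest activations (FIFO)
        if mem.size(1) > mem_len:
            old, mem = mem[:, :-mem_len], mem[:, -mem_len:]       # oldest activations fall out...
            old = compress(old.transpose(1, 2)).transpose(1, 2)   # ...and c slots become 1 compressed slot
            cmem = torch.cat([cmem, old], dim=1)[:, -cmem_len:]   # compressed memory is itself bounded
        return mem, cmem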
Tutorials
Building a Backtesting Service to Measure Model Performance at Uber-scale. Uber has developed its own backtesting service for measuring forecast model error rates. The service benefits them in two ways: it builds trust with internal business users, and it empowers data scientists to make targeted improvements to forecast predictions. It supports all Python-based time-series forecasting models, including multivariate models, and runs in a distributed system, allowing multiple models (>10), many backtesting windows (>20), and models for different cities (>200) to run simultaneously. For the purposes of the service, Uber engineers chose two primary data-splitting mechanisms: backtesting with an expanding window and backtesting with a sliding window (a sketch of both follows below). See also their earlier post on Omphalos, a system that the new service now supersedes. Their blog post gives some insight into the architecture and a typical backtesting workflow at Uber.
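A minimal sketch (not Uber's code) of the two split mechanisms, yielding (train, test) index ranges over a series of n observations; all parameter names are illustrative:

    def expanding_window(n, initial_train, horizon, step):
        # The training window grows each iteration; the test window slides forward.
        end = initial_train
        while end + horizon <= n:
            yield range(0, end), range(end, end + horizon)
            end += step

    def sliding_window(n, train_size, horizon, step):
        # The training window keeps a fixed size and slides along with the test window.
        start = 0
        while start + train_size + horizon <= n:
            yield range(start, start + train_size), range(start + train_size, start + train_size + horizon)
            start += step

    # Example: 12 observations, a forecast horizon of 2, advancing 2 steps at a time.
    for train, test in expanding_window(n=12, initial_train=6, horizon=2, step=2):
        print(list(train), "->", list(test))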