+++ So You Want Git for Data? +++ All models are wrong, but some are completely wrong +++
Breakthrough — Or So They Say
Tools and Frameworks
So You Want Git for Data?

Tim Sehn, of Liquidata and DoltHub, has published an article on their blog that supposedly analyzes what "git for data" would actually mean. However, instead of coming up with a clear definition and a list of features, he compares current frameworks and finally concludes that their own product, Dolt, is the only one that fits the label.

What is Dolt? Dolt is a relational database, i.e. it has tables and you can execute SQL queries against them. It also has version control primitives that operate at the level of individual table cells: Dolt implements the Git command line and its associated operations on table rows instead of files. Data and schema are modified in the working set using SQL, and when you want to permanently store a version of the working set, you make a commit. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. Dolt is thus a database with fine-grained, value-wise version control, where all changes to data and schema are stored in the commit log. It is inspired by both the RDBMS and Git, and attempts to blend concepts from the two so that users can better manage, distribute, and collaborate on data.

At least as interesting as the commercial for their own tool is the fact that Tim Sehn is full of praise for Pachyderm, a data versioning tool that is geared more towards large-scale, file-based data pipelines for machine learning.
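To make the Git analogy concrete, here is a minimal sketch of that workflow, driving the dolt command line from Python with subprocess. It assumes dolt is installed and on the PATH; the employees table, its columns, and the sample rows are invented for illustration, and exact flags and output can vary between Dolt versions.

    import subprocess
    import tempfile

    def dolt(*args, cwd):
        """Run a dolt subcommand inside the repository directory and return its output."""
        result = subprocess.run(["dolt", *args], cwd=cwd, check=True,
                                capture_output=True, text=True)
        return result.stdout

    repo = tempfile.mkdtemp()

    # Initialise a Dolt repository (the analogue of `git init`).
    # Like Git, Dolt may first need user.name / user.email set via `dolt config`.
    dolt("init", cwd=repo)

    # Modify schema and data in the working set using plain SQL.
    dolt("sql", "-q",
         "CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(100), salary INT)",
         cwd=repo)
    dolt("sql", "-q",
         "INSERT INTO employees VALUES (1, 'Ada', 90000), (2, 'Grace', 95000)",
         cwd=repo)

    # Stage and commit the working set, mirroring `git add` / `git commit`.
    dolt("add", ".", cwd=repo)
    dolt("commit", "-m", "Add employees table", cwd=repo)

    # Change a single cell, then diff: Dolt reports the change cell by cell
    # rather than as an opaque file-level difference.
    dolt("sql", "-q", "UPDATE employees SET salary = 100000 WHERE id = 2", cwd=repo)
    print(dolt("diff", cwd=repo))

The shape of the workflow is the point: schema and data are edited with SQL in the working set, while add, commit, and diff behave like their Git counterparts, with diffs reported per cell.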
Business News and Applications
All models are wrong, but some are completely wrong

I like this article not least for its title. Written by Martin Goodson for the Royal Statistical Society Data Science Section, it laments the fact that, in a time when epidemic forecasting informs policy and even individual decision-making, scientists and journalists have completely failed to communicate how these models work.

Epidemiologists are making the same mistakes that the climate science community made a decade ago. A series of crises forced climatologists to learn painful lessons on how (not) to communicate with policy-makers and the public. In 2010 the 4th IPCC report was attacked for containing a single error: a claim that the Himalayan glaciers would likely have completely melted by 2035 ('Glacier Gate'). Climate denialists and recalcitrant nations such as Russia and Saudi Arabia seized on this error as a way to discredit the entire 3000-page report, which was otherwise irreproachable. By the time of the 5th IPCC report, mechanisms had been developed to enforce clear communication about the uncertainty surrounding predictive models, and transparency about models and data. The infectious disease community needs to learn these lessons. And learn them quickly.

The proposed guidelines for scientists and journalists are: (1) express the level of uncertainty associated with a forecast; (2) get quotes from other experts before publishing; (3) describe the critical inputs and assumptions of their models; (4) be as transparent as possible; (5) use multiple models to inform policy; and (6) indicate when a model was produced by somebody without a background in infectious diseases.