MMMLM: Multi-Model for Machine Learning Metadata
With the rapid rise of data science, machine learning platforms are becoming increasingly complex. You might be aware of the importance of high-quality training data for machine learning, but in a production setup the metadata is equally important, including: different dataset versions, varying versions of Jupyter notebooks, different training parameters, test/training accuracy, different features, model-serving statistics, and more.
It is critical to have a common view across all of this metadata in a production setup: we need to determine which Jupyter notebook was used to build the model currently running in production, whether there is new data for a given dataset, and whether the models currently serving in production need to be updated.
One particular challenge with metadata is choosing the data model. Individual entities, such as the statistics of a training run, are typically unstructured and a perfect fit for a document database. Capturing the relations between entities, however, such as determining which models have been derived from a given dataset, is a natural fit for a graph model, as sketched below.
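To make the contrast concrete, here is a minimal sketch using the python-arango driver against a local ArangoDB instance; the collection names, documents, and credentials are all hypothetical. It stores unstructured training-run statistics as plain documents and model-to-dataset lineage as graph edges, so the lineage question becomes a single AQL traversal.

```python
from arango import ArangoClient

# Connect to a local ArangoDB instance (hypothetical database and credentials).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("ml_metadata", username="root", password="")

# Document collections for unstructured entities, one edge collection for lineage.
for name in ("datasets", "models", "runs"):
    if not db.has_collection(name):
        db.create_collection(name)
if not db.has_collection("derived_from"):
    db.create_collection("derived_from", edge=True)

# A training run's statistics are schema-free: a natural document.
db.collection("runs").insert({
    "_key": "run-42",
    "notebook": "train_v3.ipynb",
    "params": {"lr": 0.01, "epochs": 20},
    "metrics": {"train_acc": 0.97, "test_acc": 0.93},
})

# Lineage is relational: an edge from the model to the dataset it was trained on.
db.collection("datasets").insert({"_key": "imagenet-v2"})
db.collection("models").insert({"_key": "resnet50-2019-06"})
db.collection("derived_from").insert({
    "_from": "models/resnet50-2019-06",
    "_to": "datasets/imagenet-v2",
})

# "Which models have been derived from this dataset?" is one graph traversal.
cursor = db.aql.execute(
    "FOR m IN 1..1 INBOUND @dataset derived_from RETURN m._key",
    bind_vars={"dataset": "datasets/imagenet-v2"},
)
print(list(cursor))  # ['resnet50-2019-06']
```

With both shapes of data in one multi-model store, the document query and the graph traversal run against the same database rather than being stitched together across two systems.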
In this presentation we will explore how a multi-model approach can help us manage ML pipelines at scale, in particular ML Metadata (MLMD), part of the TensorFlow ecosystem. We will also propose a first draft of an MLMD-compatible universal Metadata API and demo a first implementation of this API using ArangoDB.
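For reference, recording metadata with the existing MLMD Python API looks roughly like the following minimal sketch, backed here by SQLite; the type names, properties, and URIs are illustrative. An MLMD-compatible Metadata API would expose the same artifact/execution/event concepts on top of a multi-model store such as ArangoDB.

```python
from ml_metadata.metadata_store import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Back the store with a local SQLite file (connection_mode 3 = read/write, create).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "metadata.db"
config.sqlite.connection_mode = 3
store = metadata_store.MetadataStore(config)

# Register an artifact type for datasets (property names are illustrative).
dataset_type = metadata_store_pb2.ArtifactType()
dataset_type.name = "DataSet"
dataset_type.properties["version"] = metadata_store_pb2.INT
dataset_type_id = store.put_artifact_type(dataset_type)

# Record a concrete dataset artifact.
dataset = metadata_store_pb2.Artifact()
dataset.type_id = dataset_type_id
dataset.uri = "/data/imagenet-v2"
dataset.properties["version"].int_value = 2
[dataset_id] = store.put_artifacts([dataset])

# Record a training execution and link the dataset as its declared input,
# which is how MLMD captures lineage between artifacts and runs.
trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = "Trainer"
trainer_type_id = store.put_execution_type(trainer_type)

run = metadata_store_pb2.Execution()
run.type_id = trainer_type_id
[run_id] = store.put_executions([run])

event = metadata_store_pb2.Event()
event.artifact_id = dataset_id
event.execution_id = run_id
event.type = metadata_store_pb2.Event.DECLARED_INPUT
store.put_events([event])
```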
Jörg Schad is Head of Machine Learning at ArangoDB. In a previous life, he built machine learning pipelines in healthcare, worked on distributed systems at Mesosphere, and worked on in-memory databases.
He received his Ph.D. for research on distributed databases and data analytics. Jörg is a frequent speaker at meetups, international conferences, and lecture halls.