Over the past two years, I have spent about 70% of my working time building, breaking, and fixing data products. This article is a brief retrospective on what I have learned about building such systems end to end, and on which tools can be plugged in as components.
Goal of a Data System
We use data to understand reality and improve our product. This is the primary goal of a data/metric system. A good data system answers questions, a better one identifies root causes, and an even better one helps improve the whole system directly.
At Yahoo!, the data platform I work on mainly supports a personalization system (a recommendation system). As the recommendation system iterates, we track and anticipate the actual use cases the team has for understanding or improving it. The major use cases for our system include:
- Understand system performance through reports on key metrics
- Detect and identify metric anomalies and data pipeline failures
- Collect user feedback data to improve the online system in short cycles
- Make it easy for PMs, developers, and scientists to play with data
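To make the anomaly-detection use case in the list above concrete, here is a minimal sketch in Python. It is not the method used in our platform: the function name `find_anomalies`, the 7-point trailing window, the 3-sigma threshold, and the sample click-through-rate values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, threshold=3.0):
    """Return indices of points that deviate from the trailing-window mean
    by more than `threshold` sample standard deviations.

    Hypothetical helper for illustration; window and threshold are assumed defaults.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]  # the `window` points before index i
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up daily click-through rates with one sharp drop at index 8.
daily_ctr = [0.051, 0.049, 0.050, 0.052, 0.048, 0.050, 0.051, 0.049, 0.020, 0.050]
print(find_anomalies(daily_ctr))  # → [8]
```

A sudden drop like this can come from a real change in user behavior or from a broken pipeline that delivered partial data, so a check of this kind only flags where to look. Note also that once the outlier enters the trailing window it inflates the standard deviation, which makes the following points harder to flag.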
At different stages, we focus on different aspects and use different tools and techniques to solve problems. Let me illustrate.