Evolution of Metric System Architecture


Over the past two years, I have spent about 70% of my working time building, breaking, and fixing data products. This article is a brief retrospective on what I have learned about building such systems end to end, and on which tools can be plugged in as components.

Goal of a Data System

We use data to understand reality and improve our product. This is the primary goal of a data/metric system. A good data system answers questions, a better one identifies root causes, and an even better one helps improve the whole system directly.

Use cases

At Yahoo!, the data platform I work on mainly supports a personalization (recommendation) system. As the recommendation system iterates, we track and forecast the actual use cases the team needs in order to understand or improve it. The major use cases for our system include:

  • Understand system performance through reports on key metrics
  • Detect and identify metric anomalies and data pipeline failures
  • Collect user feedback data to improve the system online in short cycles
  • Make it easy for PMs, developers, and scientists to play with data

At different stages, we focus on different aspects and use different tools and techniques to solve problems. Let me illustrate.


Apache Pig in Practice 1

I have written many Pig scripts in the past few months and explored some tricks with my buddies. I hope they help someone.

Let’s focus on some interesting topics in this first article and get prepared for the later Pig rush.

IDE & Environment


I use Vim for most scripting languages, and these are my favourite plugins for writing Pig:

  • Pig Syntax Highlight. Last updated in June 2014; supports Pig 0.12.
  • YouCompleteMe. The best auto-complete plugin ever. If you don’t use a Mac, Supertab is also a reasonable choice.
  • Tabularize. Aligns Pig code and keeps it tidy. The most common usage is :Tab/AS to align a FOREACH ... GENERATE clause.

To debug faster, I like to run Pig with a shortcut. My approach is simple: add the following to .vimrc to run the current file with F5.

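" Press F5 to run the current file according to its filetype
map <F5> :call Compile_Run()<CR>
function! Compile_Run()
  if &filetype ==# "coffee"
    !coffee % 2>&1
  elseif &filetype ==# "cpp"
    " Run the binary only if compilation succeeds
    !g++ -g -o %< % && ./%<
  elseif &filetype ==# "python"
    !python %
  elseif &filetype ==# "pig"
    !./run_pig.sh %
  endif
endfunction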

Revise run_pig.sh as you like. The general idea is to reduce redundant work and typos.
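For readers who want a starting point, here is a minimal sketch of what such a run_pig.sh could look like. It is not the script I actually use; the PIG_MODE and DRY_RUN environment variables and the function names are assumptions made up for this illustration. It only relies on the standard `pig -x <mode> <script>` invocation and defaults to local mode, which is handy when debugging against sample data.

```shell
#!/bin/sh
# run_pig.sh (hypothetical sketch): called from Vim as ./run_pig.sh <script.pig>
set -eu

# Assumed environment variables, invented for this example:
#   PIG_MODE  execution mode passed to `pig -x` (default: local)
#   DRY_RUN   when set to 1, print the command instead of running it

pig_command() {
  # Print the command line that would run the given script.
  printf 'pig -x %s %s\n' "${PIG_MODE:-local}" "$1"
}

run_pig() {
  script="$1"
  # Catch typos in the path before starting Pig.
  if [ ! -f "$script" ]; then
    echo "run_pig.sh: no such script: $script" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    pig_command "$script"
  else
    pig -x "${PIG_MODE:-local}" "$script"
  fi
}

# Only run when a script path is supplied, so the file can also be sourced.
if [ "$#" -ge 1 ]; then
  run_pig "$1"
fi
```

With this in place, `DRY_RUN=1 ./run_pig.sh my_script.pig` prints the command without starting Pig, and `PIG_MODE=mapreduce ./run_pig.sh my_script.pig` runs the same script on the cluster. Keeping the mode in one variable means the F5 mapping never has to change.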
