Robustness Gym: real-world NLP benchmarking

A toolkit for Natural Language Inference researchers

Role

Principal UX/UI Designer / PM

Industry

Research and Development

Duration

3 months

Problem statement

Robustness Gym (RG) is a toolkit that research scientists use to test the robustness of their Natural Language Inference (NLI) models. Despite impressive performance on standard benchmarks, deep neural networks often fail when deployed to real-world systems. RG was created to address this gap: it is a simple, extensible toolkit that supports the entire spectrum of evaluation methodologies. I designed the UX and UI and coded most of the front end in React.

UX design and prototyping

The RG interface has five main panes. The left pane (settings) lets the user select the parameters of their experiment. Results update on the fly, so there is no need for a “Go” button. The top center pane is a scatter plot for quick visual comparison of model performance by problem class. The bottom center pane lets users sort by column to compare different facets of each subpopulation. The top right pane shows an overall “robustness score” indicating how well the selected item (model or subpopulation) performed. The bottom right pane shows the confusion matrices of the different models on the selected subpopulation.

Front end implementation

I built the first draft of the front end using React and Bootstrap. The API is a Python ML agent developed by my colleague at Stanford. Because the backend was being developed concurrently, I built a Flask test server to mimic it.
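
To give a sense of what such a stand-in backend can look like, here is a minimal sketch. It is not the project's code: the original test server used Flask, while this version uses only Python's standard library so it runs without installing anything. The /results endpoint, the model and subpopulation query parameters, the payload fields, and the plain accuracy figure are all hypothetical, and the numbers it returns are made up.

```python
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# The three standard NLI labels.
LABELS = ["entailment", "neutral", "contradiction"]


def mock_results(model: str, subpopulation: str) -> dict:
    """Build a fake but repeatable payload for one model/subpopulation pair."""
    # Seeding with the inputs means the same request always gets the same reply.
    rng = random.Random(f"{model}:{subpopulation}")
    # Rows are true labels, columns are predicted labels; counts are invented.
    matrix = [[rng.randint(0, 50) for _ in LABELS] for _ in LABELS]
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(LABELS)))
    return {
        "model": model,
        "subpopulation": subpopulation,
        "labels": LABELS,
        "confusion_matrix": matrix,
        "accuracy": correct / total if total else 0.0,
    }


class MockHandler(BaseHTTPRequestHandler):
    """Serves GET /results?model=...&subpopulation=... as JSON (hypothetical route)."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/results":
            self.send_error(404)
            return
        query = parse_qs(url.query)
        payload = mock_results(
            query.get("model", ["model-a"])[0],
            query.get("subpopulation", ["all"])[0],
        )
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Let a front end served from another port call this mock.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 5000) -> None:
    """Run the mock server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), MockHandler).serve_forever()
```

Calling serve() starts the mock on port 5000, and the front end can then request /results?model=model-a&subpopulation=all while the real backend is still in progress. Because the payload is seeded from the request parameters, the UI shows the same fake numbers on every reload, which makes layout work easier to check.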
