A toolkit for Natural Language Inference researchers
Role
Industry
Duration
Problem statement
Robustness Gym (RG) is a toolkit that research scientists use to test the robustness of their Natural Language Inference (NLI) models. Despite impressive performance on standard benchmarks, deep neural networks often fail when deployed in real-world systems. RG was created to address this gap: it is a simple, extensible toolkit that supports the entire spectrum of evaluation methodologies. I designed the UX and UI and coded most of the front end in React.
UX design and prototyping
The RG interface has five main panes. The left pane (settings) lets the user select the parameters of their experiment. Results update on the fly, so there is no need for a “Go” button. The top center pane uses a scatter plot for quick visual comparison of model performance by problem class. The bottom center pane lets users sort columns by different facets of each subpopulation. The top right pane shows an overall “robustness score” indicating how well the selected item (model or subpopulation) performed. The bottom right pane shows the confusion matrices for the different models on the selected subpopulation.
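To make the confusion-matrix pane concrete, here is a minimal sketch of how such a matrix could be computed for one model on one subpopulation. It assumes the three standard NLI labels; the function name and the sample data are hypothetical and are not taken from the RG codebase.

```python
from collections import Counter

# The three standard NLI labels (assumed; RG's internal label names may differ).
LABELS = ("entailment", "neutral", "contradiction")


def confusion_matrix(gold, predicted, labels=LABELS):
    """Return a nested dict where matrix[gold_label][predicted_label] is a count."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Count each (gold, predicted) pair once; missing pairs default to 0.
    counts = Counter(zip(gold, predicted))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}


# Hypothetical gold labels and model predictions for a small subpopulation.
gold = ["entailment", "neutral", "contradiction", "entailment"]
pred = ["entailment", "contradiction", "contradiction", "neutral"]

matrix = confusion_matrix(gold, pred)
for gold_label, row in matrix.items():
    print(gold_label, row)
```

Each row corresponds to a gold label and each column to a predicted label, so the diagonal holds the correct predictions and the off-diagonal cells show which classes a model confuses.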
Front end implementation
I built the first draft of the front end using React and Bootstrap. The back-end API is a Python ML agent developed by my colleague at Stanford. Because the back end was being developed concurrently, I built a Flask test server to mimic it.
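A mock server of this kind can be very small. The sketch below shows the general idea rather than the actual test server: the route, the query parameter, and the canned records are all hypothetical, and the numbers are placeholders, not real results.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Canned placeholder records standing in for real back-end output.
# Model names, subpopulation names, and values are made up for illustration.
MOCK_RESULTS = [
    {"model": "model-a", "subpopulation": "negation", "accuracy": 0.5},
    {"model": "model-a", "subpopulation": "long-premise", "accuracy": 0.75},
    {"model": "model-b", "subpopulation": "negation", "accuracy": 0.25},
]


@app.route("/api/results")  # hypothetical endpoint
def results():
    """Return the canned records, optionally filtered by ?model=<name>."""
    model = request.args.get("model")
    rows = [r for r in MOCK_RESULTS if model is None or r["model"] == model]
    return jsonify(rows)
```

Saved as, say, mock_server.py, this can be started with `flask --app mock_server run` on recent Flask versions. The front end can then be pointed at the local server and switched to the real API later, provided both return the same JSON shape.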