Data Manipulation using Programming By Examples and Natural Language
Sumit Gulwani
Microsoft Research


Data is locked up in semi-structured formats (such as spreadsheets, text/log files, webpages, and PDF documents) in both the consumer and enterprise spaces. Getting data out of these documents and into structured formats that allow it to be explored and analyzed is time-consuming and error-prone. While data scientists typically spend 80% of their time cleaning data, programmatic solutions to data manipulation are beyond the expertise of the 99% of end users who do not know programming.

The paradigms of programming by examples (PBE) and programming by natural language (PBNL) have the potential to make data wrangling a delightful experience for the masses. To bring PBE and PBNL technologies to market, two key technical challenges need to be addressed: (a) developing efficient search algorithms that can explore the huge state space of programs to find those that match the user specification, and (b) developing effective ambiguity resolution techniques to deal with the inherent ambiguity in the user specification. Our state-of-the-art search algorithms employ deductive reasoning and domain-specific languages that restrict the search space to achieve real-time efficiency. Our ambiguity resolution techniques include machine-learning-based ranking of synthesized programs, support for navigating between synthesized programs paraphrased into structured English, and active-learning-based interaction models. In this talk, I will demo a few technologies that have been developed using these principles. Some of these technologies have also shipped as part of major Microsoft products.
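To make the two challenges concrete, here is a minimal sketch of programming by examples, not the algorithms used in the talk. It assumes a hypothetical three-part DSL (split on a delimiter, pick a field, optionally change case), finds consistent programs by brute-force enumeration rather than deductive reasoning, and uses a hand-written preference order as a stand-in for a learned ranker. All names (`run`, `synthesize`, `rank`) are illustrative.

```python
from itertools import product

# Hypothetical toy DSL: a program is (delimiter, index, case_op), meaning
# "split the input on delimiter, take field index, apply case_op".
DELIMITERS = [" ", ",", "-", "@", "."]
INDICES = [0, 1, 2, -1]
CASE_OPS = {"id": lambda s: s, "upper": str.upper, "lower": str.lower}


def run(program, text):
    """Evaluate a DSL program on one input; return None if it does not apply."""
    delimiter, index, case_op = program
    fields = text.split(delimiter)
    if len(fields) < 2 or not -len(fields) <= index < len(fields):
        return None
    return CASE_OPS[case_op](fields[index])


def synthesize(examples):
    """Return every DSL program consistent with all (input, output) examples."""
    candidates = product(DELIMITERS, INDICES, CASE_OPS)
    return [p for p in candidates
            if all(run(p, inp) == out for inp, out in examples)]


def rank(programs):
    """Stand-in for a learned ranker: prefer no case change, then non-negative indices."""
    return sorted(programs, key=lambda p: (p[2] != "id", p[1] < 0))


# One example is ambiguous: "second field" and "last field" both fit.
examples = [("Ada Lovelace", "Lovelace")]
programs = rank(synthesize(examples))
best = programs[0]

# A second example from the user removes the ambiguity.
refined = synthesize(examples + [("Alan Mathison Turing", "Turing")])
```

With the single example, two programs survive and the ranker picks the second-field program `(" ", 1, "id")`; after the second example, only the last-field program `(" ", -1, "id")` remains. The small DSL is what keeps the search space tiny here (60 candidates), which loosely mirrors the role the abstract assigns to domain-specific languages.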


Sumit Gulwani is a principal researcher at Microsoft Research and an adjunct faculty member in the Computer Science Department at IIT Kanpur, India. His research interests lie in the cross-disciplinary areas of automating end-user programming and building intelligent tutoring systems. He is a recipient of the ACM SIGPLAN Robin Milner Young Researcher Award. He obtained his PhD from UC Berkeley in 2005 and was awarded the ACM SIGPLAN Outstanding Doctoral Dissertation Award. He obtained his BTech from IIT Kanpur and was awarded the President’s Gold Medal.