Research Interests

My research interests lie in data management and analytics. Specifically, my thesis focuses on collaborative data management and manipulation, i.e., the efficient management and manipulation of collaborative datasets. I am also broadly interested in machine learning and interactive data analytics. In addition, I have research experience in approximate query processing in databases, automated machine learning, and text mining using deep learning, gained during my internships at Microsoft Research and Google Research. Please check here for more details.

Selected Research Projects

OrpheusDB [paper] [slides] [post] [website]

OrpheusDB is a hosted platform that bolts versioning capabilities onto traditional relational databases. With the increasing popularity of collaborative data analytics, hundreds of thousands of versions are acquired or constructed at various stages of data analysis, across many collaborating users, and often over long periods of time. Compared to existing version control systems such as Git/GitHub, OrpheusDB offers two crucial advantages: (a) compact storage and (b) a rich query language. Since OrpheusDB is built on top of PostgreSQL, it inherits many of the benefits of relational databases, while also compactly storing, keeping track of, and recreating versions on demand. In our current implementation, users interact with OrpheusDB via the command line, using both standard Git-style commands and a SQL-like query language. Please refer to our website for more details.
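To make the compact-storage idea concrete, here is a minimal sketch (purely illustrative names, a simplified model rather than OrpheusDB's actual schema): records are stored once in a shared table, and each version keeps only the set of record ids it contains, so versions that share most of their records cost little extra space.

```python
class VersionedTable:
    """Toy model of compact versioned storage: one shared record store
    plus a version -> record-id mapping, so records common to many
    versions are stored only once."""

    def __init__(self):
        self.records = {}    # rid -> record tuple, shared across versions
        self.versions = {}   # version id -> frozenset of rids
        self._by_value = {}  # record tuple -> rid, for deduplication

    def commit(self, rows):
        """Register a new version containing exactly `rows`."""
        rids = set()
        for row in rows:
            rid = self._by_value.get(row)
            if rid is None:              # unseen record: store it once
                rid = len(self.records)
                self.records[rid] = row
                self._by_value[row] = rid
            rids.add(rid)
        vid = len(self.versions) + 1
        self.versions[vid] = frozenset(rids)
        return vid

    def checkout(self, vid):
        """Materialize a version on demand."""
        return [self.records[rid] for rid in sorted(self.versions[vid])]
```

In this toy model, committing a second version that changes one record out of thousands stores just that one new record; OrpheusDB additionally partitions and indexes such mappings inside PostgreSQL so that checkouts and versioned queries remain fast.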

Genvisage [preprint] [webserver]

Genvisage rapidly identifies discriminative features for genomic data analysis. Given two classes of objects and an object-feature matrix, our goal is to find the top-k feature pairs that best separate the two classes. Many biological applications fit this framework. For instance, when exposed to a drug, one set of genes may be over-expressed while another set is not. Given the results of such a drug-response experiment, biologists often want to find features that characterize these differentially expressed genes. Our design principle is to prioritize running time over accuracy, so that Genvisage serves as a data exploration tool before investing in more time-consuming methods. We first propose a Rocchio-based separability metric for a given feature pair. We then develop a suite of optimization strategies to reduce the running time of finding the best feature pairs. Try our webserver!
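One plausible instantiation of such a metric (hypothetical code, not Genvisage's exact formulation): project the objects onto a candidate feature pair, compute the two class centroids, and score the pair by the fraction of objects that land nearer to their own class's centroid. The exhaustive scan below is the quadratic baseline that the optimizations aim to beat.

```python
import itertools

def rocchio_separability(pos_rows, neg_rows, i, j):
    """Score feature pair (i, j): fraction of objects closer to their own
    class centroid than to the other, using only these two features."""
    def centroid(rows):
        n = len(rows)
        return (sum(r[i] for r in rows) / n, sum(r[j] for r in rows) / n)

    def dist2(r, c):
        return (r[i] - c[0]) ** 2 + (r[j] - c[1]) ** 2

    c_pos, c_neg = centroid(pos_rows), centroid(neg_rows)
    correct = sum(dist2(r, c_pos) <= dist2(r, c_neg) for r in pos_rows)
    correct += sum(dist2(r, c_neg) < dist2(r, c_pos) for r in neg_rows)
    return correct / (len(pos_rows) + len(neg_rows))

def top_k_feature_pairs(pos_rows, neg_rows, k):
    """Exhaustive top-k search over all feature pairs -- the baseline that
    pruning and sampling optimizations try to avoid."""
    m = len(pos_rows[0])
    scored = [((i, j), rocchio_separability(pos_rows, neg_rows, i, j))
              for i, j in itertools.combinations(range(m), 2)]
    return sorted(scored, key=lambda p: -p[1])[:k]
```

With m features there are m(m-1)/2 candidate pairs, which is why even a cheap per-pair metric needs pruning strategies at genomic scale.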

Catamaran [paper]

Catamaran is a tool that extracts structure from semi-structured log datasets with no human supervision. Catamaran automatically identifies field and record endpoints, separates the structured parts from unstructured noise or formatting, and can tease apart multiple structures within a single dataset, allowing it to extract structured relational datasets from semi-structured logs efficiently, at scale, and with high accuracy. Compared to other unsupervised log extraction tools developed in prior work, Catamaran does not require record boundaries to be known beforehand, making it much more applicable to the noisy log files that are ubiquitous in data lakes.
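The unsupervised intuition can be illustrated with a toy sketch (hypothetical code; Catamaran's actual algorithm handles far more general record and field structures): pick the field delimiter whose per-line field count is most consistent, then peel off the lines that do not match as noise.

```python
from collections import Counter

def infer_delimiter(lines, candidates=(",", "\t", "|", ";")):
    """Choose the candidate delimiter whose per-line field count is most
    consistent -- a crude stand-in for unsupervised structure inference."""
    best, best_score = None, 0.0
    for d in candidates:
        counts = Counter(len(line.split(d)) for line in lines)
        n_fields, freq = counts.most_common(1)[0]
        if n_fields < 2:
            continue                   # delimiter never appears
        score = freq / len(lines)      # consistency of the field count
        if score > best_score:
            best, best_score = d, score
    return best

def separate_structure(lines, delim):
    """Split lines into structured records (dominant field count) and
    leftover noise/formatting lines."""
    counts = Counter(len(line.split(delim)) for line in lines)
    n_fields = counts.most_common(1)[0][0]
    records = [line.split(delim) for line in lines
               if len(line.split(delim)) == n_fields]
    noise = [line for line in lines if len(line.split(delim)) != n_fields]
    return records, noise
```

Note that nothing here needs labeled examples or known record boundaries: the structure is inferred purely from regularity in the data, which is the property the full system generalizes to multi-line records and nested structures.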

Principles of Dataset Versioning [paper]

In this project, we study the fundamental problem underlying dataset versioning: the storage-recreation trade-off, which arises with the proliferation of hundreds or thousands of versions of the same datasets in many scientific and commercial domains. The challenge can be stated as follows: the more storage we use, the faster it is to recreate or retrieve versions; the less storage we use, the slower recreation and retrieval become. We study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in different ways; demonstrate that most of the problems are intractable; and propose a suite of inexpensive heuristics, drawing on techniques from the delay-constrained scheduling and spanning tree literature, to solve them.
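The storage-minimizing extreme of this trade-off has a clean graph formulation, sketched below (illustrative code, assuming delta sizes are already known; the paper's six problem variants and heuristics go well beyond this): connect a dummy root to every version by its full-materialization cost, add an edge per available delta, and read a minimum spanning tree off as "store in full" versus "store as a delta from the parent".

```python
import heapq
import itertools

def min_storage_plan(full_sizes, delta_sizes):
    """Minimum-storage plan via Prim's algorithm on the delta graph.
    full_sizes:  {version: size if stored in full}
    delta_sizes: {(u, v): size of the delta between u and v}
    Returns ({version: parent or None}, total storage); parent None means
    "store in full". Recreation cost is ignored -- this is one extreme."""
    tie = itertools.count()            # tiebreaker so the heap never
    heap = [(size, next(tie), v, None)  # compares versions/parents
            for v, size in full_sizes.items()]
    heapq.heapify(heap)
    plan, total = {}, 0
    while heap:
        cost, _, v, parent = heapq.heappop(heap)
        if v in plan:
            continue                   # already attached more cheaply
        plan[v] = parent
        total += cost
        for (a, b), d in delta_sizes.items():
            w = b if a == v else a if b == v else None
            if w is not None and w not in plan:
                heapq.heappush(heap, (d, next(tie), w, v))
    return plan, total
```

This minimizes total storage but can chain many deltas together, making some versions slow to recreate; bounding that recreation cost is exactly what makes the other problem variants hard.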

Ongoing Research Projects

Reconnect Datasets in a Working Repository

Data scientists typically refine their data transformation and feature engineering processes in a trial-and-error manner, generating various data artifacts along the way. In practice, however, little or no lineage information is maintained when each artifact is generated, hindering future developmental insights, data sharing and discovery, and even the reproducibility of analytical results. In this project, we aim to “reconnect” these versioned datasets, presenting users with a summarized view of the relationships among them. In particular, to describe the relationship between dataset pairs, we design delta representations at various granularities and demonstrate their effectiveness by examining common operations in crawled Python notebooks. Furthermore, we propose to compute these delta representations efficiently by exploiting sketching and sampling techniques.
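As a flavor of the sketching idea (hypothetical code; the project's actual delta representations are richer than raw row overlap): MinHash signatures let us estimate how much two dataset versions overlap, and hence roughly how small a delta connecting them would be, without ever comparing them row by row.

```python
import hashlib

def minhash_signature(rows, num_hashes=64):
    """Constant-size MinHash signature of a set of rows: for each of
    num_hashes seeded hash functions, keep the minimum hash value."""
    sig = [None] * num_hashes
    for row in rows:
        for i in range(num_hashes):
            h = int(hashlib.md5(f"{i}:{row}".encode()).hexdigest(), 16)
            if sig[i] is None or h < sig[i]:
                sig[i] = h
    return sig

def estimated_overlap(sig_a, sig_b):
    """Estimate of the Jaccard similarity between the two row sets; high
    overlap suggests a small delta connects the two versions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because signatures have constant size, pairwise overlap estimates across an entire repository of versions stay cheap even when individual datasets are large, which is what makes reconnecting many unlinked artifacts feasible.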