Understanding the provenance of data is crucial for maintaining data quality, accuracy, and trustworthiness. Provenance refers to the origin and history of data: who created it, when, and how it was transformed along the way. This article summarizes an approach called Graph Segmentation and Summarization (GSS) for understanding provenance across the data science lifecycle.
The authors, H. Miao and A. Deshpande, explain that traditional data provenance methods struggle with complex data systems and large graph structures. They propose GSS as a solution: it segments and summarizes a provenance graph into smaller subgraphs, which can be analyzed independently or combined for a comprehensive understanding of the data’s history.
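To make the idea of segmenting and summarizing a provenance graph concrete, here is a minimal sketch in Python. It is not the authors' implementation: the toy graph, the node kinds, and the functions `segment` and `summarize` are all hypothetical, and the sketch substitutes plain backward reachability and grouping by node kind for whatever GSS does internally.

```python
from collections import defaultdict, deque

# Hypothetical toy provenance graph. Each edge points from a node to the
# node it was derived from (or, for an activity, the input it used).
EDGES = [
    ("model_v1", "train"),
    ("train", "features"),
    ("features", "clean"),
    ("clean", "raw.csv"),
    ("report", "plot"),
    ("plot", "raw.csv"),
]

# Hypothetical node kinds: data items are entities, processing steps are activities.
KIND = {
    "raw.csv": "entity", "features": "entity",
    "model_v1": "entity", "report": "entity",
    "clean": "activity", "train": "activity", "plot": "activity",
}


def segment(edges, target):
    """Return the edges of the subgraph that `target` transitively depends on."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)

    # Breadth-first search backward through the derivation history.
    seen, queue = {target}, deque([target])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return [(c, p) for c, p in edges if c in seen and p in seen]


def summarize(edges, kind):
    """Collapse nodes by kind and count the edges between each pair of kinds."""
    counts = defaultdict(int)
    for child, parent in edges:
        counts[(kind[child], kind[parent])] += 1
    return dict(counts)


if __name__ == "__main__":
    history = segment(EDGES, "model_v1")
    print(history)
    print(summarize(history, KIND))
```

Asking for the history of `model_v1` keeps the four edges on its derivation chain and drops the unrelated `report` branch; the summary then reduces that segment to edge counts between node kinds. A real system would need richer edge types and a more careful notion of which nodes are similar enough to merge.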
The authors present two key techniques in GSS: (1) graph segmentation using Kronecker products and (2) graph summarization using spectral clustering. Together, these yield the smaller subgraphs on which the analysis operates.
The authors evaluate GSS through experiments on real-world datasets and compare it with existing methods. They report that GSS outperforms those methods in scalability, accuracy, and efficiency.
In summary, GSS breaks a provenance graph into smaller segmented and summarized subgraphs that together give a comprehensive view of the data’s history, and the reported experiments show gains over existing methods. This work has significant implications for maintaining data quality and trustworthiness in today’s complex data systems.