In the field of artificial intelligence, data synthesis is a crucial step in creating datasets used to evaluate and improve the reasoning abilities of large language models (LLMs). This article discusses two methodologies for data synthesis: graph data synthesis and linear data synthesis.
Graph Data Synthesis
In graph data synthesis, complexity increases through a series of difficulty levels defined by parameters such as the number of vertices, the number of edges, and the range of edge weights. The process generates individual graph instances with a generative function that follows graph theory principles, then iteratively produces multiple graphs at each level using a batch synthesis function. Finally, the synthesized graphs are saved in a tabular format for subsequent analysis.
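The steps above can be sketched in Python. The level schedule, function names, and specific parameter ranges below are illustrative assumptions, not the article's actual implementation:

```python
import csv
import random

# Hypothetical difficulty schedule: each level bounds the number of
# vertices and edges and the edge-weight range (values are assumptions).
LEVELS = {
    1: {"vertices": (4, 6), "edges": (4, 8), "weight": (1, 10)},
    2: {"vertices": (8, 12), "edges": (10, 20), "weight": (1, 50)},
    3: {"vertices": (15, 25), "edges": (25, 60), "weight": (1, 100)},
}

def generate_graph(level, rng=random):
    """Generate one weighted graph instance at the given difficulty level."""
    cfg = LEVELS[level]
    n = rng.randint(*cfg["vertices"])
    m = rng.randint(*cfg["edges"])
    edges = []
    for _ in range(m):
        u, v = rng.sample(range(n), 2)          # distinct endpoints, no self-loops
        w = rng.randint(*cfg["weight"])
        edges.append((u, v, w))
    return {"level": level, "vertices": n, "edges": edges}

def synthesize_batch(level, count, rng=random):
    """Iteratively produce multiple graph instances at one level."""
    return [generate_graph(level, rng) for _ in range(count)]

def save_graphs(graphs, path):
    """Save the synthesized graphs in a tabular (CSV) format."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["level", "vertices", "u", "v", "weight"])
        for g in graphs:
            for u, v, w in g["edges"]:
                writer.writerow([g["level"], g["vertices"], u, v, w])
```

Storing one edge per row keeps the tabular file easy to reload and filter by level during later analysis.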
Linear Data Synthesis
In linear data synthesis, complexity is modulated by varying the length of the data array and the range of its elements. The process begins with short arrays of small element values at lower difficulty levels, progressing to longer arrays with wider element ranges at higher levels.
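This scaling can be sketched as follows. The mapping from level to array length and element bound is a hypothetical example of one such schedule, not a scheme given in the article:

```python
import random

def level_config(level):
    """Map a difficulty level to (array length, element value bound).

    Assumed schedule: length grows linearly, element range exponentially.
    """
    return 5 * level, 10 ** level

def generate_array(level, rng=random):
    """Generate one array whose length and element range scale with level."""
    length, bound = level_config(level)
    return [rng.randint(0, bound) for _ in range(length)]
```

At level 1 this yields 5-element arrays with values up to 10; at level 3, 15-element arrays with values up to 1000, so both axes of difficulty grow together.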
By applying these methodologies, researchers can create diverse datasets of controlled difficulty for evaluating LLMs' reasoning abilities, making it easier to pinpoint where reasoning breaks down and to improve AI language models.