Program synthesis for automated data wrangling is the application of synthesis techniques to automatically generate executable scripts or queries (for example, SQL, Pandas code, or regular expressions) that transform raw, messy data into a clean, analysis-ready format. The specification is typically provided via input-output examples (e.g., a user demonstrates a few row transformations), natural language descriptions, or constraints on the desired output schema. This automates the tedious, error-prone process of data cleaning and feature engineering by translating user intent into executable code that can be checked against the specification.
Program Synthesis for Automated Data Wrangling

What is Program Synthesis for Automated Data Wrangling?
A subfield of program synthesis focused on automatically generating code to clean, transform, and integrate raw data.
Core techniques include Programming by Example (PBE), as seen in tools like Microsoft Excel's FlashFill, and neurosymbolic methods that combine neural networks for interpreting ambiguous intent with symbolic solvers to guarantee logical correctness. The synthesizer searches a space defined by a Domain-Specific Language (DSL) of data manipulation primitives to find a program that satisfies all given specifications. This enables reproducible, auditable data pipelines and significantly reduces the time data scientists and engineers spend on data preprocessing.
Core Technical Approaches
Program synthesis automates the creation of data transformation scripts by inferring intent from high-level specifications. This glossary details the core technical paradigms that power these systems.
Programming by Example (PBE)
A synthesis paradigm where the user provides concrete input-output pairs, and the system infers a general program that satisfies all examples. This is highly intuitive for data cleaning tasks.
- Example: In a spreadsheet, showing that '2023-01-15' should become 'Jan 15, 2023'.
- Key System: FlashFill, integrated into Microsoft Excel, popularized this approach for string transformations.
- Challenge: Requires robust generalization from few examples to handle unseen data variations.
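To make the generalization challenge concrete, here is a minimal sketch of example-driven synthesis over a tiny string DSL. The DSL (pick one whitespace-separated token, then apply a case transform) and the names `TOKEN_INDICES`, `TRANSFORMS`, `run`, and `synthesize` are assumptions made for this illustration; they are not how FlashFill or any particular tool is implemented.

```python
from itertools import product

# Hypothetical mini-DSL: a program is (token_index, transform_name).
TOKEN_INDICES = [0, 1, -1]
TRANSFORMS = {
    "identity": lambda s: s,
    "upper": str.upper,
    "lower": str.lower,
    "initial": lambda s: s[:1],
}


def run(program, text):
    """Execute a DSL program on one input string; None if the token is missing."""
    index, transform_name = program
    tokens = text.split()
    try:
        token = tokens[index]
    except IndexError:
        return None
    return TRANSFORMS[transform_name](token)


def synthesize(examples):
    """Return the first program (in enumeration order) consistent with all examples."""
    for program in product(TOKEN_INDICES, TRANSFORMS):
        if all(run(program, inp) == out for inp, out in examples):
            return program
    return None


# One example is ambiguous: "second token" and "last token" both fit.
ambiguous = synthesize([("jane doe", "DOE")])

# A second example with a middle name rules out "second token".
examples = [("jane doe", "DOE"), ("john q smith", "SMITH")]
program = synthesize(examples)
```

With a single example the search settles on the second token, which would break on names with a middle initial; adding one more example steers it to the last token. This is the kind of under-specification a practical PBE system has to resolve, typically through ranking or by asking for more examples.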
Syntax-Guided Synthesis (SyGuS)
A formal framework that constrains the program search space using a context-free grammar and verifies candidates against a logical specification. It brings rigor to synthesis for data wrangling.
- Grammar: Defines the allowed operations (e.g., string functions, regex patterns, arithmetic).
- Solver-Based: Often uses Satisfiability Modulo Theories (SMT) solvers like Z3 to find a correct program.
- Use Case: Generating a correct regular expression or a SQL query from a formal description of the desired output format.
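The grammar-plus-specification idea can be sketched in a few lines. This is only an illustration under stated assumptions: the grammar `E ::= x | strip(E) | lower(E) | upper(E)` is invented for the example, and the specification is checked by running candidates on a finite set of inputs rather than by an SMT solver such as Z3, so it offers bounded checking, not a proof.

```python
from itertools import product

# Assumed grammar: E ::= x | strip(E) | lower(E) | upper(E)
# A program is the sequence of productions applied to x, innermost first.
PRODUCTIONS = {"strip": str.strip, "lower": str.lower, "upper": str.upper}


def evaluate(program, x):
    """Apply each production in order to the input string."""
    for name in program:
        x = PRODUCTIONS[name](x)
    return x


def spec(inp, out):
    """Logical specification relating input and output:
    no surrounding whitespace, all lowercase, same characters ignoring case."""
    return (
        out == out.strip()
        and out == out.lower()
        and out.upper() == inp.strip().upper()
    )


def synthesize(inputs, max_depth=3):
    """Enumerate programs by increasing depth; return the first that meets the spec."""
    for depth in range(max_depth + 1):
        for program in product(PRODUCTIONS, repeat=depth):
            if all(spec(x, evaluate(program, x)) for x in inputs):
                return program
    return None


test_inputs = ["  Alice ", "BOB", "carol  "]
program = synthesize(test_inputs)
```

Enumerating by depth returns a smallest program in the grammar that satisfies the specification on the chosen inputs. A solver-based SyGuS engine would instead reason about all inputs symbolically.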
Neurosymbolic Synthesis
A hybrid approach combining neural networks for learning from ambiguous inputs (like natural language descriptions) with symbolic reasoning to ensure logical correctness. This is ideal for translating user intent into precise code.
- Neural Component: Interprets a user's request, e.g., 'extract the product code from the description'.
- Symbolic Component: Searches a space of valid Pandas or SQL operations to construct a program that fulfills the interpreted intent.
- Benefit: Bridges the flexibility of learning with the guarantees of formal methods.
Sketch-Based Synthesis
A technique where the user provides a partial program (a sketch) with intentional holes, and the synthesizer fills these holes with code fragments. This balances user control with automation.
- Sketch Example: df['new_col'] = df['col_a'] ? df['col_b'], where ? is a hole for an operator.
- Synthesizer's Role: Searches for operators (e.g., +, CONCAT(), -) that satisfy given constraints or examples.
- Advantage: Allows domain experts to guide the synthesis using their knowledge of the required program structure.
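The hole-filling step can be sketched as a search over candidate operators. In this illustration plain Python lists stand in for DataFrame columns, and the candidate set (`+`, `-`, `*`) and the helper name `fill_hole` are assumptions chosen for the example.

```python
import operator

# Assumed candidate operators for the single hole in: new_col = col_a ? col_b
HOLE_CANDIDATES = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
}


def fill_hole(col_a, col_b, expected):
    """Return the first operator symbol whose row-wise result matches `expected`."""
    for symbol, op in HOLE_CANDIDATES.items():
        result = [op(a, b) for a, b in zip(col_a, col_b)]
        if result == expected:
            return symbol
    return None


col_a = [10, 7, 3]
col_b = [4, 2, 3]
symbol = fill_hole(col_a, col_b, expected=[6, 5, 0])
```

Because the user fixed the program's structure, the synthesizer only has to try three candidates here instead of searching over whole programs.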
Large Language Model (LLM) Based Synthesis
Uses foundation models like GPT-4 or Code Llama, prompted with few-shot examples or instructions, to generate data wrangling code directly. This represents a shift towards leveraging vast pre-trained knowledge.
- Method: Prompting with natural language (e.g., 'Write a Python function to clean phone numbers') and optionally including input-output examples.
- Strengths: Exceptional flexibility and ability to handle vague or complex specifications.
- Limitations: Outputs are probabilistic and require validation; lack formal correctness guarantees without additional verification steps.
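One common way to address the validation gap is to run generated code against input-output examples before accepting it. The sketch below assumes a model has already returned the source in `CANDIDATE_SOURCE`; that string, the function name `clean_phone`, and the helpers `load_candidate` and `validate` are hypothetical, and no real model API is called.

```python
# Hypothetical model output; in practice this string would come from an LLM.
CANDIDATE_SOURCE = '''
def clean_phone(raw):
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:]
'''


def load_candidate(source, func_name):
    """Execute generated source and return the named function.
    Untrusted code should only be executed inside a sandbox."""
    namespace = {}
    exec(source, namespace)
    return namespace[func_name]


def validate(func, examples):
    """Return a list of (input, expected, actual) for every failing example."""
    failures = []
    for raw, expected in examples:
        try:
            actual = func(raw)
        except Exception as exc:  # a crash counts as a failure, not a pass
            actual = f"raised {type(exc).__name__}"
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


examples = [
    ("(555) 123-4567", "5551234567"),
    ("+1 555.123.4567", "5551234567"),
]
clean_phone = load_candidate(CANDIDATE_SOURCE, "clean_phone")
failures = validate(clean_phone, examples)
```

An empty `failures` list only shows that the candidate agrees with the supplied examples. It is not a formal guarantee, which is why such checks are usually combined with further tests or verification.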
Domain-Specific Language (DSL) Synthesis
The automatic creation of programs within a custom, tailored language whose primitives are designed for a specific data wrangling domain. This dramatically narrows the search space.
- DSL Examples: A language built only for table transformations, time-series alignment, or JSON normalization.
- Process: The synthesizer searches over combinations of DSL primitives to meet the specification.
- Benefit: Increases synthesis speed and success rate by eliminating irrelevant general-purpose code constructs from consideration.
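As an illustration of how a narrow DSL keeps the search small, the sketch below defines three row-level primitives (`rename`, `drop`, `strip`) and searches over short sequences of them. The primitives, the dictionary-based row representation, and the function names are all assumptions made for this example.

```python
from itertools import product


# Hypothetical DSL primitives; each returns a function from row dict to row dict.
def rename(old, new):
    return lambda row: {(new if k == old else k): v for k, v in row.items()}


def drop(col):
    return lambda row: {k: v for k, v in row.items() if k != col}


def strip(col):
    return lambda row: {
        k: (v.strip() if k == col and isinstance(v, str) else v)
        for k, v in row.items()
    }


def primitives(columns, target_columns):
    """Instantiate every primitive over the known input and target column names."""
    prims = {}
    for c in columns:
        prims[f"drop({c})"] = drop(c)
        prims[f"strip({c})"] = strip(c)
        for t in target_columns:
            if t not in columns:
                prims[f"rename({c}->{t})"] = rename(c, t)
    return prims


def synthesize(in_row, out_row, max_len=3):
    """Search sequences of primitives, shortest first, that map in_row to out_row."""
    prims = primitives(list(in_row), list(out_row))
    for length in range(1, max_len + 1):
        for names in product(prims, repeat=length):
            row = in_row
            for name in names:
                row = prims[name](row)
            if row == out_row:
                return list(names)
    return None


in_row = {"nm": " Ada ", "tmp": 1}
out_row = {"name": "Ada"}
program = synthesize(in_row, out_row)
```

With six instantiated primitives, the search over programs of up to three steps covers at most 6 + 36 + 216 candidates, which is what makes exhaustive enumeration feasible in a tailored language.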
Frequently Asked Questions
Program synthesis automates the creation of data transformation scripts from high-level specifications. This FAQ addresses how it works, its applications, and its role in modern data engineering and agentic systems.
Program synthesis for automated data wrangling is the application of automatic code generation techniques to produce scripts or queries that clean, transform, and integrate raw, unstructured data into an analysis-ready format. It translates a user's intent, expressed through examples, constraints, or natural language, into executable code such as SQL queries, Pandas scripts, or regular expressions. For instance, a user could demonstrate a few examples of how to split a 'Full Name' column into 'First' and 'Last' columns, and the synthesizer would generate a generalized Pandas function to apply the transformation to the entire dataset.
This process is a core component of agentic cognitive architectures, enabling autonomous AI agents to perform complex data preparation tasks without manual coding. By treating data transformation as a program synthesis problem, systems can generate precise, reusable, and verifiable data pipelines.
Related Terms
Program synthesis for data wrangling intersects with several key areas of automated software engineering and data processing. These related concepts define the broader ecosystem of tools and techniques for generating executable data transformation logic.
Data Transformation
The broader process of converting data from one format or structure into another. Program synthesis automates the discovery of the transformation logic itself.
- Common tasks: Cleaning (handling missing values, standardizing formats), reshaping (pivoting, melting), enriching (joining with other datasets), and normalizing.
- Synthesis vs. Manual Coding: Synthesis infers the transformation rule from a specification, whereas manual coding requires the user to explicitly write the rule.
Inductive Synthesis
A general synthesis approach that infers a general rule (the program) from specific observations (examples or traces). Counterexample-Guided Inductive Synthesis (CEGIS) is a powerful algorithm in this family.
- CEGIS Loop: 1) Synthesize a candidate program from examples. 2) Verify it against a formal spec. 3) If verification fails, a counterexample is added to the example set, and the loop repeats.
- Application: Used to synthesize data parsers and cleaners with formal correctness guarantees.
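The CEGIS loop described above can be sketched as follows. This is a simplified illustration: the candidate list, the specification, and the finite `DOMAIN` are assumptions, and verification is done by exhaustively checking that bounded domain rather than by a formal verifier.

```python
# Assumed candidate space, ordered from simplest to most specific.
CANDIDATES = [
    ("identity", lambda s: s),
    ("strip", lambda s: s.strip()),
    ("lower", lambda s: s.lower()),
    ("strip_lower", lambda s: s.strip().lower()),
]

# Assumed bounded input domain used by the verifier.
DOMAIN = ["ok", " pad ", "MiXed", " Both "]


def spec(inp, out):
    """Output is trimmed, lowercase, and equal to the input ignoring case and padding."""
    return (
        out == out.strip()
        and out == out.lower()
        and out.upper() == inp.strip().upper()
    )


def synthesize(examples):
    """Step 1: first candidate that meets the spec on all collected examples."""
    for name, func in CANDIDATES:
        if all(spec(x, func(x)) for x in examples):
            return name, func
    return None


def verify(func):
    """Step 2: return a counterexample from DOMAIN, or None if the spec holds."""
    for x in DOMAIN:
        if not spec(x, func(x)):
            return x
    return None


def cegis():
    """Step 3: feed counterexamples back until a candidate verifies."""
    examples, trace = [], []
    while True:
        candidate = synthesize(examples)
        if candidate is None:
            return None, trace
        name, func = candidate
        counterexample = verify(func)
        trace.append((name, counterexample))
        if counterexample is None:
            return name, trace
        examples.append(counterexample)


result, trace = cegis()
```

Each counterexample eliminates the candidate that produced it, so the loop terminates once a candidate verifies or the candidate list is exhausted.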

About the author
Prasad Kumkar
CEO & MD, Inference Systems
Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. Over 5+ years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.
His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.