AI Integration for Informatica Data Catalog

A technical blueprint for data governance teams to augment Informatica Enterprise Data Catalog (EDC) with LLMs, automating metadata enrichment, glossary curation, and compliance tagging.
ARCHITECTURE BLUEPRINT

Where AI Fits into Informatica EDC

A practical guide to embedding AI agents and LLMs into Informatica Enterprise Data Catalog (EDC) workflows for automated metadata enrichment, governance, and discovery.

AI integration for Informatica EDC focuses on three core surfaces: the metadata ingestion pipeline, the business glossary, and the data lineage graph. Instead of replacing EDC, AI agents act as co-pilots that listen for new asset discovery events, analyze technical metadata and sample data, and then propose enrichments via EDC's REST API or CLI. This allows teams to automate the generation of column descriptions, PII classification tags, suggested business terms, and data quality rule recommendations as assets are cataloged, turning a passive inventory into an intelligent, self-documenting system.

Implementation typically involves a lightweight service that reacts to new-asset events, for example an ASSET_DISCOVERED notification where your deployment exposes one, or a poll of the REST API after each scan. For each new table or file, the service uses an LLM to analyze the asset's name, column names, data types, and a statistical sample. It then generates structured proposals: a technical summary for the asset description, confidence-scored PII classifications (like PII.Email), and suggested mappings to existing glossary terms. These proposals are posted back to EDC as draft suggestions that require steward approval via EDC's UI or a separate workflow tool, which keeps a human in the loop. This pattern keeps the catalog's authority intact while shortening the time it takes to populate and correct it.
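The structured proposals described above could be modeled as follows. This is a minimal sketch: the class names, the JSON field names (`description`, `pii`, `glossary_terms`), and the `DRAFT` status are assumptions made for illustration, not part of EDC's data model.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PiiSuggestion:
    column: str
    label: str          # e.g. "PII.Email"
    confidence: float   # 0.0 to 1.0, as reported by the model


@dataclass
class EnrichmentProposal:
    asset_id: str
    description: str
    pii: list[PiiSuggestion] = field(default_factory=list)
    glossary_terms: list[str] = field(default_factory=list)
    status: str = "DRAFT"  # proposals start as drafts awaiting steward approval


def parse_proposal(asset_id: str, llm_json: str) -> EnrichmentProposal:
    """Turn the model's JSON reply into a typed draft proposal."""
    raw = json.loads(llm_json)
    pii = [
        PiiSuggestion(p["column"], p["label"], float(p["confidence"]))
        for p in raw.get("pii", [])
        if 0.0 <= float(p["confidence"]) <= 1.0  # drop out-of-range scores
    ]
    return EnrichmentProposal(
        asset_id=asset_id,
        description=raw.get("description", "").strip(),
        pii=pii,
        glossary_terms=list(raw.get("glossary_terms", [])),
    )
```

Keeping the proposal typed makes it easier to reject malformed model output before anything reaches the catalog.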

For rollout, start with a pilot on a single, high-value data domain—such as customer or product data. Configure the AI service to only propose enrichments for assets tagged with that domain. Use EDC's custom attributes and workflow capabilities to create an approval queue for AI suggestions. This controlled approach mitigates risk, allows for tuning of prompts and confidence thresholds, and builds trust with data stewards. Over time, the system can be expanded to automate more complex tasks, like using the enriched lineage graph to generate impact analysis reports in natural language or to identify undocumented data dependencies for migration projects. For a deeper dive on governing these automated workflows, see our guide on AI Integration for Informatica Data Governance.
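Scoping the pilot to one domain can be a simple filter in the AI service. The sketch below assumes each asset is a dictionary with a hypothetical `domain_tags` list; how domain tags are actually stored depends on your EDC configuration.

```python
PILOT_DOMAINS = {"customer"}  # hypothetical tag values for the pilot domain


def in_pilot_scope(asset: dict, pilot_domains: set = PILOT_DOMAINS) -> bool:
    """Return True when the asset carries at least one pilot domain tag."""
    tags = {t.strip().lower() for t in asset.get("domain_tags", [])}
    return bool(tags & pilot_domains)


def select_for_enrichment(assets: list) -> list:
    """Keep only assets that belong to the pilot domain."""
    return [a for a in assets if in_pilot_scope(a)]
```

Widening the rollout later then becomes a configuration change (adding tags to `PILOT_DOMAINS`) rather than a code change.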

WHERE LLMS CAN ENRICH METADATA AND GOVERNANCE

Key Integration Surfaces in Informatica EDC

Automating Technical Metadata Generation

LLMs can be integrated into Informatica EDC's discovery and profiling workflows to generate human-readable summaries of data assets. When EDC scans new databases, files, or APIs, an AI agent can process the raw schema information—table names, column data types, and sample values—to produce concise descriptions.

Example Workflow:

  1. EDC's discovery job completes, registering new assets.
  2. A webhook triggers an AI service with the asset's technical metadata.
  3. The LLM generates a suggested business_description for the table and column_definition for key fields.
  4. The enriched metadata is posted back through EDC's REST API (shown here as /api/v2/catalog/assets/{id}; confirm the exact path against the API reference for your EDC version) for steward review or auto-approval.

This reduces the time data stewards spend on manual documentation, accelerating catalog usability.
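Steps 2 and 3 of the workflow above hinge on turning raw schema information into a prompt. A minimal sketch of that step follows; the function name and the shape of the `columns` input are assumptions, while `business_description` and `column_definition` are the field names used in the workflow.

```python
def build_description_prompt(table: str, columns: list, max_samples: int = 3) -> str:
    """Assemble a prompt from table and column metadata, capping sample values."""
    lines = [
        "Write a one-sentence business_description for the table and a",
        "column_definition for each column. Respond in JSON.",
        f"Table: {table}",
        "Columns:",
    ]
    for col in columns:
        # Only a few sample values are included to keep the prompt small.
        samples = ", ".join(str(v) for v in col.get("samples", [])[:max_samples])
        lines.append(f"- {col['name']} ({col['type']}) samples: {samples or 'n/a'}")
    return "\n".join(lines)
```

Capping the number of samples keeps token usage predictable when a discovery job registers hundreds of wide tables at once.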

AUTOMATE METADATA MANAGEMENT

High-Value AI Use Cases for EDC

Transform Informatica Enterprise Data Catalog from a passive inventory into an intelligent, self-enriching system. These AI integration patterns automate the most manual and high-value metadata workflows.

01

Automated Technical Metadata Summaries

Use LLMs to read table DDL, column names, and sample data, then generate plain-English descriptions for assets in the catalog. Workflow: Trigger on asset discovery or update, call LLM API with schema context, write summary back to EDC via REST API. Value: Eliminates manual documentation backlog, making the catalog instantly useful for new data consumers.

Weeks -> Hours
Documentation backlog
02

Business Glossary Term Suggestion

Analyze column names, data profiles, and existing glossary to suggest new terms and map assets automatically. Workflow: AI reviews unclassified columns, proposes term definitions and relationships, presents to stewards for approval in EDC UI. Value: Accelerates governance programs and improves term coverage without exhaustive manual review.

80%+
Reduction in manual mapping
03

PII and Sensitive Data Identification

Augment pattern-based scanners with LLM context to identify non-standard PII, sensitive narratives in comment fields, and inferred data classes. Workflow: AI analyzes sample data and metadata, assigns confidence-scored classifications, and tags assets in EDC for policy enforcement. Value: Closes compliance gaps missed by regex rules, especially in unstructured or free-text fields.

Batch -> Continuous
Classification scan
04

Data Lineage Gap Analysis & Enrichment

Use AI to infer missing lineage links by analyzing job names, SQL logs, and data movement patterns, suggesting connections for steward review. Workflow: AI processes EDC lineage graphs and operational metadata, proposes probable missing edges, integrates approved links. Value: Creates more complete, trustworthy lineage for impact analysis and regulatory reporting.

1 sprint
To close critical gaps
05

Natural Language Catalog Search & Q&A

Deploy a RAG-powered agent over EDC's metadata, allowing users to ask 'What tables contain customer revenue for the EU region?' in plain language. Workflow: Vectorize EDC metadata, embed in a vector DB, use LLM to interpret query and retrieve relevant assets. Value: Drastically reduces time for data discovery, especially for non-technical business users.

Minutes -> Seconds
Data discovery time
06

Stewardship Workflow Automation

Automate ticket creation, assignment, and escalation for data quality issues, term approval requests, and certification workflows detected by AI. Workflow: AI monitors data quality scores and user requests, creates and routes tasks in EDC, follows up via email integration. Value: Ensures governance processes are executed, not just documented, improving data trust.

Same day
Issue resolution SLA
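
The retrieval half of the natural language search use case (05) can be prototyped before choosing an embedding model or vector database. In the sketch below, token-count vectors and cosine similarity are a deliberate stand-in for real embeddings, and the two catalog entries are invented sample data.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, index: dict, k: int = 3) -> list:
    """Return the ids of the k assets whose metadata best matches the query."""
    q = vectorize(query)
    ranked = sorted(index, key=lambda asset_id: cosine(q, index[asset_id]), reverse=True)
    return ranked[:k]


# Invented sample metadata; in practice this text would come from the catalog.
catalog = {
    "sales.eu_customer_revenue": "Monthly revenue per customer for the EU region",
    "hr.employee_directory": "Employee names, departments and office locations",
}
index = {asset_id: vectorize(desc) for asset_id, desc in catalog.items()}
```

In a full RAG setup the retrieved asset descriptions would then be passed to the LLM as context for answering the user's question.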
IMPLEMENTATION PATTERNS

Example AI-Augmented Workflows

These workflows illustrate how LLM-powered agents can automate high-effort, manual tasks within Informatica Enterprise Data Catalog (EDC), turning passive metadata into active intelligence. Each pattern connects to EDC's APIs and data model to drive measurable efficiency gains.

Trigger: A new data asset (table, file, API endpoint) is discovered and ingested into EDC.

Context Pulled: The agent retrieves the asset's raw metadata from EDC's REST API: object name, column names, data types, and sample data (if profiling is enabled).

Agent Action: An LLM analyzes the column names, data types, and sample values to generate:

  • A concise, business-friendly description of the asset's purpose.
  • Descriptive, plain-language explanations for each column (e.g., cust_id → "Unique identifier for the customer record, used as the primary key in the CRM system").
  • Inferred data classifications (e.g., PII, Financial, Operational).

System Update: The agent uses EDC's API to write the generated descriptions and suggested classifications back to the asset's metadata properties. It can also create or link to suggested business glossary terms.

Human Review Point: Suggested PII classifications and new glossary terms are placed in a stewardship queue within EDC for a data owner to review and approve before final application.
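
The split between the system update and the human review point can be made explicit in code. This sketch assumes a hypothetical agent output format (`description`, `column_descriptions`, `classifications`, `new_glossary_terms`); it routes PII classifications and new glossary terms to a review queue and everything else to direct updates.

```python
def route_agent_output(asset_id: str, output: dict) -> tuple:
    """Split agent output into direct updates and items that need steward review."""
    direct_updates = []
    review_queue = []

    if output.get("description"):
        direct_updates.append(
            {"asset_id": asset_id, "field": "description", "value": output["description"]}
        )
    for column, text in output.get("column_descriptions", {}).items():
        direct_updates.append(
            {"asset_id": asset_id, "field": f"column:{column}", "value": text}
        )
    for item in output.get("classifications", []):
        entry = {"asset_id": asset_id, "kind": "classification", **item}
        # PII labels wait for a data owner; other classes are written directly.
        if item["label"].startswith("PII"):
            review_queue.append(entry)
        else:
            direct_updates.append(entry)
    for term in output.get("new_glossary_terms", []):
        review_queue.append({"asset_id": asset_id, "kind": "glossary_term", "term": term})

    return direct_updates, review_queue
```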

ENRICHING ENTERPRISE DATA CATALOGS

Implementation Architecture & Data Flow

A practical blueprint for integrating LLMs with Informatica Enterprise Data Catalog (EDC) to automate metadata enrichment and governance workflows.

The integration connects to Informatica EDC's REST API and metadata database to process discovered assets. A typical flow begins by extracting technical metadata—table names, column definitions, and data lineage—for assets lacking business context. This raw metadata is sent to an LLM service (like Azure OpenAI or Anthropic) via a secure, queued API layer. The LLM is prompted to generate human-readable summaries, suggest business glossary terms, and identify potential PII patterns based on column names, sample values, and existing catalog classifications.

Generated enrichments are returned to a governance service that applies validation rules and, if configured, routes suggestions via webhook to designated data stewards in Informatica Axon for approval. Approved metadata is then written back to EDC, updating asset descriptions and custom properties and tagging PII columns with the appropriate classifications. This creates a closed loop: steward approvals and rejections feed back into prompt refinement and, where warranted, fine-tuning on your specific data landscape.

For rollout, we recommend starting with a pilot on a single business domain or data source. Implement the integration as a scheduled batch job (e.g., nightly) to avoid impacting EDC performance. Key governance controls include: logging all AI-generated content with a source:ai_enrichment tag, maintaining an audit trail of changes, and setting confidence thresholds for auto-application versus steward review. This architecture ensures the catalog becomes a living, AI-augmented system of record, dramatically reducing the manual effort of curating technical metadata at scale.
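
The governance controls above (the source tag and the confidence thresholds) could be sketched like this. The threshold values and the payload layout are assumptions to be tuned during the pilot; only the `source:ai_enrichment` tag comes from the text.

```python
AUTO_APPLY_THRESHOLD = 0.90  # assumed value; tune during the pilot
REVIEW_THRESHOLD = 0.50      # assumed value; below this the suggestion is dropped


def decide(confidence: float) -> str:
    """Map a confidence score to auto-application, steward review, or discard."""
    if confidence >= AUTO_APPLY_THRESHOLD:
        return "auto_apply"
    if confidence >= REVIEW_THRESHOLD:
        return "steward_review"
    return "discard"


def tag_enrichment(payload: dict, confidence: float) -> dict:
    """Attach the AI source tag, the score, and the routing decision to a payload."""
    return {
        **payload,
        "tags": [*payload.get("tags", []), "source:ai_enrichment"],
        "confidence": confidence,
        "decision": decide(confidence),
    }
```

Storing the decision next to the score means the audit trail records why a given enrichment was or was not applied automatically.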

AUTOMATING METADATA ENRICHMENT

Code & Payload Examples

Automating Business Term Creation

Use LLMs to analyze technical column names and sample data from Informatica EDC, generating candidate business terms and definitions. This automates the initial population of the business glossary, which stewards can then review and approve.

A common pattern is to trigger this enrichment after a new data source is profiled. The payload to the LLM includes the asset name, column metadata, and a few sample values for context. The response is formatted to create or update glossary objects via the Informatica EDC REST API.

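```python
# Example: generate a business term suggestion for a discovered column.
# NOTE: the EDC endpoint path and get_edc_column_metadata() are illustrative;
# verify both against the REST API reference for your EDC version.
import json
import os

import requests
from openai import OpenAI  # openai>=1.0 client interface

EDC_BASE_URL = os.environ["EDC_BASE_URL"]
EDC_AUTH = (os.environ["EDC_USER"], os.environ["EDC_PASSWORD"])  # assumes basic auth
client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fetch column metadata from the Informatica EDC API (a helper you implement)
column_metadata = get_edc_column_metadata(connection_id="conn_123", column_name="cust_acct_num")

prompt = f"""
Given this database column metadata, suggest a business glossary term and definition.
Column Name: {column_metadata['name']}
Data Type: {column_metadata['type']}
Sample Values: {column_metadata['samples']}

Respond in JSON: {{"term": "suggested term", "definition": "clear definition"}}
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
suggestion = json.loads(response.choices[0].message.content)

# Post the suggestion to EDC as a draft for steward review
result = requests.post(
    f"{EDC_BASE_URL}/api/v2/glossary/terms/draft",
    json={
        "name": suggestion["term"],
        "definition": suggestion["definition"],
        "associatedAssets": [column_metadata["id"]],
    },
    auth=EDC_AUTH,
    timeout=30,
)
result.raise_for_status()  # surface failed writes instead of ignoring them
```
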
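
Model replies do not always contain bare JSON, so the parsing step in the example above benefits from a defensive wrapper. The helper below is a sketch with a hypothetical name; it extracts the first JSON object from the reply and checks for the two keys the prompt asks for.

```python
import json
import re


def parse_term_suggestion(reply: str) -> dict:
    """Extract and validate the {"term", "definition"} object from a model reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate text around the JSON
    if not match:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    missing = {"term", "definition"} - data.keys()
    if missing:
        raise ValueError(f"model reply is missing keys: {sorted(missing)}")
    return {
        "term": str(data["term"]).strip(),
        "definition": str(data["definition"]).strip(),
    }
```

Raising `ValueError` for every failure mode lets the caller skip the asset and retry later without posting a broken draft term.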
AI-ENRICHED DATA CATALOG OPERATIONS

Realistic Time Savings & Operational Impact

How LLM integration transforms manual metadata management and data discovery workflows within Informatica Enterprise Data Catalog (EDC).

| Data Catalog Task | Manual Process | AI-Assisted Process | Implementation Notes |
| --- | --- | --- | --- |
| Technical Metadata Summarization | Analyst writes 5-10 minute descriptions per asset | LLM generates draft summaries in seconds | Human curator reviews & refines; scales to 1000s of assets |
| Business Glossary Term Suggestion | Stewards manually review data to propose terms | AI scans column names & sample data to suggest candidate terms | Stewards approve, reject, or modify; reduces initial review by 60-70% |
| PII & Sensitive Data Identification | Rule-based scans plus manual sampling for context | LLM analyzes unstructured comments & data patterns to flag potential PII | Combines with existing scanners; catches context-dependent sensitive data |
| Data Lineage Documentation | Manual interviews and spreadsheet mapping for business logic | AI parses SQL & ETL code to infer and draft business-friendly lineage notes | Accelerates initial documentation; data architect validates connections |
| Stale Asset Identification & Triage | Periodic manual reports on last access date | AI analyzes usage patterns, lineage, and project changes to recommend archival candidates | Focuses steward effort on high-impact cleanup decisions |
| Cross-System Relationship Discovery | Manual comparison of schemas across source systems | LLM suggests potential foreign key & semantic relationships based on naming and data patterns | Generates hypotheses for stewards to validate, speeding up integration projects |
| Data Quality Rule Generation | Stewards manually profile data to define rules | AI profiles sample data and suggests statistical & pattern-based quality rules | Stewards select and tune rules; jumpstarts quality program setup |

ENTERPRISE DATA GOVERNANCE

Governance, Security, and Phased Rollout

Integrating AI with Informatica EDC requires a controlled approach to ensure metadata quality, security, and user trust.

A production integration typically uses a dedicated service account with role-based access to the Informatica EDC REST API and a secure, isolated environment for the LLM. The AI agent acts as a suggestor, not an auto-applier. All generated metadata—technical summaries, glossary term suggestions, or PII classifications—should be written to a staging table or a dedicated "AI Suggestions" custom object within EDC. This creates a clear audit trail and requires a data steward's review and approval before promotion to production metadata fields, ensuring human oversight and maintaining data governance integrity.
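
The suggest-then-approve lifecycle could be represented with a small staged record. This is a sketch under stated assumptions: the class, its fields, and the status values are hypothetical and would map onto whatever staging table or custom object you define.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StagedSuggestion:
    asset_id: str
    field_name: str
    value: str
    status: str = "PENDING"            # PENDING -> APPROVED or REJECTED
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[str] = None


def review(suggestion: StagedSuggestion, steward: str, approve: bool) -> StagedSuggestion:
    """Record a steward decision; a suggestion can only be reviewed once."""
    if suggestion.status != "PENDING":
        raise ValueError(f"suggestion already {suggestion.status}")
    suggestion.status = "APPROVED" if approve else "REJECTED"
    suggestion.reviewed_by = steward
    suggestion.reviewed_at = datetime.now(timezone.utc).isoformat()
    return suggestion


def promotable(suggestions: list) -> list:
    """Only approved suggestions may be promoted to production metadata fields."""
    return [s for s in suggestions if s.status == "APPROVED"]
```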

Start with a pilot on a single, well-understood data domain, such as a Customer or Product subject area. Focus the AI on a single high-value task, like generating column descriptions for newly discovered tables. This limits scope, allows for quality benchmarking against manual efforts, and builds stakeholder confidence. Subsequent phases can expand to business glossary term suggestion using approved terminology, and finally to sensitive data identification across broader asset inventories. Each phase should include a feedback loop where steward approvals and rejections are used to fine-tune the AI's prompts and improve suggestion relevance.
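
One simple way to close the feedback loop is to track the acceptance rate per prompt version. The sketch below assumes each review is logged as a dictionary with hypothetical `prompt_version` and `action` keys, where the action is one of accept, modify, or reject.

```python
from collections import defaultdict


def acceptance_by_prompt(reviews: list) -> dict:
    """Share of suggestions accepted unchanged, grouped by prompt version."""
    counts = defaultdict(lambda: {"accept": 0, "modify": 0, "reject": 0})
    for r in reviews:
        counts[r["prompt_version"]][r["action"]] += 1
    return {
        version: c["accept"] / sum(c.values())
        for version, c in counts.items()
    }
```

Comparing these rates across prompt versions gives a concrete signal for whether a prompt change helped before it is rolled out to the next phase.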

Governance is paramount. All AI interactions should be logged, including the source asset ID, the prompt used, the raw LLM output, and the steward's final action (accept, modify, reject). This traceability is critical for compliance audits and for continuously improving the system. Furthermore, avoid sending raw business data to the LLM: pass technical metadata (column names, data types) and existing business glossary context, and mask or hash any sample values from profiling before an external API call. For PII detection, use pattern matching locally where possible and reserve the LLM for ambiguous cases.
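
The local-first PII check and the masking step could look like the sketch below. The email pattern, the function names, and the payload layout are assumptions; shape masking hides characters but still reveals length and punctuation, so treat it as a starting point rather than a complete anonymization scheme.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple pattern


def looks_like_email(values: list) -> bool:
    """Local pattern match: True when every sample value resembles an email."""
    return bool(values) and all(EMAIL_RE.match(v) for v in values)


def shape_mask(value: str) -> str:
    """Replace letters with 'a' and digits with '9', keeping punctuation."""
    return "".join("a" if c.isalpha() else "9" if c.isdigit() else c for c in value)


def hash_token(value: str, salt: str) -> str:
    """Salted SHA-256 token, for cases where only value equality matters."""
    return "hash:" + hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def prepare_column(name: str, dtype: str, samples: list) -> dict:
    """Classify locally when possible; otherwise send only masked samples."""
    values = [str(v) for v in samples]
    if looks_like_email(values):
        return {"name": name, "type": dtype, "local_label": "PII.Email",
                "needs_llm": False, "samples": []}
    return {"name": name, "type": dtype, "local_label": None,
            "needs_llm": True, "samples": [shape_mask(v) for v in values]}
```

Columns that are resolved locally never leave the environment, which keeps the external call volume and the exposure surface small.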

AI INTEGRATION FOR INFORMATICA DATA CATALOG

Frequently Asked Questions

Practical answers for data governance teams planning to augment Informatica Enterprise Data Catalog (EDC) with generative AI for metadata enrichment, glossary management, and compliance automation.

AI integration typically connects via Informatica's RESTful APIs and leverages its extensible metadata model. Key touchpoints include:

  • Asset API: To fetch discovered technical metadata (tables, columns, files, reports) for AI processing.
  • Glossary API: To create, update, or suggest business terms and categories.
  • Lineage API: To read and potentially enhance data flow relationships with business context.
  • Custom Properties: To write AI-generated summaries, PII classifications, or confidence scores back to assets as extended attributes.

A common pattern uses a middleware service (like a Python app) that:

  1. Polls EDC for newly discovered or updated assets.
  2. Sends asset metadata (e.g., column names, sample data, data types) to an LLM via a secure API call.
  3. Processes the LLM's response (e.g., a business description, PII flag).
  4. Writes the enriched metadata back to EDC via the Asset API.

This keeps the AI logic decoupled from the core EDC application, allowing for controlled rollouts and easy updates to prompts or models.
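
The four middleware steps can be expressed as a single cycle with the EDC and LLM calls injected as functions. This is a sketch: `fetch_new_assets`, `call_llm`, and `write_back` are hypothetical callables you would implement against your EDC instance and model provider.

```python
def run_enrichment_cycle(fetch_new_assets, call_llm, write_back, seen: set) -> int:
    """Run one poll cycle and return the number of assets enriched."""
    processed = 0
    for asset in fetch_new_assets():           # step 1: poll for new assets
        if asset["id"] in seen:
            continue                           # already handled in an earlier cycle
        try:
            enrichment = call_llm(asset)       # steps 2 and 3: call and parse
            write_back(asset["id"], enrichment)  # step 4: write to the catalog
        except Exception as exc:
            # One failing asset should not stop the rest of the batch.
            print(f"skipping {asset['id']}: {exc}")
            continue
        seen.add(asset["id"])
        processed += 1
    return processed
```

Because the external calls are passed in, the loop can be tested with stubs and the prompt or model can be swapped without touching the orchestration code.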

About the author

Prasad Kumkar

CEO & MD, Inference Systems

Prasad Kumkar is the CEO & MD of Inference Systems and writes about AI systems architecture, LLM infrastructure, model serving, evaluation, and production deployment. For more than five years, he has worked across computer vision models, L5 autonomous vehicle systems, and LLM research, with a focus on taking complex AI ideas into real-world engineering systems.

His work and writing cover AI systems, large language models, AI agents, multimodal systems, autonomous systems, inference optimization, RAG, evaluation, and production AI engineering.