AI & Developer Tools · December 2023 · 9 min read

Parsing 12,000 Lines of SAP ABAP with AST Extraction

Building a production-grade parser for legacy enterprise code to feed structured context into AI-powered documentation tools.

The Problem

A large enterprise wanted to modernize their SAP ecosystem. Before any migration could begin, they needed to understand what their existing ABAP codebase actually did. 12,000+ lines of business logic accumulated over 15 years, with sparse documentation and the original developers long gone.

Their ask: can we use AI to generate documentation?

The challenge: AI models need structured context. Dumping raw ABAP code into GPT-4 produces superficial summaries. We needed to extract meaningful program structure first.

Why AST Extraction?

Abstract Syntax Trees (ASTs) represent code as structured data. Instead of:

IF lv_status = 'ACTIVE'.
  PERFORM calculate_discount USING lv_amount.
ENDIF.

We get:

{
  "type": "IfStatement",
  "condition": {
    "type": "Comparison",
    "left": {"type": "Variable", "name": "lv_status"},
    "operator": "=",
    "right": {"type": "Literal", "value": "ACTIVE"}
  },
  "body": [{
    "type": "PerformCall",
    "subroutine": "calculate_discount",
    "parameters": [{"type": "Variable", "name": "lv_amount"}]
  }]
}

This structured representation allows us to:

  • Identify data flows
  • Map dependencies
  • Extract business rules
  • Generate accurate documentation
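Each of these capabilities reduces to a plain tree walk over the JSON structure. As a minimal sketch (a hypothetical helper, not the project's actual code), here is a traversal that collects every IF condition from an AST in the shape shown above, with each condition a candidate business rule:

```python
def collect_conditions(node, found=None):
    """Recursively gather the condition of every IfStatement node --
    candidate business rules for documentation."""
    if found is None:
        found = []
    if isinstance(node, dict):
        if node.get("type") == "IfStatement":
            found.append(node["condition"])
        for value in node.values():
            collect_conditions(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_conditions(item, found)
    return found
```

The same traversal pattern generalizes to data flows and dependencies by matching other node types instead of `IfStatement`.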

Building the Parser

ABAP is... unique. It predates modern language design conventions and has accumulated decades of syntax variations.

Lexical Analysis

First, we tokenize the raw code:

from typing import List

class ABAPLexer:
    KEYWORDS = {'IF', 'ENDIF', 'LOOP', 'ENDLOOP', 'PERFORM', ...}

    def tokenize(self, source: str) -> List[Token]:
        tokens = []
        position = 0

        while position < len(source):
            # Handle string literals
            if source[position] == "'":
                token, position = self.read_string(source, position)
                tokens.append(token)
            # Handle identifiers and keywords
            elif source[position].isalpha():
                token, position = self.read_identifier(source, position)
                tokens.append(token)
            # ... other token types
            else:
                # Skip whitespace and anything not handled above,
                # so an unrecognized character can't loop forever
                position += 1

        return tokens

Handling ABAP Quirks

ABAP has several parsing challenges:

  1. Case insensitivity: IF, if, and If are equivalent
  2. Chained statements: WRITE: a, b, c. expands to three statements
  3. Macros: Inline code generation that must be expanded
  4. Dynamic calls: PERFORM (lv_name) where the subroutine is runtime-determined

We handled each with specific strategies:

def parse_perform_statement(self) -> PerformNode:
    self.expect('PERFORM')

    # Check for dynamic call
    if self.current_token.type == 'LPAREN':
        return self.parse_dynamic_perform()

    subroutine_name = self.expect_identifier()

    parameters = []
    if self.match('USING'):
        parameters = self.parse_parameter_list()

    return PerformNode(
        subroutine=subroutine_name,
        parameters=parameters,
        is_dynamic=False
    )

Dependency Mapping

With the AST, we built a dependency graph:

class DependencyAnalyzer:
    def analyze(self, ast: ProgramNode) -> DependencyGraph:
        graph = DependencyGraph()

        for subroutine in ast.subroutines:
            # Add node for each subroutine
            graph.add_node(subroutine.name)

            # Find all PERFORM calls within
            for call in self.find_perform_calls(subroutine.body):
                graph.add_edge(subroutine.name, call.target)

        return graph

This revealed the true structure of the codebase:

  • 847 subroutines
  • 2,340 dependencies
  • 12 isolated components (dead code)
  • 3 circular dependency clusters
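The dead-code figure falls directly out of the graph: any subroutine unreachable from a program entry point is a candidate. A minimal sketch of that reachability pass, assuming the graph is stored as an adjacency dict and the entry points are known (both names are illustrative, not from the project code):

```python
def find_dead_code(calls: dict[str, list[str]], entry_points: list[str]) -> set[str]:
    """Return subroutines unreachable from any entry point --
    candidates for exclusion from the migration."""
    reachable: set[str] = set()
    stack = list(entry_points)
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(calls.get(name, []))  # Follow outgoing PERFORM edges
    return set(calls) - reachable
```

The circular dependency clusters come from the same graph via strongly connected components (e.g. Tarjan's algorithm): any component with more than one node is a cycle.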

AI Documentation Generation

With structured AST data, we could prompt AI models effectively:

def generate_documentation(subroutine: SubroutineNode) -> str:
    context = {
        "name": subroutine.name,
        "parameters": [p.to_dict() for p in subroutine.parameters],
        "calls": [c.target for c in find_perform_calls(subroutine.body)],
        "tables_accessed": find_table_accesses(subroutine.body),
        "business_rules": extract_conditions(subroutine.body)
    }

    prompt = f"""
    Document this ABAP subroutine based on its structure:

    Name: {context['name']}
    Parameters: {context['parameters']}
    Calls: {context['calls']}
    Database tables: {context['tables_accessed']}
    Business rules: {context['business_rules']}

    Provide:
    1. A one-sentence summary
    2. Parameter descriptions
    3. Business logic explanation
    4. Side effects and dependencies
    """

    return call_llm(prompt)

Results

The project delivered:

  • Full AST extraction for 12,000+ lines of ABAP
  • Dependency visualization revealing true program structure
  • Generated documentation for 847 subroutines
  • Dead code identification saving migration effort

Documentation quality was validated by SAP consultants familiar with the domain—87% accuracy on business logic descriptions.

Lessons Learned

  1. Legacy code has patterns. Even messy codebases have internal consistency. Find and exploit it.

  2. ASTs unlock capabilities. Structured representation enables analysis that's impossible with raw text.

  3. AI needs context. The quality of AI output directly correlates with the quality of input structure.

  4. Test against reality. We validated against actual ABAP behavior, not just syntax specifications.


Modernizing legacy systems starts with understanding them. AST extraction transforms code archaeology from guesswork into systematic analysis.


TAKKA LABS

Engineering Team
