Regular Expression Matching: A Dynamic Programming Walkthrough
- 1 day ago
- 12 min read
Regular Expression Matching is one of the hardest "easy-to-explain" dynamic programming problems in the standard interview rotation. The problem statement fits in three sentences, but the implementation rewards a very specific kind of thinking: can you take a vague, recursive-feeling specification and turn it into a precise DP table with rigorously-defined transitions? Candidates who try to write the matcher imperatively tend to drown in nested conditionals. Candidates who reach for backtracking burn time on edge cases. Candidates who slow down, define dp[i][j] cleanly, and enumerate the transitions one at a time make it look easy. That last skill - turning a messy spec into a clean state - is the actual signal the interviewer is testing.
The technique behind this problem is the same one that powers production regex engines, lexical analyzers in compilers, DNA pattern matching in bioinformatics, log-parsing rule systems, and any text-processing pipeline that needs to match a pattern with wildcards. The 2D DP table you're about to build is essentially a hand-rolled version of the NFA-to-table conversion that real regex implementations do under the hood.
Problem Statement
Given an input string s and a pattern p, implement regular expression matching with support for:
. which matches any single character
* which matches zero or more of the preceding element
The matching should cover the entire input string, not partial matches.
Return true if the string matches the pattern, otherwise false.
Example:
input: s = "aab", p = "c*a*b"
answer: true (the c* matches zero cs, a* matches two as, and b matches b).
1. Clarify Requirements Before Jumping Into Code
To start, let's restate the problem in our own words and ask the kinds of questions that would change our approach.
Input: a string s and a pattern p.
Output: true if the pattern matches the entire string, false otherwise.
Details to clarify:
Does the match have to cover the entire string? Yes, partial matches don't count.
Is the pattern guaranteed to be valid? Assume yes
Does * apply to the single character before it, or to a group? Single character only. There's no grouping syntax in this version.
Can * match zero of the preceding character? Yes, that's the critical case that breaks naive implementations.
Any other regex features (+, ?, character classes, anchors)? No, just . and *.
The thing to flag is that * means zero or more, not one or more. Patterns like a * matching the empty string, or a*b*c* matching empty, are going to be the bug sources. A solution that handles "match exactly one" cleanly will still fail on "match zero" if we're not careful.
2. Identify the Category of the Question
A few signals jump out:
We're checking whether two sequences are compatible under a set of rules.
The decision at each position depends on a small local context - the current character in the string and the current (and sometimes next) character in the pattern.
The subproblems overlap: a prefix-vs-prefix question reappears many times across different match attempts.
That combination of two strings, local decisions, and overlapping subproblems is the fingerprint of 2D string DP. The state is almost always dp[i][j] over prefixes of the two inputs, and the transitions involve a small number of cases at the last character of each prefix. This is the same family as Edit Distance, Longest Common Subsequence, and Wildcard Matching. The twist that makes Regex Matching trickier than its cousins is the * operator, which can match a variable number of characters from the string.
3. Brute Force Solution
As always, we start with the brute force. The most direct approach is recursion with backtracking. At each position (i, j) in s and p:
If the next pattern token is followed by *, try matching zero of the preceding character, or try consuming one character of s and stay on the same pattern position (effectively matching one or more).
Otherwise, try matching the single character (or .) and advance both pointers.
match(i, j):
if j == len(p): return i == len(s)
firstMatch = i < len(s) and (p[j] == s[i] or p[j] == '.')
if j + 1 < len(p) and p[j+1] == '*':
return match(i, j + 2) or (firstMatch and match(i + 1, j))
return firstMatch and match(i + 1, j + 1)This is correct, and it's actually a reasonable interview answer if memoized. Without memoization, it's exponential - the * case branches in two directions at each step, and many of those branches re-explore the same (i, j) pairs.
The brute force does teach us one thing worth carrying forward: the entire decision at any point depends only on the current indices i and j. There's no hidden state. That's the property that makes DP work.
4. Brainstorm More Solutions
Step 1: How would we approach this by hand?
Let's think about how we would do this by hand to make sure the rules are clear.
Pattern c*a*b, string aab.
The pattern starts with c*. That means "zero or more c's." The string starts with a, so there's no c to match - we treat c* as matching zero characters and move past it. String is still aab, pattern is now a*b.
Next is a*. The string starts with a, so we can match one or more a's. We greedily consume both a's. String is now b, pattern is b.
Last is b. Direct character match. Both end. ✓
Notice what the * did. It made an implicit choice - match zero, match one, match many. The interesting thing is that which choice was right could only be determined by looking ahead, or by trying both. That's the source of the difficulty.
So the algorithm has to encode this choice somehow. Either we backtrack (the brute force solution), or we use DP to memoize the results of trying both choices.
Step 2: Define the state
Let dp[i][j] be true if the first i characters of s match the first j characters of p, and false otherwise. The answer we want is dp[m][n] where m = s.length() and n = p.length().
Why does this state work as a self-contained subproblem? Because matching only depends on positions, not on the history of how we got there. Two different "paths" through the matching process that both arrive at (i, j) will give the same final answer from that point onward.
The base case is straight-forward: dp[0][0] = true, since an empty string matches an empty pattern.
Step 3: Work out the easy transitions first
Now we need to identify the transition. What rule lets us compute dp[i][j] from smaller subproblems?
The brute force solution gave us a hint. Every match decision was a local decision - at each position, we looked at the current character in s and the current character (and maybe the next one) in p, then advanced or didn't. That locality means the answer to dp[i][j] should depend on what's happening at the end of the prefixes, not on some pattern spread across the whole string.
But "the end" has two sides - the end of s and the end of p. Which side do we anchor on? Let's think about what each one tells us.
If I anchor on s[i-1] - the last character of the string prefix - I learn what literal character needs to be matched, but I don't yet know how the pattern will consume it. The pattern could be trying to match it directly with p[j-1], or it could be in the middle of a * expansion. The string character alone doesn't tell me which rule of the pattern is active.
If I anchor on p[j-1] - the last character of the pattern prefix - I learn which rule is being applied at this position. Is it a literal character? A "."? A "*"? The pattern character determines what kind of transition is happening, and the string character only enters as evidence for whether that transition succeeds.
That's the asymmetry. The pattern dictates the type of transition; the string is just data being consumed. So we anchor on the pattern's last character and split into cases based on what that character is.
Now we can ask the natural question: what are the possible values of p[j-1]?
A regular letter (a through z).
A . (matches any single character).
A * (modifies whatever character came before it).
Letters and . behave the same way for our purposes - both want to consume exactly one character of s. So they collapse into a single case. That leaves us with exactly two cases to handle: p[j-1] is * , or p[j-1] is not *.
The non-* case is the easier one, so let's start there.
To match the prefix s[0..i) against p[0..j), the last character of the string s[i-1] must match the last character of the pattern p[j-1], and the rest must also match. "Match" here means literal equality, or p[j-1] == '.'.
if p[j-1] == s[i-1] or p[j-1] == '.':
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = falseThis is essentially the same transition as Edit Distance's "characters match" case. Clean and straightforward.
Step 4: Work out the * case carefully
Case B: p[j-1] is *
This is where the problem gets interesting. The * doesn't stand alone; it always refers to the character p[j-2] that comes immediately before it. So the pattern token we're considering is really the pair (p[j-2], *), and we need to decide what that pair matches.
The pair (p[j-2], *) can match either:
Zero occurrences of p[j-2]. We're essentially deleting the pair from the pattern. The rest of s[0..i) then has to match p[0..j-2) - that is, the pattern with the (char, *) pair removed.
dp[i][j] = dp[i][j-2]
One or more occurrences of p[j-2]. We consume one character from s (the last one), and the * stays "alive" to potentially match more. So s[0..i-1) has to match p[0..j) - same pattern, one fewer character of s. This is only valid if p[j-2] actually matches s[i-1].
if p[j-2] == s[i-1] or p[j-2] == '.': dp[i][j] = dp[i][j] or dp[i-1][j]
If either option is true, dp[i][j] is true.
Let's pause on the "one or more" case, because it's the one that trips people up. The transition is dp[i][j] = dp[i-1][j] - we move backward one character in s but stay at the same column j in the pattern. That's strange compared to typical DP transitions, which usually decrease both indices. The reason it works: the * represents an unbounded number of repetitions, so consuming one character of s doesn't "use up" the *. It just adds one more repetition to the count. The same (char, *) pair is still available to consume more.
You can think of dp[i-1][j] as asking: "if I've already used this * to match the last character of s, can the rest of s still match the same pattern (which still has the * available)?"
Step 5: Initialize the boundary
Now the tricky base cases. dp[0][0] = true is the easy one. What about dp[0][j] for j > 0 - can an empty string match a non-empty pattern?
Yes - but only in a specific case. Patterns like a*, a*b*, or a*b*c* can all match the empty string, because every * can match zero of its preceding character. But a or ab cannot match the empty string. So dp[0][j] is true only if the pattern is a sequence of (char, *) pairs.
Rather than special-casing this, notice that the recurrence handles it for us. If p[j-1] == '*', then dp[0][j] = dp[0][j-2] (the zero-occurrence branch - the one-or-more branch requires i >= 1 and is irrelevant for an empty string). That naturally propagates: dp[0][0] = true, dp[0][2] = dp[0][0] = true if p[1] == '*', dp[0][4] = dp[0][2] = true if p[3] == '*', and so on.
For dp[i][0] with i > 0 - a non-empty string against an empty pattern - the answer is always false. That's the default value of a boolean array, so we don't need to write anything.
Step 6: Walk through an example to verify
Let's verify the recurrence works with the example s="aab" and p="c*a*b". The DP table looks like this (rows are s, columns are p):
"" c * a * b
"" T F T F T F
a F F F T T F
a F F F F T F
b F F F F F TA few cells worth checking:
dp[0][2] = T: empty string vs c*. * case, zero occurrences: dp[0][0] = T. ✓
dp[0][4] = T: empty string vs c*a*. * case, zero occurrences: dp[0][2] = T. ✓
dp[1][3] = T: "a" vs c*a. Direct character match (a == a), so look at dp[0][2] = T. ✓
dp[2][4] = T: "aa" vs c*a*. * case. Zero occurrences: dp[2][2] = F. One or more (since a == a): dp[1][4] = T. Take the OR: T. ✓
dp[3][5] = T: "aab" vs c*a*b. Direct character match (b == b), so look at dp[2][4] = T. ✓
The recurrence holds. Tricky cases like dp[2][4] work because the one-or-more branch lets * consume an additional character without changing columns.
Step 7: Complexity
There are two indices in the state, i and j, ranging up to m and n respectively. That gives us O(m × n) distinct subproblems. Each subproblem does O(1) work - at most a couple of character comparisons and table lookups. By the "number of states × work per state" rule, the total runtime is O(m × n) time and O(m × n) space for the DP table.
This progression - brute-force recursion → memoization → bottom-up 2D DP - is the same pattern that shows up across most 2D string DP problems. What's specific to Regex Matching is the careful case split on *, and the unusual transition where dp[i-1][j] keeps the pattern column fixed. Once those two pieces click, the problem is mechanical.
5. Discuss Trade-Offs Between Solutions
Approach | Time | Space | When I'd use it |
Brute-force recursion | Exponential | O(m+n) stack | Never, but worth mentioning to frame the problem |
Recursion + memoization | O(m × n) | O(m × n) | Easiest to write if I'm short on time and the case split is fresh |
Bottom-up 2D DP | O(m × n) | O(m × n) | My default interview answer, predictable iteration order and easy to debug |
There's a potential O(n) space optimization using rolling rows (since dp[i][j] only reads from dp[i-1][...] and dp[i][...]), but the bookkeeping adds unnecessary complexity. It is still worth mention it as a possible follow-up, especially if asked about space.
6. Pseudocode
m = length of s, n = length of p
dp = (m+1) x (n+1) boolean table, all false
dp[0][0] = true
# Handle patterns that can match empty string: a*, a*b*, a*b*c*, ...
for j from 2 to n:
if p[j-1] == '*':
dp[0][j] = dp[0][j-2]
for i from 1 to m:
for j from 1 to n:
if p[j-1] == '.' or p[j-1] == s[i-1]:
dp[i][j] = dp[i-1][j-1]
else if p[j-1] == '*':
dp[i][j] = dp[i][j-2] # zero occurrences
if p[j-2] == '.' or p[j-2] == s[i-1]:
dp[i][j] = dp[i][j] or dp[i-1][j] # one or more occurrences
return dp[m][n]7. Edge Cases
Things to verify before claiming we're done:
Empty string, empty pattern → dp[0][0] = true. ✓
Empty string, pattern with only (char, *) pairs → handled by the boundary initialization. ✓
Non-empty string, empty pattern → dp[i][0] = false by default. ✓
Pattern .* against any string → . matches anything, * lets it repeat any number of times, so always true. ✓
Pattern starting with * → assumed invalid by problem constraints.
Repeated characters in pattern (e.g., a*) → assumed invalid.
Long alternations like a*a*a*a*aaaa against "aaaa" → handled correctly because every * has both a zero and one-or-more branch.
The boundary initialization on row 0 is the case most candidates get wrong. Walking through it explicitly during the interview earns trust.
8. Full Code
public class RegularExpressionMatching {
public static boolean isMatch(String s, String p) {
int m = s.length();
int n = p.length();
boolean[][] dp = new boolean[m + 1][n + 1];
dp[0][0] = true;
// Boundary: empty string can match patterns like
// a*, a*b*, a*b*c*, ... because every '*' can
// match zero of its preceding character.
for (int j = 2; j <= n; j++) {
if (p.charAt(j - 1) == '*') {
dp[0][j] = dp[0][j - 2];
}
}
for (int i = 1; i <= m; i++) {
for (int j = 1; j <= n; j++) {
char sc = s.charAt(i - 1);
char pc = p.charAt(j - 1);
if (pc == sc || pc == '.') {
// Direct character match (or '.'):
// rely on the previous prefix pair
dp[i][j] = dp[i - 1][j - 1];
} else if (pc == '*') {
char prev = p.charAt(j - 2);
// Option 1: '*' matches zero occurrences —
// skip the (char, *) pair
dp[i][j] = dp[i][j - 2];
// Option 2: '*' matches one or more — only
// valid if the preceding pattern character can
// match the current string character.
// Note: column stays at j because '*' is still
// "alive" for more matches.
if (prev == sc || prev == '.') {
dp[i][j] = dp[i][j] || dp[i - 1][j];
}
}
}
}
return dp[m][n];
}
}9. Test the Code
System.out.println(isMatch("aa", "a")); // false
System.out.println(isMatch("aa", "a*")); // true
System.out.println(isMatch("ab", ".*")); // true
System.out.println(isMatch("aab", "c*a*b")); // true
System.out.println(isMatch("mississippi", "mis*is*p*.")); // false
System.out.println(isMatch("", "")); // true
System.out.println(isMatch("", "a*")); // true
System.out.println(isMatch("a", ".*..a*")); // false10. Key Lessons
When a problem has an operator that introduces variable-length matching (*, +, ?), the DP transition for that operator usually has two branches: one where the operator matches zero/empty, and one where it consumes from the input. Enumerating those branches explicitly is the way to keep the case analysis clean.
A transition that stays on the same column (or same row) is unusual but valid. In Regex Matching, dp[i-1][j] keeps the pattern position fixed because * represents unbounded repetition. Whenever you encounter an "indefinite" operator, expect this kind of fixed-position transition.
Boundary cases for prefix DP aren't an afterthought, they encode genuine semantic information. dp[0][j] here captures the fact that patterns made entirely of (char, *) pairs can match the empty string. Skipping that initialization gives the wrong answer for inputs like ("", "a*b*").
Two-string DP problems almost always have state dp[i][j] over prefixes. Once you know that, the only real work is figuring out what cases the last position can fall into and what each case's transition looks like. Regex Matching has exactly two cases (non-* and *), and the * case has exactly two sub-branches.
When a specification feels "fuzzy" (zero or more, optional, any character), the way to make it precise is to enumerate what the operator can do at the current position and write a transition for each option. Don't try to handle all interpretations at once - handle them one branch at a time.
The thing that makes Regular Expression Matching click isn't deep regex knowledge. It's the discipline of slowing down and writing each branch as its own transition. Once you've done that cleanly, the problem stops feeling like regex and starts feeling like every other 2D string DP you've solved.
Good Luck and Happy Coding!
Comments