We can now revisit a problem we saw earlier: interval scheduling. A generalised version of this problem assigns a weight $w_j$ to each interval $j$, and asks how we can efficiently maximise the total weight of the schedule.
Input: A set of intervals $\{1, \dots, n\}$ such that $s_j$ and $f_j$ denote the start and finish time of interval $j$. A map $w$ gives each interval $j$ a weight $w_j$.
Output: A subset $S \subseteq \{1, \dots, n\}$ of maximum total weight such that no two intervals in $S$ are incompatible.
We were able to solve the unweighted problem using a greedy approach. However, adding weights breaks that approach; instead we use dynamic programming. The main observation is that for each interval there are two choices: it is either included in the schedule or it isn't.
In the first case, we rule out any intervals which are incompatible with the included interval. Assuming the intervals are sorted by finish time, we can do this with a map $p$ which gives, for each interval $j$, the latest interval $p(j)$ such that $p(j)$ is compatible with $j$ and $p(j) < j$.
In the second case, our optimal solution is the same as the optimal solution where we ignore the current interval.
With these two options, we can define the value $OPT(j)$ of an optimal solution which considers intervals $1, \dots, j$ as the following:

$$OPT(j) = \max\bigl(w_j + OPT(p(j)),\ OPT(j-1)\bigr), \qquad OPT(0) = 0$$
The formula for each $OPT(j)$ only accesses values of $OPT$ for earlier intervals; namely, $OPT(p(j))$ and $OPT(j-1)$. This means we can compute $OPT$ from the bottom up, starting with $OPT(0) = 0$, in linear time.
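As a concrete sketch of this bottom-up computation (the function and variable names here are illustrative, not from the notes; $p$ is found by binary search over finish times):

```python
from bisect import bisect_right

def max_weight_schedule(intervals):
    """intervals: list of (start, finish, weight) tuples.
    Returns the maximum total weight of a compatible subset."""
    # Sort by finish time so that p(j) is well defined.
    intervals = sorted(intervals, key=lambda iv: iv[1])
    n = len(intervals)
    finishes = [iv[1] for iv in intervals]
    # p[j] = number of intervals finishing no later than interval j
    # starts, i.e. the index of the latest compatible interval.
    p = [bisect_right(finishes, intervals[j][0]) for j in range(n)]
    # opt[j] = best total weight using only the first j intervals.
    opt = [0] * (n + 1)
    for j in range(1, n + 1):
        _, _, w = intervals[j - 1]
        opt[j] = max(opt[j - 1], w + opt[p[j - 1]])
    return opt[n]
```

Sorting dominates the running time, so the whole procedure runs in $O(n \log n)$.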
Let us now consider the problem of determining how similar two sequences, $X$ and $Y$, are, where $X, Y \in \Sigma^*$ and $\Sigma$ is a finite alphabet.
Note
The notation $\Sigma^*$ indicates the cartesian product of the alphabet $\Sigma$ of arbitrary finite degree. In simpler terms, $X$ and $Y$ are finite sequences of letters from the alphabet $\Sigma$.
A common application of this is search suggestion. When you misspell something when using a search engine, you usually get a "did you mean…?" response after the engine determines, via sequence alignment, that the two words are similar.
While a misspelled word may be intuitively similar to the intended word, it is not clear how to represent this formally. The predominant model is to consider the Levenshtein Distance between the sequences of letters.
definition Levenshtein Distance
The Levenshtein Distance (or Edit Distance) between two sequences, denoted $d(X, Y)$, is the minimum cost of an alignment of $X$ and $Y$.
To understand this, we must also understand the terms cost and alignment:
definition Alignment
An alignment of two sequences $X = x_1 \cdots x_m$ and $Y = y_1 \cdots y_n$ is a set of pairs such that each pair is either $(x_i, y_j)$, $(x_i, -)$, or $(-, y_j)$, where $-$ denotes a gap. We can think of this as matching each letter in one sequence to a letter in the other sequence, or to a gap.
If our sequences are identical, it is clear there would exist an alignment with no gaps or mismatches. Conversely, two very different sequences are likely to have many gaps/mismatches.
If we assign a penalty to each gap or mismatch, we can then quantitatively describe how similar two sequences may be. Levenshtein proposes exactly this:
- For each gap, assign a penalty $\delta$
- For each mismatch $(p, q)$ with $p \neq q$, assign a penalty $\alpha_{pq}$
definition Cost
The cost of an alignment $M$, denoted $cost(M)$, is the sum of the gap penalties and mismatch penalties in $M$.
Hence, we can say that:

$$d(X, Y) = \min_{M} \; cost(M)$$

where the minimum is taken over all alignments $M$ of $X$ and $Y$.
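A small illustrative helper makes the cost definition concrete (the `GAP` marker and function name are assumptions for this sketch, not from the notes):

```python
GAP = None  # illustrative marker for aligning a letter against a gap

def alignment_cost(pairs, delta, alpha):
    """pairs: list of (x, y) where either side may be GAP.
    delta: gap penalty; alpha: dict of mismatch penalties."""
    total = 0
    for x, y in pairs:
        if x is GAP or y is GAP:
            total += delta          # gap penalty
        elif x != y:
            total += alpha[(x, y)]  # mismatch penalty
        # matched identical letters cost nothing
    return total
```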
We can apply dynamic programming to the problem of finding an alignment which minimises $cost(M)$ in the following way. At each step, we minimise over three choices:
- Match the next two characters of each word together, thereby adding $\alpha_{x_i y_j}$ to the edit distance. This leaves us with $i - 1$ and $j - 1$ unmatched characters in each word respectively (given we had $i$ and $j$ previously)
- Match the next character in the first word to a blank, thereby adding $\delta$ to the edit distance
- Match the next character in the second word to a blank, again adding $\delta$ to the edit distance
This can be represented by the following recursion:

$$OPT(i, j) = \min\bigl(\alpha_{x_i y_j} + OPT(i-1, j-1),\ \delta + OPT(i-1, j),\ \delta + OPT(i, j-1)\bigr)$$

with base cases $OPT(i, 0) = i\delta$ and $OPT(0, j) = j\delta$.
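The recursion can be evaluated bottom-up over a table; a minimal sketch, assuming the classic unit penalties ($\delta = 1$, $\alpha_{pq} = 1$ for $p \neq q$) as illustrative defaults:

```python
def edit_distance(X, Y, delta=1, alpha=lambda p, q: 0 if p == q else 1):
    """Bottom-up evaluation of the alignment recurrence.
    delta is the gap penalty; alpha(p, q) the mismatch penalty
    (the defaults give the classic Levenshtein distance)."""
    m, n = len(X), len(Y)
    # opt[i][j] = min cost of aligning X[:i] with Y[:j]
    opt = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        opt[i][0] = i * delta
    for j in range(n + 1):
        opt[0][j] = j * delta
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            opt[i][j] = min(
                alpha(X[i - 1], Y[j - 1]) + opt[i - 1][j - 1],  # match/mismatch
                delta + opt[i - 1][j],                          # blank in Y
                delta + opt[i][j - 1],                          # blank in X
            )
    return opt[m][n]
```

This uses $O(mn)$ time and, as written, $O(mn)$ space — the space cost is exactly what the next section improves.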
If we observe that there is a bijection between alignments and paths between two nodes in a graph, we can compute the value of an optimal alignment in linear space.
The graph is constructed as follows:
- We have a node for each pair $(i, j)$ with $0 \le i \le m$ and $0 \le j \le n$
- From each node $(i, j)$ there are up to three outgoing edges:
- If $i < m$ and $j < n$, there is an edge to $(i+1, j+1)$ corresponding to matching the $(i+1)$-th and $(j+1)$-th characters of $X$ and $Y$ respectively. This is labelled with the cost $\alpha_{x_{i+1} y_{j+1}}$
- If $i < m$, there is an edge to $(i+1, j)$ corresponding to matching the $(i+1)$-th character of $X$ with a blank. This is labelled with $\delta$
- If $j < n$, there is an edge to $(i, j+1)$ corresponding to matching the $(j+1)$-th character of $Y$ with a blank. This is labelled with $\delta$
Any path from the node $(0, 0)$ to $(m, n)$ corresponds to an alignment along with its cost, and the shortest such path corresponds to an optimal alignment.
Using this, we can calculate the value of the shortest path from $(0, 0)$ to each node row-by-row, starting at the bottom. Since no edge crosses more than 2 rows, we only need to "remember" 2 rows of information at a time, allowing for a linear-space algorithm:
**Space-Efficient-Alignment(X, Y)**
- Let $B$ be a $2 \times (n+1)$ array;
- For $j = 0, \dots, n$ do $B[0, j] \leftarrow j\delta$;
- for $i = 1, \dots, m$ do
  - $B[1, 0] \leftarrow i\delta$;
  - for $j = 1, \dots, n$ do
    - $B[1, j] \leftarrow \min\bigl(\alpha_{x_i y_j} + B[0, j-1],\ \delta + B[0, j],\ \delta + B[1, j-1]\bigr)$
  - end
  - for $j = 0, \dots, n$ do $B[0, j] \leftarrow B[1, j]$;
- end

After the final iteration, $B[0, n]$ holds the edit distance $d(X, Y)$.
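The same idea in runnable form, keeping only two rows of the table at a time (a sketch; the default penalties are the illustrative unit costs):

```python
def space_efficient_alignment(X, Y, delta=1,
                              alpha=lambda p, q: 0 if p == q else 1):
    """Edit-distance value in O(n) space: only the previous and
    current rows of the DP table are kept."""
    n = len(Y)
    prev = [j * delta for j in range(n + 1)]  # row i = 0
    for i in range(1, len(X) + 1):
        cur = [i * delta] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                alpha(X[i - 1], Y[j - 1]) + prev[j - 1],  # match/mismatch
                delta + prev[j],                          # blank in Y
                delta + cur[j - 1],                       # blank in X
            )
        prev = cur
    return prev[n]
```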
However, in doing this we have no way to actually recover the alignment; we only obtain the value of the edit distance.
Using a different approach, we can recover the alignment itself whilst maintaining our space efficiency. Any path between $(0, 0)$ and $(m, n)$ must pass through at least one node in each column.
Suppose we find a node on the path at $(q, n/2)$, where $n/2$ gives the middle column, and $q$ denotes the row where the path crosses it. We now have 2 subproblems:
- Finding a path between $(0, 0)$ and $(q, n/2)$
- Finding a path between $(q, n/2)$ and $(m, n)$
If we can perform this divide-and-conquer step in linear space, we can apply our original “space-inefficient” algorithm once the subproblems fall below a certain fixed size.
So the question becomes: how do we find such a node in linear space? Recall that we already have an algorithm for calculating the shortest-path values from $(0, 0)$ to a given row/column of nodes (space-efficient alignment). It is easy to adapt this algorithm into another which gives the shortest-path values from a given row/column of nodes to $(m, n)$ (backwards space-efficient alignment). If we calculate these two values for column $n/2$, their sum gives, for each node in that column, the length of the shortest path passing through it. We can then just pick the node with the lowest value.
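The split-finding step can be sketched as follows (function names are illustrative; `forward_column` computes, for every row $i$, the shortest-path value from $(0,0)$ to $(i, n/2)$, and running it on the reversed strings gives the backward values):

```python
def forward_column(X, Y, delta, alpha):
    """col[i] = min cost of aligning X[:i] with all of Y, i.e. the
    shortest-path value from (0, 0) to (i, len(Y))."""
    m = len(X)
    col = [i * delta for i in range(m + 1)]
    for j in range(1, len(Y) + 1):
        new = [j * delta] + [0] * m
        for i in range(1, m + 1):
            new[i] = min(alpha(X[i - 1], Y[j - 1]) + col[i - 1],
                         delta + new[i - 1],
                         delta + col[i])
        col = new
    return col

def split_row(X, Y, delta=1, alpha=lambda p, q: 0 if p == q else 1):
    """Row q at which some optimal path crosses the middle column."""
    mid = len(Y) // 2
    f = forward_column(X, Y[:mid], delta, alpha)  # values from (0, 0)
    # Backward values: align the reversed strings, then flip the column.
    g = forward_column(X[::-1], Y[mid:][::-1], delta, alpha)[::-1]
    return min(range(len(f)), key=lambda i: f[i] + g[i])
```

Recursing on the two halves around `split_row` and concatenating the results is the essence of the divide-and-conquer (Hirschberg-style) linear-space alignment.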
While efficient, Dijkstra’s Algorithm quickly runs into problems when considering graphs with negative edge-costs and/or negative cycles. We’ll consider a dynamic programming approach which can solve the first case, and later extend it to detect the second.
To begin, let’s define our problem:
Input: A weighted digraph $G = (V, E)$, with a cost $c_{vw}$ on each edge $(v, w) \in E$, and a target vertex $t \in V$
Output: If $G$ has no negative cycles, then for each $v \in V$, return:
- The minimum cost of a $v$–$t$ path
- The min-cost path itself
Claim
If $G$ has no negative cycles, then there is a min-cost $v$–$t$ path that is simple
Proof
Let $P = P_1 C P_2$ be a min-cost path with a cycle $C$ which starts and ends at an intermediate node $u$, where $P_1$ denotes the path from $v$ to $u$ and $P_2$ denotes the path from $u$ to $t$.
Then $P' = P_1 P_2$ is also a $v$–$t$ path, with cost:

$$c(P') = c(P) - c(C) \le c(P)$$

since $c(C) \ge 0$ by the assumption that $G$ has no negative cycles. If $c(C) > 0$, then $P'$ is a lower-cost path, contradicting the minimality of $P$; if $c(C) = 0$, removing the cycle costs nothing, and repeating this yields a simple min-cost path. This completes the proof.
An important consequence of the above claim is that a shortest path between any two nodes has at most $n - 1$ edges, since otherwise it must contain a cycle. This allows us to apply dynamic programming by recursing on the length of the shortest path.
Suppose we are interested in finding the length of a shortest path between nodes $v$ and $t$. The shortest path which uses at most $i$ edges is either:
- The same as the shortest path which uses at most $i - 1$ edges, if it uses fewer than $i$ edges.
- The cost $c_{vw}$ of an edge to a neighbour $w$ of $v$, plus the cost of the shortest path from $w$ to $t$ which uses at most $i - 1$ edges.
In this way, we "use up" edges one at a time, minimising over the length of the path at each step, giving us the shortest path length overall. As a recursion, we can express the optimal length $OPT(i, v)$ of a $v$–$t$ path which uses at most $i$ edges as the following:

$$OPT(i, v) = \min\Bigl(OPT(i-1, v),\ \min_{(v, w) \in E}\bigl(c_{vw} + OPT(i-1, w)\bigr)\Bigr)$$

with base cases $OPT(0, t) = 0$ and $OPT(0, v) = \infty$ for $v \neq t$.
The algorithm, due to Richard Bellman and Lester Ford Jr., solves the single-target shortest path (STSP) problem and is built around the recursion:
**Bellman-Ford(G, t)**
- Let $M$ be an $n \times n$ array;
- For $v \in V$ do $M[0, v] \leftarrow \infty$;
- $M[0, t] \leftarrow 0$;
- for $i = 1, \dots, n - 1$ do
  - for $v \in V$ do
    - $M[i, v] \leftarrow \min\bigl(M[i-1, v],\ \min_{(v, w) \in E}(c_{vw} + M[i-1, w])\bigr)$
  - end
- end
- for $v \in V$ do output $M[n-1, v]$;
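A runnable sketch of the same recursion (names and edge representation are illustrative; it relaxes edges towards the target $t$, and includes the standard extra pass that flags a negative cycle — the extension alluded to earlier):

```python
def bellman_ford(n, edges, t):
    """Min-cost path values from every node to target t.
    Assumes nodes are 0..n-1 and edges is a list of (u, v, cost).
    Returns (dist, successor), or None if a negative cycle exists."""
    INF = float("inf")
    dist = [INF] * n
    succ = [None] * n  # next hop on a min-cost path towards t
    dist[t] = 0
    for _ in range(n - 1):          # a shortest path uses <= n-1 edges
        updated = False
        for u, v, cost in edges:
            if dist[v] + cost < dist[u]:
                dist[u] = dist[v] + cost
                succ[u] = v
                updated = True
        if not updated:             # early exit once values stabilise
            break
    # One more pass: any further improvement implies a negative cycle.
    for u, v, cost in edges:
        if dist[v] + cost < dist[u]:
            return None
    return dist, succ
```

Following `succ` pointers from any node $v$ recovers the min-cost path itself, satisfying the second output requirement.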