‘Bots are off limits this week so here is a story from this year’s Iverson College – a fantastic week spent in the company of a wonderful mixture of array and functional language gurus and newbies, all learning from each other. One evening, Dhru Patel presented a problem that he was working on which involved displaying the results of a “diff” side by side with the matched rows aligned. For example, the input might be:
OLD NEW
┌────────┬────────┐
│This │This │
│is the │original│
│original│is not │
│text │ │
└────────┴────────┘
Edited by Yoda, the text was. The selection of rows to be aligned requires finding the longest sequence of rows from the original data that matches rows in the edited data without ever skipping backwards through either sets of data. In this case, it is the original first and third row, matching the first two edited rows. We will return to how we might identify these rows in a future blog entry (meanwhile, try to grok John Scholes’ YouTube video on Depth First Searching and think about it). For now, we will provide this information as a left argument, a vector of Boolean vectors that marks the location of the matched rows:
(1 0 1 0)(1 1 0) AlignMatched OLD NEW
┌────────┬────────┐
│This │This │
│is the │ │
│original│original│
│text │ │
│ │is not │
└────────┴────────┘
At Iverson College there was general agreement that “there should be a non-looping solution”, and Devon McCormick immediately stated that he would bet that it involved grade (⍋
). Let us explore:
masks←(
1 0 1 0)(1 1 0)
⎕←matched←∊masks ⍝ The two match masks catenated together 1 0 1 0 1 1 0 ⎕←origin←(≢¨OLD NEW)/0 1 ⍝ 0 for items from the old array, 1 for new 0 0 0 0 1 1 1 ⎕←block←∊+\¨masks ⍝ running count of matched rows 1 1 2 2 1 2 2 ⎕←data←block,(~matched),origin,OLD⍪NEW 1 0 0 This 1 1 0 is the 2 0 0 original 2 1 0 text 1 0 1 This 2 0 1 original 2 1 1 is not
Our goal is to create an expansion mask for each argument; this is going to insert blank lines at the points where a non-matched row from the other argument is included. The next step is to reorder everything by ascending block number, and within each block move the matched rows to the front (ascending by ~matched), as follows:
⎕←data←data[⍋data[;1 2];]
1 0 0 This
1 0 1 This
1 1 0 is the
2 0 0 original
2 0 1 original
2 1 0 text
2 1 1 is not
This contains all the items from both texts in the order that they would need to appear in the final results. We can extract the reordered flag vectors:
matched←~data[;2] ⋄ origin←data[;3]
Now, (origin=0) is an expansion mask that would expand OLD to match the above, and (origin=1) would do the same for NEW:
0 1 {(origin=⍺)⍀⍵}¨OLD NEW
┌────────┬────────┐
│This │ │
│ │This │
│is the │ │
│original│ │
│ │original│
│text │ │
│ │is not │
└────────┴────────┘
To align the matched rows, we need to eliminate inserted blanks that correspond to matched rows from the other side:
0 1{((~matched∧origin≠⍺)/origin=⍺)⍀⍵}¨OLD NEW
┌────────┬────────┐
│This │This │
│is the │ │
│original│original│
│text │ │
│ │is not │
└────────┴────────┘
In the final function, which collects the relevant lines of code from our experiments, we do not create the temporary “data” matrix, but reorder the origin and matched vectors individually – and instead of doing grade up on a two-column matrix, we compute an integer vector that will sort by descending matched within ascending block (because we know that grade up on a simple small-range integer vector will run like greased lightning):
AlignMatched←{ ⍝ align matched rows of 1⊃⍵ and 2⊃⍵
matched←∊⍺ ⍝ matches are marked by 1⊃⍺ and 2⊃⍺
origin←(≢¨⍵)/0 1 ⍝ identify origin of items in matched (0=old. 1=new)
block←∊+\¨⍺ ⍝ running count of matched rows
order←⍋(2×block)-matched ⍝ Order so matching rows are adjacent and order of
(origin matched)←(⊂⊂order)⌷¨origin matched ⍝ items following matched row is preserved
0 1{((~matched∧origin≠⍺)/origin=⍺)⍀⍵}¨⍵ ⍝ Expand exluding matched from "other" list
}
Reviving Lost Arts
The algorithm above makes use of techniques that were well-known in APL circles in the 1980s, but atrophied after nested arrays arrived on the scene and applications tended to keep parts of the data in separate leaves of an array rather than using simple data structures.
If you would like to read up on some of the old techniques, you might enjoy browsing the FinnAPL Idiom Library, and Bob Smith’s immortal Boolean Functions and Techniques. Although nested arrays might have made them a little less relevant in the 90s and 00s, the search for high-performance parallel solutions could bring them back, as explained in this session from Dyalog ’12, on Segmented Scans and Nested Data Parallelism by Andrzej Filinski.
4 Responses
When I recall looking at this problem back in the day, I thought in terms of getting an index vector such that
would do the alignment. Much to be said for a well-manipulated \.
This works:
where:
(Of course, OLD and NEW need to be the same width.)
I didn’t recall using upgrade when I first solved this, so I went through the exercise again. It wasn’t needed.
The use of upgrade, which does work, may have led to the clever use of the foo¨0 1 pattern as a separator.
(this site needs an ‘Edit Comment’ feature.)
The full solution, which does NOT work when either OLD or NEW has duplicated lines:
The blog is a fairly standard WordPress site and I have to agree with you that editing comments would be nice – we will look into it.
I cannot get your expression to work, I get a LENGTH ERROR. When I look closer:
The mask on the left has 3 ones and cannot be used to expand OLD, which has 4 rows. Even if you reverse the order and use it to expand NEW, the order is incorrrect. Can you post a full worked example with output (or send it to me and I will write it up)?
The example data may have given the impression that the files might be unique. I did try to suggest that finding the longest common subsequence between any input file and an edited version of it was a difficult task, requiring recursion – but that was clearly too subtle – you are not the only person to make the wrong assumption. I do apologise!
I aim to describe my attempts to write a function to find the “longest common subsequence” in a blog post within the next few days.