Database Elements Problem 2: Version Control of Init Routines

If the answer so far to the question "What is VISTA made of?" is only the tip of the iceberg when it comes to the complexities introduced by non-routine software elements, what else makes up this iceberg?
 
Flattening data and definitions into Init routines might result in convenient, flat structures (routines) that lend themselves to version control, but the purpose of these routines is to regenerate these data and definitions at the destination VISTA systems, not to support version control. An Init routine is not a snapshot of a file; it is software that regenerates that file in another VISTA system.
 
If our goal is to keep track of the Init routines for their own sake - and that is one of our goals - then loading them into Git or other version-control repositories makes good sense. If, however, our goal in so doing is to successfully version-control the data and definitions they transport, we're in for numerous disappointments.
 
There is hardly ever a one-to-one mapping between Init routines and files. Almost always, it takes many Init routines to represent one file. The number of Init routines it takes varies depending on (1) the amount of data and definitions to transport and (2) the routine-size settings for the MUMPS implementation used in the VISTA environment where the Init routines were generated.
 
Let's look at the first problem.
 
Imagine you want to compare two files that are nearly identical. One has 400 records. The other has the same 400 records plus one more, inserted near the top of its copy of the file. Ideally, a version-control system should show these two files as being nearly identical, with the exception of the one inserted record.
 
But if we set out to run that comparison by first generating Init routines and comparing them, we'd end up with a terrible signal-to-noise problem. Since Init routines store the software to regenerate data sequentially, the two sets of Init routines would resemble each other only until they reached the inserted record.
 
Thereafter, the lines of data-regenerating code would not just be shifted down. Each Init routine's maximum size is capped by the routine-size limits of the MUMPS implementation and of the MUMPS standard overall. Therefore, many of the lines will also be shifted out to a different routine. Version-control systems are designed to compare one "unit" at a time, where a unit is some kind of flat file with a given name, such as a routine. When the lines are moved off to a different routine with a different name, the version-control system loses the ability to tell that we're talking about the same lines. To the version-control system, they look like coincidentally similar but otherwise unrelated lines, and they will be reported as differences rather than as matches.
 
In other words, because the inserted record shifts all the remaining records so they start crossing routine boundaries and appearing in different routines, from very early on the version-control system is going to report these two nearly identical files as being almost entirely different.
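
To make the boundary shift concrete, here is a minimal sketch in Python (not MUMPS, and not how DIFROM actually works). It assumes a routine's size is simply the sum of its line lengths, that every record fits on one line, and that routines are compared pairwise by position, the way same-named routines would be. The function names and the 300-character limit are hypothetical, chosen only for illustration.

```python
def pack_into_routines(lines, max_chars):
    """Pour lines sequentially into routines, starting a new routine
    whenever the next line would push the current one past max_chars."""
    routines = [[]]
    used = 0
    for line in lines:
        if routines[-1] and used + len(line) > max_chars:
            routines.append([])
            used = 0
        routines[-1].append(line)
        used += len(line)
    return routines


def unchanged_routines(a, b):
    """Count routines that are identical at the same position in both sets."""
    return sum(1 for x, y in zip(a, b) if x == y)


# 400 records of 15 characters each; 20 of them fit in one 300-character routine.
original = [f"record {i:03d} data" for i in range(1, 401)]

# The same file with one record inserted after the 45th record.
modified = original[:45] + ["record NEW data"] + original[45:]

old_set = pack_into_routines(original, max_chars=300)
new_set = pack_into_routines(modified, max_chars=300)

# Only one record really differs between the two files...
really_different = set(modified) - set(original)

# ...but every routine from the insertion point onward no longer matches.
matching = unchanged_routines(old_set, new_set)
print(len(old_set), len(new_set), len(really_different), matching)
```

Under these assumptions, only the routines that come before the insertion point still match; every routine from that point on is reported as changed, even though a single record was added.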
 
Now let's look at the second problem.
 
A similar confusion happens if you have two identical files on two systems that use different routine-size limits. One of them might be inserting up to 10,000 characters at a time into its Init routines, the other up to 40,000 characters at a time into its Init routines. This too will change the boundaries where records are split between routines, so that a version-control comparison of the two sets of Init routines cannot recognize them after the first routine as being similar at all. Here there is no actual difference between the files, making the bad signal-to-noise ratio even more frustratingly gratuitous.
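
The same kind of sketch illustrates this second case. Again this is Python rather than MUMPS, with a simplified size rule (sum of line lengths) and hypothetical names. The 10,000- and 40,000-character limits come from the example above; the record layout is invented for the illustration.

```python
def pack_into_routines(lines, max_chars):
    """Pour lines sequentially into routines, starting a new routine
    whenever the next line would push the current one past max_chars."""
    routines = [[]]
    used = 0
    for line in lines:
        if routines[-1] and used + len(line) > max_chars:
            routines.append([])
            used = 0
        routines[-1].append(line)
        used += len(line)
    return routines


def flatten(routines):
    """Recover the original sequence of lines from a set of routines."""
    return [line for routine in routines for line in routine]


# One file: 2,000 records, each padded to exactly 100 characters.
records = [f"record {i:04d} ".ljust(100, "x") for i in range(1, 2001)]

# The same file packed on two systems with different routine-size limits.
system_a = pack_into_routines(records, max_chars=10_000)
system_b = pack_into_routines(records, max_chars=40_000)

# Routine-by-routine comparison: how many routines match at the same position?
identical = sum(1 for a, b in zip(system_a, system_b) if a == b)

# Content comparison: are the underlying records the same?
same_content = flatten(system_a) == flatten(system_b)

print(len(system_a), len(system_b), identical, same_content)
```

In this sketch the two sets of routines carry exactly the same records, yet no routine on one system is identical to its counterpart on the other, because the boundaries fall in different places.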
 
If MUMPS routines could conceivably be infinite in size - the way files can be - then we could revise DIFROM to always store one complete file in one Init routine. This would avoid the routine-boundary problem that comes from sequentially pouring a file's records into a series of routines, a process that is vulnerable to shifts in the boundaries.
 
But if we solved our version-control problem by doing that, we would create a portability problem, because no computer can store an infinite amount. Routines have limited maximum sizes so that we can be sure they will run on *every* MUMPS implementation on every operating system on every piece of hardware. Portability is far more important to VISTA than version control, so we are not going to sacrifice portability for version control.
 
Aside from these two problems, there are many more with this Init-routine approach to the version control of files.
 
Even if we did sacrifice portability for version control, Init routines are - again - not snapshots of files; they are code to regenerate those files elsewhere. They contain more than just data and definitions. They also include timestamps of the dates and times when they were generated, information about the environment where they were created, and so on. They contain software written dynamically by DIFROM to perform the regeneration itself; some of that software does not correspond to data or definitions from any file, and is inserted as needed throughout these Init routines. Then, too, most of our Init routines carry more than just a single file; all those other files mixed into the same suite of Init routines will also throw off any attempt to cleanly compare the same file from two specific systems. And there's more in Init routines than just files: other things (to be discussed later) that we would prefer to version control separately. All of these things are in the routines and will show up as differences, creating more signal-to-noise problems.
 
Even without all these additional problems, even if all we had to deal with were the first two, what we're running up against is a fundamental problem we will experience any time we try to use too indirect a method to accomplish our goals in computer science.
 
The version-control system has no way of knowing that the next routine is meant to be understood as a continuation of the first one. To contemporary version-control systems, each unit, each routine, is a completely independent unit of version control. What they mean - that they are meant to be understood as a sequence of lines of data and definitions - is something only human beings can know. No third-generation programming system is capable of encoding or comprehending meaning, and without meaning even the simplest shift in structure can completely confuse the software - including version-control software. This conceptual fragility stalks all of our efforts throughout the information-technology world, most notoriously when trying to create interfaces between different systems. This conceptual fragility is the state of the art in computer science.
 
No matter how cleverly we go about it, we're left with the inescapable conclusion that trying to do version control on files by generating Init routines for them and then version-controlling those routines is a bad idea - it's far too indirect and messy a way of comparing files - though doing version control on those same Init routines for their own sake, if we need them, is never a bad idea. We need something a lot more direct.
 
Besides, there are other issues with Init routines at present that explain why the VISTA community abandoned them in the late 1990s as the main way of transporting VISTA files. We still use them sometimes - and someday with some enhancements to Fileman and MUMPS (two of which were included in File Manager version 22.2) we will once again begin using them more often - but for now these other issues led us to invent a new way of distributing and installing VISTA software, including files: KIDS.
 
We will properly introduce KIDS next time.