David McCracken
Questions (updated 2018.09.07)
My experience can be parsed into easily identified areas of expertise, such as C++, USB/I2C/SPI, device drivers, firmware, medical instruments, etc. I have done substantial successful work, much of it original, in these areas. I have frequently extrapolated beyond standard usage models to bring out significant new capabilities. However, these are just tools. That I understand and fully exploit my tools makes me an expert in their use, but the tools themselves don’t define me. It doesn’t bother me to discard or replace a tool or to get a new one and become expert in its use. It is part of the fun of being an engineer.
My core strength is that I am a toolmaker. This is superficially obvious from specific work, for example my Z8 in-circuit emulator (my first commercial success). My colleagues at Hitachi explicitly called me a toolmaker because of my propensity to solve their problems by developing the tools for them to solve their own problems. A “tools first” attitude infuses everything that I do, although the influence is not always obvious. I use an analytical meta-tool hierarchy to design programs. At the highest level, I look for repeating patterns. At the next level, I design little languages for describing these patterns generically, just as a grammar, itself described by a meta-language, describes how specific sentences of a particular language can be composed without stating a single instance of such a sentence. When I get down to the level of writing code I appeal to general computation analysis, particularly recognizing the superiority of an algorithm over all other forms, and of table-driven over control flow. This guides my work at all levels, from a single statement to the most complex multi-domain system of systems.
A common misconception is that my approach takes too much time. My experience says otherwise, and I might doubt that experience alone if design methodologists did not broadly agree that up-front effort pays for itself many times over and that a “tools first” approach increases productivity.
I have trouble making intuitive decisions. I can’t overcome this weakness but I have learned compensating techniques, which are valuable in themselves. I often reduce even very complex and apparently indeterminate questions to a formula with a numeric answer and I have no trouble selecting the 51% answer over the 49% answer. I thoroughly investigate all questions, documenting every piece of information that argues for one answer over another. This usually leads to one answer being the overwhelmingly best choice but, in any case, affords me a record that I can reread whenever I begin doubting my decision. And I am a programmer. I can change my decisions whenever I want, significantly reducing my hesitation to decide without absolute justification.
As valuable as my compensating techniques have proven to be, they don’t correct my underlying weakness and sometimes an intuitive decision must be made. At that point, I ask someone else to do it, for example my boss at Abbott. He liked that I recognized my own weakness and would take advantage of his strength in making intuitive decisions. He also knew that the decisions I did make were justified, because it would be agonizing for me to do otherwise.
One consequence of this weakness is that I don’t know what to do with myself. Fortunately, others have known what to do with me. They just give me a problem to solve and I’m happy. It doesn’t matter to me whether it is to write one line of code or to design an enormously complex system of systems. My work does include an unusually broad range of very hard problems. I have not sought these out and I don’t think they would have been presented to me except that other engineers, often seeming to have more appropriate expertise, had failed to solve them. Basically, I have gotten “left-overs”, which ironically have turned out to be the kinds of opportunities that engineers more sure of their own direction strive for.
Programmers fall into three different groups on this question. Some create application-specific classes by derivation from existing classes. Some use object-oriented analysis to design original classes that accurately express important characteristics of the system or problem. I am comfortable in either of these groups. Programmers in the third group paste an OO veneer over a fundamentally non-OO program and justify the bigger, slower, and more complex result by claiming that it is “more object-oriented.”
I derive or design new classes as appropriate to the task. For an Android application, I would derive classes from the existing framework, which is purposefully constrained and complete. If I were writing an application to operate within a proprietary OO framework, I would derive classes. But when my task is to solve a problem, I use fundamental object-oriented analysis to develop classes that reflect the nature of the system. This is hard to do and not something that can be learned by rote or conferred by certificate any more than someone can be trained to be a theoretical mathematician. It is possible to teach methods of problem solving but their effective use requires insight and an ability to see patterns that are not obvious until pointed out by someone with such insight. I have been taught formal problem solving methods as well as the mechanics of programming with classes but these would be of little value without a native ability to see hidden patterns.
My ability to see underlying structures and classes that are not immediately apparent enables me to solve very complex problems, often in much simpler (and cheaper) ways than other engineers. As the system architect and software lead on Hitachi’s 747 clinical chemistry analyzer, I was expected to solve an apparently isolated problem (Buzz Off) of simply not having enough keys on the instrument’s keyboard. The prevailing opinion was that we needed to add hardware but I saw the specific problem as symptomatic of a broader user interface problem. My object-oriented design fixed the specified problem without adding hardware even while significantly reducing code size. It also corrected widespread user discomfort that had not reached the point of specific complaints. The most common user reaction to my new design was “This is what I always wanted but didn’t know how to ask for.” This sort of forward error correction often results from good OO analysis, because the specified requirement is frequently just the most important one in a class of similar issues.
In my instrument development system for Abbott I developed an object-oriented instrument specification system enabling virtually all of the hardware and process control characteristics of a complex machine to be defined as instances of generic classes. The knowledge of these classes was embedded in execution units, which comprised several different CPUs programmed in assembler and C; in the control and debugging computer, programmed in C++; and in the two languages, one for process control and the other for system specification, that I designed and implemented using C, BNF (Bison/Yacc), and LEX (Flex). My OO analysis was driven by the application itself and needed no guidance from a specific OO language. In a dramatic example of the potential of OO analysis, the programming to support a new process, which previously required a team of programmers working for six months, was now done by one non-programmer in less than 10 minutes (See Deployment).
I don’t expect every assignment to reap such striking benefits. Some are more constrained, preventing global optimization, as Stroustrup discusses in his criticism of the waterfall model. Some have more limited scope or the benefits may be more subtle. My unified USB-serial touchscreen device driver for WinCE at Elo is an example of this. Realizing that more than half (in fact 75%) of the driver is essentially independent of the transport type, I designed a generic driver class with the transport as a derived class. No customer was aware of what I had done but they were aware that our USB and serial drivers started behaving identically whereas in the past they never knew what to expect.
See my Big Data Analytics Design Principles.
I have considerable practical experience and success solving big data problems but equally important is that my normal approach to programming is well suited to these kinds of problems. They defy cookie-cutter solutions but may yield easily to more fundamental analysis, in which I am very strong. I also am naturally inclined to follow scientific methods, particularly in devising experiments that can definitively answer questions to weed out provably wrong solutions.
The multi-term control theory paradigm that I often use in my programming, for example in my blood cell sorting and touch pointer acceleration algorithms, yields reliable results even with very messy data. I am adept at presenting complex data in ways that make it comprehensible without imposing preconceived patterns, for example in my flow cytometer data viewer (Dataman) and hover touch flick algorithm development process.
Cumulative statistics provide the simplest answer. Considering only major programs that I have developed alone, I have written 541 files; 176,855 lines; 70,396 statements; 2,540 functions; 196 classes. This is about 20% C and 80% C++. My programming contributions to group efforts would double these numbers. These statistics don’t really tell very much. A programmer assiduously wrapping every intrinsic in a class with only setters and getters may appear to have considerable experience while actually having no useful experience.
Cumulative code statistics are especially misleading about my experience because, although I am well versed in C++ details, I use many of its features judiciously. I particularly think that algorithms achieve much better polymorphism than do virtual functions, which are really just glorified control flow. One of my strengths is an ability to devise algorithms where everyone else sees only control flow solutions. For examples see my transformation of a huge (and maintenance nightmare) control flow program to a single table-driven statement for the Abbott CD3200; or the multi-term control formula, including terms that essentially quantify user frustration, that I developed to produce ideal displacement acceleration in my miniature capacitive touch pointer for IDT.
Real gains come not from using everything that C++ offers in every situation but in being able to develop effective classes, which requires object-oriented analysis, irrespective of the programming language. I have always done this. I don’t need C++ but I like that it affords useful features to express my intent in a standard form that most programmers understand.
The largest program that I have developed in which C++ plays a prominent role beyond simply effectively representing my object-oriented design is my unified USB-serial touchscreen WinCE driver for Elo. Unlike many other operating systems WinCE is happy with C++ at the kernel level although the usual driver programming caveats still apply. Statistically, this program comprises 75 files; 13,620 lines; 5,374 statements; 81 classes with an average of 5.85 methods per class and 3.1 statements per method; and 156 functions [Code stats]. Certain C++ features are especially valuable here. To achieve my goal of making the serial and USB (actually any physical link) drivers the last derived class from a link-agnostic base, I combine many domains without exposing them all to each other. For several of these, multiple inheritance affords significant domain containment without arduous semantic tricks. For all of them, the ability to move members up and down the class stack during development without having to rewrite any usage code (due to the simpler access syntax of inheritance vs. aggregation) has been very helpful. Link-specialized pure virtual functions deep in the class stack make it easy for me to extend the link-agnostic base much further than I could without them, resulting in significantly more shared vs. link-specific code.
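The following is a minimal sketch, using illustrative class and method names rather than the actual Elo code, of the shape of that design: a link-agnostic base whose link-specific behavior is supplied through pure virtual functions by the last derived class.

```cpp
// Illustrative sketch only: a link-agnostic touchscreen driver base with the
// physical link supplied by the last derived class. Names are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

class TouchDriverBase {
public:
    virtual ~TouchDriverBase() = default;

    // Shared, link-agnostic logic: fetch a raw packet and turn it into a touch event.
    void ProcessIncoming() {
        std::vector<uint8_t> pkt;
        if (ReadPacket(pkt))        // resolved by the link-specific derived class
            DispatchTouch(pkt);     // identical for serial and USB
    }

protected:
    // Link-specialized operations, pure virtual so the link-agnostic base can be
    // extended without knowing anything about the physical transport.
    virtual bool ReadPacket(std::vector<uint8_t>& out) = 0;
    virtual bool WriteCommand(const uint8_t* cmd, std::size_t len) = 0;

private:
    void DispatchTouch(const std::vector<uint8_t>&) { /* decode, filter, report */ }
};

class SerialTouchDriver : public TouchDriverBase {
protected:
    bool ReadPacket(std::vector<uint8_t>&) override { /* UART framing */ return false; }
    bool WriteCommand(const uint8_t*, std::size_t) override { /* UART write */ return true; }
};

class UsbTouchDriver : public TouchDriverBase {
protected:
    bool ReadPacket(std::vector<uint8_t>&) override { /* HID input report */ return false; }
    bool WriteCommand(const uint8_t*, std::size_t) override { /* vendor request */ return true; }
};
```

With this arrangement, members can migrate up or down the class stack during development while every caller keeps the same inherited access syntax, which is the flexibility described above.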
Drivers and Applications in Windows 7 (32/64), XP, 2K, NT, ME, 98, 95, CE, TabletPC, XPe; Linux (Fedora, Ubuntu, Android); MSDOS
Most of my recent commercial programming (IDT, Elo, Abbott Instrument Development System, Abbott CD3200) has been for Windows XP applications and drivers. In earlier work, I wrote applications and drivers for NT/2K and all versions of 9X. I worked in MSDOS at Microgenics. For Hitachi’s 747 clinical analyzer I developed a threaded OS on top of DOS (real and extended) with both applications and drivers (see my Dr. Dobb’s Journal article Software Partitioning For Multitasking Communication). At Elo I worked extensively on WinCE drivers and applications, as well as on drivers for embedded XP and Tablet PC. Most of my programming for these operating systems has been entirely original design.
On Linux I have done only a small amount of commercial programming, a touchscreen driver for Android (Linux kernel 2.6) at IDT. In my study of Linux I wrote significant programs (Linux Examples). I particularly wanted to investigate IPC (Inter-Process Communication) facilities both in System V and Posix and Linux-specific network programming. My programs are original designs intended to test and demonstrate the operation of select, sockets (UNIX and INET domains; STREAM, DGRAM, and RAW types), services (e.g. daytime), System V semaphore, Posix semaphore and mutex, Posix threads, pipes, fifo, signal, stat, lstat, and segment fault. These programs dig deep into the topics, for example demonstrating how Mesa-type Posix semaphores are unsuited for certain situations where the Hoare-type System V semaphores perform correctly.
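A minimal sketch of the Mesa-style discipline that Posix imposes (illustrative variable names, not one of my Linux example programs): a woken waiter has no guarantee that the condition still holds when it finally runs, so it must re-check its predicate in a loop, whereas a Hoare-style System V semaphore hands the resource directly to the waiter.

```cpp
// Mesa-style Posix semantics: a woken waiter must re-check its predicate in a
// loop because the condition may no longer hold by the time it runs.
#include <pthread.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int items = 0;                      // the shared predicate

void consume_one() {
    pthread_mutex_lock(&lock);
    while (items == 0)                     // a loop, not an "if": Mesa semantics
        pthread_cond_wait(&ready, &lock);  // another consumer may run first
    --items;
    pthread_mutex_unlock(&lock);
}

void produce_one() {
    pthread_mutex_lock(&lock);
    ++items;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}
```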
I have written significant BASH scripts for Linux, and BAT and WSH-VBS scripts for Windows. For scripting in general see Programming-Scripting.
At Elo I designed original serial and USB HID and Vendor-specified touchscreen drivers for WinCE 4.2, 5.0, and 6.0 for x86, MIPS, and ARM. Using object-oriented analysis and programming in C++, I made the serial and USB drivers as derived classes of a link-agnostic base driver, enabling them to share 75% of their code. [Code stats]
Also at Elo I wrote a Windows TabletPC USB filter driver in which I made a HID-mouse touchscreen mimic a HID-digitizer by intercepting the device descriptor input during initialization and touch event messages during use and modifying them before passing them up the device stack.
At Elo, an existing XP USB isochronous driver had persistent system crashing problems. Some of these were caused by ordinary driver coding errors, which I corrected, but many were caused by the device extension’s sudden disappearance on plug-play and power events. To assist all USB drivers in dealing with this problem, I developed a P+P-safe WDM USB driver framework library. Eventually, Microsoft created the WDF driver framework to address the problem.
For Abbott I developed a high-performance XP driver system that provides RDMA (Remote DMA) with zero data copy, no ring transitions, and no semaphore overhead. It supports multiple simultaneous applications, which can connect and disconnect at will and receive only the messages that they register for (content-based routing in the kernel driver). The driver comprises a kernel SYS and an application DLL shared by all applications. The kernel driver is componentized to reduce the effort to support different physical links. I created drivers for two Abbott proprietary links, HDLC and ECP, but the driver is ready to support directly shared memory, TCP/IP, USB, and others.
See [Windows Device Driver]
[Code stats]
[Multitasking programming]
For Hitachi I developed polymorphic table-driven x86 assembly language device drivers. Application-level code changes table entries to provide dynamic anticipatory strategic responses, enabling the flexibility of a very fast RTOS with no overhead. I describe this and my lightweight threads for DOS in Software Partitioning for Multitasking Communication published in Dr. Dobb’s Journal. Later, when Hitachi changed to a DOS-extender (Phar Lap) I developed a macro library that enabled the ASM drivers to be automatically built in complementary real and extended forms from a single source.
At IDT I worked on a Linux touchscreen driver for Android. I did not design the basic driver, which was a common open source program, or the initial adaptation for our device. I did correct bugs and make general improvements in the program design. I did all of the work in an Ubuntu host system. The target was a “Beagle Board” (TI-OMAP) with 2.6 Linux kernel and Froyo and Gingerbread releases of Android.
To support my development of a new touch device at IDT, I designed and implemented a generic development kit. My real-time control unit, using an NXP LPC1342 (ARM M3) controller, communicates with a host computer via USB and with various embedded devices via I2C or SPI. In its simplest usage model, it is a USB to I2C/SPI bridge, but this functionality is implemented in a real-time operating system that I designed for broader usage. It is a simple round-robin task dispatcher and derives its real-time performance from real interrupts, DMA, and coprocessors associated with each task. The tasks are state machines with transitions on underlying events. A two-level state machine implements I2C communication. The upper level transitions on the completion of a transaction, while the lower performs transactions. SPI is only a single level because spin-locking to perform (or fail) a transaction consumes less CPU bandwidth than interrupting on completion.
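The following is a simplified sketch, with illustrative names rather than the actual firmware, of a round-robin dispatcher whose tasks are event-driven state machines of the kind described above.

```cpp
// Illustrative round-robin dispatcher with state-machine tasks; not the actual
// IDT firmware. Events are set from interrupt/DMA context and consumed here.
#include <cstdint>

struct Task {
    void (*step)(Task&);         // advance the task's state machine one transition
    volatile uint32_t events;    // event bits raised by ISRs or DMA completion
    uint32_t state;
};

static void i2c_task(Task& t) {
    // Upper level: transitions on completion of whole I2C transactions; the
    // lower-level transaction engine runs from the I2C interrupt.
    if (t.events & 1u) {
        t.events = t.events & ~1u;
        ++t.state;               // placeholder for "start the next transaction"
    }
}

static void spi_task(Task&) {
    // Single level: short transactions are spin-locked to completion because
    // that costs less CPU bandwidth than taking an interrupt on completion.
}

static Task tasks[] = { { i2c_task, 0, 0 }, { spi_task, 0, 0 } };

int main() {
    for (;;)                     // round-robin: every task gets a turn every pass
        for (Task& t : tasks)
            t.step(t);
}
```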
As part of my Abbott Instrument Development System, I designed a stepper motor control system, in which one MC68340 (Motorola CPU32-based micro-controller) and an FPGA (Altera EPF10K30) directly control as many as 20 synchronized stepper motors. I designed the complete system, including FPGA logic, but implemented only the firmware myself. In soft real-time (in a periodic time-triggered ISR) the CPU calculates step pattern (and power) changes for all active motors in a given period. It writes these in the form of a script into a RAM page in the FPGA. The FPGA then plays this in precise real-time. Two ping-pong pages ensure coherency. Motor control is efficiently distributed by a differential RS422 SPI scan chain, which I also designed. Actual motor control signal changes occur simultaneously on a broadside load signal.
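A hedged sketch of the ping-pong scheme, with purely illustrative sizes and register layout: each period the ISR fills the page the FPGA is not playing and then hands it over, so the FPGA always plays a coherent script.

```cpp
// Illustrative ping-pong page scheme; sizes, layout, and register names are
// hypothetical. The ISR fills the page the FPGA is not playing, then swaps.
#include <cstdint>

constexpr unsigned kPageWords = 256;
struct FpgaPage { volatile uint32_t script[kPageWords]; };

// In the real system these would map to dual-port RAM and a register in the FPGA.
static FpgaPage pageMem[2];
static volatile uint32_t activePageReg = 0;   // tells the FPGA which page to play
static unsigned fillIndex = 1;                // page currently being filled

void periodic_isr() {
    FpgaPage& p = pageMem[fillIndex];
    for (unsigned i = 0; i < kPageWords; ++i)
        p.script[i] = 0;          // step pattern and power for each motor/time slot
    activePageReg = fillIndex;    // FPGA switches pages only at a period boundary
    fillIndex ^= 1u;              // next period fills the other page
}
```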
To create the first consumer-friendly satellite television receiver I designed and programmed a set top box in which one M6801, an 8-bit controller, performs all functions, from hard real-time hardware control to sophisticated AI (Artificial Intelligence) to quickly find and identify satellites and radio carriers. To keep costs down, I implemented all major control systems using feedback instead of precision. I designed a real-time OS around the fixed timing constraints of the multiplexed LED displays. In each time slot, the display state machine advances and one or more other real-time functions are performed. Periodic real-time functions execute completely in their time slot. Those that are aperiodic or consume too much time for a slot are divided into state machines. The fixed utilization schedule simplifies timing analysis to guarantee critical real-time task completion. AI functions, such as finding satellites and sub-carriers, execute in the foreground. The controller’s tasks include:
I designed and implemented an in-circuit emulator for the Z8 micro-controller. The target, of course, was a Z8 but the emulator’s own controller was also a Z8 although my emulator also supported other targets. These two Z8s required completely different programs. I also designed and implemented a communication buffer/translator using the Z8’s I/O ports in a unique exercise of its port flexibility to directly interface to DRAMS. For both of these projects I wrote extensive Z8 ASM firmware with unique hardware/software interaction. See:
At Elo most of my work was programming host applications and drivers for USB touch screens in Windows XP and CE. For XP I worked on existing drivers. For CE I developed new drivers. These two operating systems are very different. The WDM (Windows Driver Model) in XP is layered and object-oriented, whereas CE has a traditional flat driver model more like Linux. One of my significant challenges was that not all of Elo’s USB concepts, developed in an XP environment, translated very well to CE. In particular, while in-band (touch) communication was HID-mouse, control communication was USB-vendor. In both XP and CE, the operating system takes ownership of a HID-mouse, preventing normal out-of-band (IoCtl) communication with it. Under XP, it is relatively easy to get the USB handle of a HID-mouse for this communication but under WinCE it is officially impossible. I couldn’t change the device firmware and had to devise a means for WinCE to duplicate this capability in my driver.
I did not develop USB device firmware at Elo and usually had no need for detailed protocol analysis. However, I occasionally worked on problems that required real-time USB transaction trace, for which I used a CATC analyzer. A long-standing unresolved problem was lost untouch. It occurred so rarely that no one had been able to determine whether it actually existed but the consequences could be life threatening. CE afforded greater control over USB communication timing than XP so I used it as the platform for a special driver I conceived to try to force the problem (that is if it actually existed and was due to a device firmware error). Using this driver with CATC I was able to prove that the problem did exist and was in the firmware of Elo’s most popular USB device.
Also at Elo I fixed blue-screen bugs in an existing USB-isochronous XP driver, eventually developing my own framework to prevent their increasing occurrence (experienced by everyone) as USB power modes became increasingly complex. I developed an original USB-HID driver for TabletPC (much like XP) to make Elo’s HID-mouse devices appear to be HID-digitizer-pen. This entailed two parts, a filter driver to intercept and modify the incoming HID report descriptor when the device attaches and a device driver to translate touch reports into equivalent pen reports. This was not a simple connectionless translation because touch and pen are functionally different.
At IDT I designed original hardware, firmware, and software for a USB device to develop and demonstrate my differential capacitive pointing device. I designed a generic architecture in order to support many of IDT’s other products. Most of these have I2C and SPI interfaces, which my prototype device also required. At its core, the device is a USB to I2C/SPI bridge based on an NXP LPC1342 (ARM M3) controller. It can be used for just this but I designed its firmware as a general-purpose real-time operating system that can do much more. I designed the lower-level (essentially device driver) host software for my pointer device and the device firmware in such a way that complete functions could move back-and-forth between the two. Thus, control algorithms, e.g. acceleration, developed in a Windows environment can be moved at any time to the device to improve real-time response and to advance toward a producible IC. For convenience during development, my device identifies itself as HID-vendor but it can switch to HID-mouse when ready.
The ASTEC satellite television receiver, for which I designed all digital hardware and firmware, used dedicated devices for RF tuning (PLL, VCO, polarizer), audio volume, and IR communication. Each of these had its own flavor of SPI. I added my own flavor in the form of serial shift registers that I used for scanning the local keyboard and driving LED displays. I designed (in Palasm) a PLD serializer with dynamic SPI protocol control to enable my MC6801 firmware to efficiently communicate with all of these devices.
For the Abbott instrument development system, I designed an SPI-based loop for distributed I/O, including step motor drivers. I used differential (RS422) transmitters and receivers for fast, reliable operation. Most of the MISO bits are used for actual inputs but the far ends of the MISO and MOSI are joined to enable a portion of the output to be read back by the controller as an integrity check. The controller, implemented in an FPGA, runs the chain continuously while constantly checking integrity. Dual-port RAM in the FPGA enables the CPU (mostly MC68340) to read and write I/O registers, simulating direct parallel I/O. The actual I/O devices include simple shift registers (74HC595/7), relatively simple I/O implemented in programmable devices (Verilog Serializer-Deserializer, Verilog Complex SPI Device), and complex coprocessors with flow-through serial interfaces.
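The following sketch, with illustrative sizes and a simulation stub standing in for the FPGA scan engine, shows the loop-back integrity idea: bytes shifted out at the start of a scan reappear at the tail of the returning MISO stream and can be compared against what was sent.

```cpp
// Illustrative scan-chain integrity check; chain length, echo length, and the
// scan engine are stand-ins. Because the far ends of MOSI and MISO are joined,
// the first bytes shifted out reappear at the tail of the returning stream.
#include <cstdint>
#include <cstring>

constexpr unsigned kChainBytes = 16;   // bytes in one full scan of the chain
constexpr unsigned kEchoBytes  = 2;    // output bytes that loop back for checking

// Simulation stub standing in for the FPGA scan engine: mimics a healthy chain.
static void spi_scan(const uint8_t* out, uint8_t* in, unsigned len) {
    std::memset(in, 0, len);                              // pretend inputs are all low
    std::memcpy(in + len - kEchoBytes, out, kEchoBytes);  // echoed output at the tail
}

bool scan_and_check(const uint8_t out[kChainBytes], uint8_t in[kChainBytes]) {
    spi_scan(out, in, kChainBytes);
    // Integrity holds if what came back at the tail matches what was sent first.
    return std::memcmp(out, in + kChainBytes - kEchoBytes, kEchoBytes) == 0;
}
```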
For my IDT USB Development/Demo Kit I designed a USB-I2C/SPI bridge using an NXP LPC1342 (ARM M3) controller. I designed a simple real-time operating system to handle application-specific tasks in parallel with the basic bridging function.
See my sample handling robot video.
There is no precise universally accepted definition of robotics but most engineers would agree that certain features tend to characterize a robotic system, the most obvious being complex mechanical movement, response to environmental events especially from multiple sensor types, and the ability to take actions that are not preprogrammed but formulated in real time based on some strategy.
Robotic machines are commonly thought to be more expensive but more adaptable than fixed-function machines. This is usually true for a repetitive task that can keep a fixed-function machine fully utilized. In this case all superfluous movement, strength, and control capabilities can be eliminated. At the other end of the spectrum are situations, for example an autonomous vehicle, that obviously demand actions based on real-time strategic analysis. But even in these cases, the robotic system operates on top of fixed-function sub-systems. Between the two extremes are situations with some requirements that might be addressed by a more or less robotic approach. In these situations, standard engineering principles rule. The best solution is the one that is most reliable and cheapest and that is not necessarily the fixed-function.
Reliable fixed-function machines require precision (repeatability). Precision is usually expensive even in solid-state domains like electronic circuits, but it is very expensive in moving mechanical systems. To eliminate the need for an expert to program geosynchronous satellite positions during the installation of Astec’s satellite television receiver, other engineers proposed various high-precision schemes, none of which could be realized for less than thousands of dollars. No human, regardless of expertise, is capable of such precision so I asked an expert installer to show me how he set up the positions. It was immediately apparent to me that I could write a program to effect the same strategy. This robotic solution cost virtually nothing. It afforded better performance than the precision fixed-function alternatives because it corrected for seasonal ground shift. The movement of the dish while searching for satellites revealed human-like thinking that fascinated observers and generated valuable publicity for the product (and a patent for me).
A robotic design philosophy doesn’t always reveal itself so obviously. The instrument development system that I created for Abbott was descended from a product line of much less flexible instruments. Each instrument was, in a sense, a fixed function comprising domains so tightly bound together that even minor functional changes required major system redesign. There was almost no sharing of even functionally similar subsystems throughout the product line. My new system design essentially transformed the entire product line into one highly adaptable instrument. This instrument did not adapt in real-time, which would reveal its true robotic character, because that would be too expensive.
The fixed-functionality of the older instruments was ingrained. Each subsystem, for example motor control and data acquisition, was realized as a unique entity with its own interface, functionality, and means of control as well as the essential hardware. If this tight coupling were used as part of a control locality strategy it could be of some value but, in fact, it did the opposite. For example, the motor controller was not a motion controller. Moving a motor to some position indicated by a sensor required a central CPU to poll the sensor and tell the motor controller when to stop. Distributing motor control capability instead of centralizing it could improve control locality but would be more expensive to implement and would not eliminate coupling but simply make it finer grained.
I repartitioned the subsystems, moving all interface and functionality, i.e. essentially the control API, into a unified abstract control language that all subsystems understand. An interpreter for this language can be realized in less than 20 Kbytes of program code. To effect the physical operations implied by this abstract language may require significantly more code and hardware but this can be distributed in any way appropriate to the specific system. Any subsystem can move a motor, but if it doesn’t locally control that motor, a message is automatically sent to the unit that does. Thus, the concept of control locality is ingrained even if the physical implementation does not actually provide this. Making it real will cost money but will not disrupt anything. The ability to mix virtual and real control locality makes every subsystem a fully capable and flexible robot without incurring the cost of a real implementation.
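A minimal sketch of the virtual control-locality idea, using hypothetical names and a plain in-memory routing table rather than the actual Abbott implementation: any subsystem can issue an abstract motor move, and the same call either executes locally or is forwarded to the owning unit.

```cpp
// Illustrative sketch of virtual control locality; class names and the routing
// table are hypothetical, not the actual Abbott implementation.
#include <cstdint>
#include <map>

struct MoveCmd { uint16_t motorId; int32_t steps; uint16_t rampProfile; };

class Subsystem {
public:
    explicit Subsystem(uint16_t id) : id_(id) {}

    // Any subsystem can move any motor through the same abstract call.
    void MoveMotor(const MoveCmd& cmd) {
        auto it = owners_.find(cmd.motorId);
        if (it == owners_.end())
            return;                        // unknown motor: report an error in practice
        if (it->second == id_)
            ExecuteLocally(cmd);           // real control locality
        else
            Forward(it->second, cmd);      // virtual locality: route to the owner
    }

    static void RegisterOwner(uint16_t motorId, uint16_t subsystemId) {
        owners_[motorId] = subsystemId;
    }

private:
    void ExecuteLocally(const MoveCmd&) { /* drive the locally controlled stepper */ }
    void Forward(uint16_t /*owner*/, const MoveCmd&) { /* send the abstract command message */ }

    uint16_t id_;
    static std::map<uint16_t, uint16_t> owners_;   // motor -> controlling subsystem
};

std::map<uint16_t, uint16_t> Subsystem::owners_;
```

Callers never change whether the motor is local or remote; making control locality physically real only changes which branch the routing takes.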
The benefits of this design become apparent with use. For example, the random sample handler requirement, unexpectedly imposed on a specific instrument in the middle of development, required much greater precision than anything seen in earlier instruments. The coordinated ramping of multiple steppers would normally require a dedicated machine or a sophisticated motion controller but my robotic mesh handles it as just another task. All operations have this level of precision. If data acquisition is supposed to begin as soon as a valve is fully open then it begins with less than a microsecond delay. If multiple actions are supposed to occur simultaneously, they really do.
For Windows applications I have used Visual Studio. For Windows drivers I have used SoftICE for local debugging and windbg for remote. SoftICE was always the best single-machine driver debugger but its remote debug was not reliable. It is a moot point now because XP is the last Windows version that would accept SoftICE. I have used WinCE Platform Builder to debug both applications and drivers (one of the very nice features of WinCE is that a single debugger works at both levels). I have used command-line GDB to debug local Linux applications and remotely through serial and Ethernet terminals to debug Linux drivers. I have used GDB as a plugin to Eclipse and in an emacs shell to debug applications. For quick testing I use GDB from a terminal window or emacs because they start up much faster than Eclipse. I have used Keil and Eclipse for ARM programs. I have used many CPU-specific proprietary debuggers for embedded controllers.
All of the GUI-based debuggers are part of an IDE. They are all essentially the same and easy to use. GDB is unique. It can be used transparently as a plugin or through its own API, which is harder to use but far more capable. GDB’s API affords almost unlimited flexibility, with extensive native data presentation and the ability to trigger target and host programs in response to target events. I have used some of this capability. All of these tools enable debugging without any preparation. Breakpoints can be set practically anywhere; we can step through code; variables’ changing values can be monitored. This not only makes them convenient for debugging when we generally know the root cause of a problem but also when we have no clue about the root. Unfortunately, not all problems can be addressed in this way.
Much of my work has been in areas where convenient intrusive debugging methods are not useful. I often work on kernel level code that cannot be stopped without crashing the system. Even where stopping wouldn’t be disastrous, such as with touch screen drivers, most of the hard problems are time-dependent. My non-real-time work has often involved big data analysis, where it is very difficult to find a relationship between functional problems and single-pass code execution. Consequently, I have been forced to develop alternative debugging techniques.
Whenever possible, I make test facilities a permanent part of my code. This affords several advantages. One is that not having to modify a program to try to find the root of a run-time problem avoids changing the nature of the problem, which often accompanies such changes. Another is that the hardest problems to fix are ones that occur infrequently. When these do occur, built-in facilities may help find the root cause while the system is still in a state caused by the problem. With big data analytics, there is no such thing as an “error-free” solution but just good enough to be useful. Frequently, new relationships are discovered during use. Built-in facilities can significantly accelerate this process.
IPC (Inter-Process Communication) bugs are often difficult to diagnose and I pay particular attention to built-in instrumenting of complex IPC mechanisms. I parameterize my buffers so that buffer size, message minimum and maximum sizes, and other characteristics can be easily changed, in some cases without even having to stop the program, which can make corner-case testing a routine regression test. In my Abbott instrument communication system a kernel-level ring buffer has one producer, a kernel-level driver, and multiple consumers, application threads and/or processes (see Multitasking). The driver effects content-based routing, alerting the appropriate consumers. To simplify corner-case testing, I made an IoCtl interface for a test application to inject messages through the same low level kernel mechanism as the driver itself. This enables a coordinated test attack, synchronizing multiple consumers with incoming messages.
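The following is a simplified user-space illustration, with hypothetical names (the real version is a kernel driver reached through an IoCtl), of a parameterized single-producer ring buffer whose test injection entry shares the same low-level path as the real producer.

```cpp
// Simplified user-space illustration; the real version is a kernel ring buffer
// with an IoCtl injection path. All names and sizes here are hypothetical.
#include <cstddef>
#include <cstdint>
#include <vector>

class MsgRing {
public:
    // Buffer size and message limits are parameters, not constants, so corner
    // cases (tiny buffers, maximum-size messages) become routine regression tests.
    MsgRing(std::size_t bufBytes, std::size_t maxMsg) : buf_(bufBytes), maxMsg_(maxMsg) {}

    bool Produce(const uint8_t* msg, std::size_t len) {        // real producer path
        if (len == 0 || len > maxMsg_ || len > Free())
            return false;
        for (std::size_t i = 0; i < len; ++i)
            buf_[(head_ + i) % buf_.size()] = msg[i];
        head_ = (head_ + len) % buf_.size();
        used_ += len;
        return true;
    }

    // Test hook: reached through an IoCtl in the real driver so that a test
    // application injects messages through the same low-level kernel mechanism.
    bool Inject(const uint8_t* msg, std::size_t len) { return Produce(msg, len); }

    std::size_t Free() const { return buf_.size() - used_; }

private:
    std::vector<uint8_t> buf_;
    std::size_t maxMsg_;
    std::size_t head_ = 0;
    std::size_t used_ = 0;
};
```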
Software error-induced “crash” is usually not a fatal condition but just an endless loop, either waiting for an event that never happens or in some unintended code. In the robotic mesh operating system that I designed for the Abbott instrument development system, each subsystem executes multiple simultaneous scripts written in my procedural language. A crash can occur in any one script, in the entire scripting system, or at several levels of the downloaded operating system or built-in bios. With each of 10 different crash levels I associate a response mechanism, a means of communicating the problem to humans, and a reset mechanism, which clears the crash state at that level without affecting lower levels. At all but the lowest OS level, complex host communication is working and is used to indicate the run-time location/activity at the time of crash and the program source file name and line number of the responsible code. At most of the bios crash/reset levels, a more primitive communication mechanism indicates current status and immediate history.
When I started work on the CD3200 for Abbott, I inherited a code base that was functioning well on older instruments. However, I noticed that it contained a half-dozen smoothing functions, all used in different circumstances. It seemed to me that they were all theoretically doing very much the same thing but whether the subtle differences were significant to the outcome was not clear. All of the functions worked according to design but superfluous differences might be interpreted as the root of a problem, delaying discovering and correcting the real problem. This is a classic big data problem, which can only be answered by presenting the data in a usable form for a human to analyze. I wrote a GUI (Win32) program specifically for testing these and other large data manipulation functions with easily varied parameters, such as number of iterations and number of polynomial terms in FIR filters. Test data can be actual instrument data or synthetic, such as a Gaussian or Poissonian distribution with varying levels of random and/or periodic noise. With this, the algorithm developers showed that most of the algorithms’ differences were irrelevant. Later, when the algorithm developers began to suspect that scatter plots and histograms used internally and then discarded might be at the root of poor results, I implemented a data trace facility in the normal instrument control program. With this, the intermediate data sets can be captured without interfering with normal operation and automatically reconstituted as plots off-line. This is essentially a big data logic analyzer.
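A sketch of the kind of synthetic test data mentioned above, with illustrative parameters: a Gaussian peak with adjustable periodic ripple and random noise gives the smoothing functions a known ground truth to be measured against.

```cpp
// Hypothetical generator for synthetic test data: a Gaussian peak plus
// adjustable periodic ripple and random noise, for exercising smoothing
// functions against a known ground truth.
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::vector<double> gaussian_with_noise(std::size_t n, double mean, double sigma,
                                        double amplitude, double noiseSd,
                                        double rippleAmp, double ripplePeriod) {
    constexpr double kPi = 3.14159265358979323846;
    std::mt19937 rng(12345);                    // fixed seed keeps tests repeatable
    std::normal_distribution<double> noise(0.0, noiseSd);
    std::vector<double> y(n);
    for (std::size_t i = 0; i < n; ++i) {
        double x = static_cast<double>(i);
        double dev = (x - mean) / sigma;
        double peak = amplitude * std::exp(-0.5 * dev * dev);
        double ripple = rippleAmp * std::sin(2.0 * kPi * x / ripplePeriod);
        y[i] = peak + ripple + noise(rng);
    }
    return y;
}
```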
In order for the miniature touch pad that I developed at IDT to correctly interpret flicks I had to develop unprecedented algorithms for position information that is widely assumed to be unusable. This is a big data problem. I built into my demo/development program (Win32 GUI) a recorder to capture the touch position and strength data stream of various events. I wrote an AWK script to display this as time-based trip plots using OpenOffice Calc. From these plots I was able to see useful patterns, which would not have been evident from looking at the data as a stream of numbers. See my Touch Flick Algorithm Development.
Sometimes the only means available to debug a problem is “burn and crash”, a reference to the early days of embedded programming when a program would be “burned” into an EPROM and then the system executing this would crash, hopefully revealing something about the problem. At Elo, I was responsible for all WinCE touchscreen drivers and support applications. We supported x86, ARM, and two versions of MIPS but I had a test system only for x86. My USB driver had been deployed without incident on customers’ x86 and ARM systems but crashed for a customer using MIPS. I was located in California and the customer in New Zealand. I devised a “burn and crash” debug scenario with a series of test drivers to, in effect, do a binary search to ferret out the problem. None of these were intended to fix the problem but to reveal its location. In less than two hours the problem was revealed and fixed. The customer never did understand that I was not sending him programs to fix the problem but to find it and, after testing each one, would say “that one also didn’t work”. In fact, each one worked exactly as I intended.
FDA, ISO, and all organizations trying to promote good design practices require certain specific documents as a sanity check. No one thinks these alone adequately capture the design of a product, in the same way that an audit checks a very small sampling of procedures in the hope that they are indicative of the rest. Some companies have a documentation specialist whose main job is to essentially lie in order to fill in the blanks in a required document or to pass an audit. This is like cheating on your homework. If a company has a good process and properly documents it, producing required documents and passing audits should be trivial.
An engineering focus on specific required documents can do more harm than good. Required documents have to be in a form that can be examined and evaluated. They are by nature disconnected from other documents and from their own history. If they drive the design process, as is often the case for medical products, they either force an ineffective waterfall model, require constant updating, or they become disconnected from reality. The only required documents that resemble real design are the Design History File and traceability matrix. Design is chaotic. Unexpected connections happen. Old assumptions are proven wrong. Circumstances change. Refusing to accept this reality is a recipe for failure. We know that it is easier to program in a language suited to the application than to force fit the wrong language. We should realize that this applies to design documentation as well. The only “language” that reflects the true nature of design is a hyperlinked web that captures the rationale for every decision both currently and in the past.
A single hyperlinked infinitely expanding web containing the rationale for every decision from the inception of a project is the only documenting system relevant to engineering. It is a living document, which guides and helps us do our job properly. No other requirements should be imposed on it, but our promise to document everything is essentially a contract with all other document systems. Any document can be derived from this. That derivation is not a job for design engineering but for QA. Design History and Traceability are the only documents that are not derived; they are the project web.
This can be done. I did it at Abbott. For example, see Design History Document. I did most of the actual writing but others saw the benefits and also wrote documents, which I linked into the system. Equally important was the response of non-documenters. For example, in one weekly meeting an engineer pointed out an error on page 27 of the previous week’s minutes. For an engineer to read 27 pages of meeting minutes he must be finding real value in them. In another meeting an engineer wanted to discuss an important but obscure point and identified it as item 2.4.7.13.9 in a particular web page. We all laughed but then nodded our heads in recognition that it was uncommonly good to be able to so succinctly get everyone on the same page.
As I discuss in Design Principles I think that both waterfall and agile have useful suggestions and I have used many of them. However, I don’t think that either one alone is realistic. Agile makes no sense without an overall strategy, which is a waterfall concept, and waterfall fails if the organization insists that all decisions at each level are perfect and cannot ever be challenged. The Rationale model is the only one that I know to be right for all circumstances. This is essentially the IBIS model originally proposed by Horst Rittel although, for me, it evolved from practice rather than theory. It doesn’t contradict either waterfall or agile but deals with an issue that both of them handle poorly, which is how to rationally revisit decisions when circumstances change.
In many companies there is strong institutional pressure to do everything faster and cheaper and little recognition of the fact that maintaining a program is more costly than its initial development. This puts a lot of pressure on programmers to meet immediate requirements by cut-and-paste and to “clean up” by obscuring their mess behind a supposedly object-oriented facade. This imparts a linear complexity to maintenance. Every change requires the same effort. In the long term it is better to analyze the requirements without being constrained by a presumed language and tool set and then choose from the tools available. This increases the time needed to meet the first in a series of related requirements or changes but only if the simple solution is right the first time. Often it isn’t and with each iteration, the complexity of the supposedly simple solution approaches and sometimes exceeds that of the more analytical approach. In any case, a good analytical solution usually pays for itself on the second similar requirement. This is analogous to the difference between a bubble sort, which may be the fastest when sorting a few items because its overhead is small, and quicksort or heapsort, which are much faster on even slightly larger sets.
The purpose of analysis is to devise a higher level design that is more efficient and flexible than randomly mapping each requirement to some piece of code. The ideal outcome, unfortunately rarely achieved, is a pure algorithm operating on virtually infinite data sets. This affords infinite polymorphism. Nearly as useful and much easier to develop is table-driven polymorphism, where generic control flow is specialized by enumerated data organized in a table for efficient lookup. This is a very efficient form of object-oriented programming. Each row represents an instance of a class whose members are defined by the columns. If a problem can’t be mapped to a pure algorithm or table-driven polymorphism a broader object-oriented analysis can be applied.
I have developed a few pure algorithm solutions, for example the multi-term displacement acceleration computation for my differential capacitive touch device. Some of these terms are essentially measures of user frustration, a concept that cannot be realized by simple control flow. This particular algorithm has deterministic execution time, which is particularly valuable in this real-time control application. The time effect of any new term is predictable.
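The actual IDT terms and weights are not reproduced here; the following generic sketch only shows the shape of such a computation: several independently weighted terms, one of them a crude frustration-like measure that builds and decays over time, combined in a fixed sequence so that execution time is deterministic.

```cpp
// Generic multi-term gain computation; the terms, weights, and "frustration"
// proxy are hypothetical, not the IDT algorithm. Every call performs the same
// operations, so execution time is deterministic.
#include <algorithm>

struct AccelState { double frustration = 0.0; };   // persists across samples

double accelerate(double rawDelta, double speed, AccelState& s) {
    const double kBase       = 1.0;   // illustrative weights
    const double kSpeedGain  = 0.4;
    const double kFrustrGain = 0.2;

    // Crude frustration proxy: sustained fast motion builds the term up; it
    // decays as soon as the motion slows.
    s.frustration = std::clamp(s.frustration + (speed > 1.0 ? 0.05 : -0.1), 0.0, 1.0);

    double gain = kBase + kSpeedGain * speed + kFrustrGain * s.frustration;
    return rawDelta * gain;
}
```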
I routinely design table-driven polymorphic programs. Sometimes, these are relatively small pieces of a large design and their superiority over standard control flow is not immediately obvious. In some cases, the advantages are striking. For example, I wrote an x86 assembly language driver for my table-driven instrument communication system for the Hitachi 747, yet an application-level programmer with no knowledge of assembly language found an error in my code by recognizing an anomalous control data pattern in the table. See my Dr. Dobb’s Journal article. For the Abbott CD3200 we inherited a complex control flow solution to a big data analytical problem. I transformed this 2000-statement, 16-level deep sequence of nested functions into one statement with a small table, drastically reducing the complexity not only of the current solution but also of any future changes.
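A generic illustration of the table-driven form described above, not drawn from any particular project: each row of the table is an instance, the columns are its members, and one small piece of generic control flow is specialized entirely by the data it looks up.

```cpp
// Generic example of table-driven polymorphism: the table is the "class", each
// row is an instance, and the control flow never branches on type.
#include <cstdint>
#include <cstdio>

struct ChannelDesc {                          // members are the table's columns
    uint16_t id;
    double   scale;
    double   offset;
    void   (*report)(double engineeringValue);
};

static void reportTemperature(double v) { std::printf("temperature %.2f C\n", v); }
static void reportPressure(double v)    { std::printf("pressure %.1f kPa\n", v); }

static const ChannelDesc channels[] = {       // each row is an instance
    { 1, 0.01, -40.0, reportTemperature },
    { 2, 0.25,   0.0, reportPressure    },
};

// One small piece of generic control flow, specialized entirely by table data.
void processSample(uint16_t id, uint32_t raw) {
    for (const ChannelDesc& c : channels)
        if (c.id == id)
            c.report(raw * c.scale + c.offset);
}
```

Adding a new channel type is a new row and, at most, a new report function; the control flow never changes.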
An important principle in object-oriented analysis is to concentrate on the needs of the application rather than the components that might be used to meet those needs. For my Elo WinCE touchscreen drivers, I determined that the link type was not a fundamental issue but an easily replaced component, especially after I had defined generically what was expected of any physical link. I also did this with my instrument communication system for Abbott. Where most applications would contain significant link-specific code, any programs wanting to talk to an instrument in my system invoke generic communication class methods provided by my DLL (Dynamic Link Library). An application can suggest physical link details but if a different link has already been established by another program or the suggested details are wrong, the application is simply given access on the functioning link. Applications never have to know the physical link they are using and don’t have to be modified for new types. Initially, two physical link types were supported, a proprietary HDLC and ECP (bi-directional parallel port with DMA). However, a direct DMA connection (in addition to Ethernet and USB) was anticipated for very high performance systems and it was important that the communication system not impose the overhead needed for external communication but not needed for direct DMA. Therefore, I entirely bypassed the standard IoCtl interfaces (file read/write and DeviceIoControl) implementing instead an application-kernel shared ring buffer with less than 1% of the standard overhead.
No matter how well a program anticipates future needs, it is unlikely to provide a complete solution for significantly new requirements. It is important to thoroughly document every design decision to provide guidance. Otherwise, it will be very difficult to determine whether a change can be made without breaking the existing design, whether apparent errors are oversights or motivated by obscure information, whether a relevant detail is predicated on circumstances that have changed, and many other questions. I address this with my Rationale development model, in which a hyperlinked web affords a traceability matrix with 100% coverage of all decisions from the beginning of a project.
It is important to identify and stay focused on the unique thing that we want to deliver. Many engineers don’t do this, instead building around some incidental feature, such as USB, user interface, or a specific language. That approach creates systems that are hard to maintain and cannot adapt to changing conditions.
I partition systems into a core, which is the unique product, and supporting domains with an explicit API derived from the needs of the core rather than any specific technology that might be used to realize that domain. This is a form of strength reduction, which simplifies maintenance by clearly stating both what is needed and what is not needed.
Even if it seems unlikely that the technology initially chosen to realize a supporting domain would be replaced by another, this partitioning provides intrinsic maintenance guidance. New requirements and most error corrections are obviously in either the core or the supporting domain. We immediately know where to put our efforts and we don’t need to disturb other domains.
The domain API approach also provides guidance in determining the effort that would be required to adopt a replacement or supplemental technology, for example, to expand the market for a USB-connected product into a Bluetooth environment. In a USB-centric design, the entire design will be infused with unique USB characteristics, making its replacement without a complete redesign nearly impossible. To avoid this, companies will employ tunneling, for example wireless USB. But this is an inefficient and often single-source patch on a fundamentally flawed design.
Domain API design always reduces the effort and cost of system maintenance but usually requires some extra initial effort. For one thing, not everyone can do it successfully. Done poorly, it will just create unnecessary complexity. In some cases, for example inherently remote communication, the level of genericity is clear. In others, it can be a very difficult architectural decision. In my Abbott instrument communication domain design, supporting an existing proprietary HDLC transport was a definite requirement. Several other transport mechanisms were anticipated but all were some form of remote communication. Only real DMA (or its equivalent multi-core or co-processor) was different and it was not a requirement. However, I calculated that the amount of data streaming from anticipated instruments would exceed the bandwidth of Windows’ application-kernel interface mechanisms, irrespective of the transport means. Either a radically new architecture would be needed for those instruments or my communication API would have to break this barrier, which is what my remote DMA design does. Without doubt, this is far more efficient for the remote transports but, as Amdahl’s Law teaches, those are so slow that this improvement is irrelevant for them.
If one of a system’s sub-domains can be realized by multiple technologies but is not designed as an API architecture, it is difficult to keep the various flavors synchronized. For example, at Elo, I inherited a WinCE touchscreen product line with separate serial (UART) and USB drivers. OEM customers often used both, sometimes even in the same system, based on physical or legacy requirements. The serial and USB versions had arbitrary functional and API differences unrelated to the communication means. The transport means that drove the two designs had infused throughout the drivers so that an improvement in one could not be easily duplicated in the other. I couldn’t simply start over because our OEM customers had hundreds of products and millions of units depending on the drivers’ actual characteristics whether stated in our manuals or not. I combined all documented and actual (determined by experiment and code analysis) characteristics of the two drivers into a superset requirements list. I then designed a new unified driver to this specification (plus improvements) with an internal transport API and structure predicated on the needs of the touchscreen. After this, any changes not specific to the transport inherently accrued to any flavor of the driver.
Of course that depends on who I’m trying to collaborate with. People differ. I try to accommodate those differences without sacrificing my own integrity. There are many ways to coordinate group effort but none of them should be allowed to abandon truth, logic, and reason. I oppose intellectual tyranny, whether by a group or by chain-of-command authority. However, I get a lot of pleasure from working with others and enjoy compromising, even if it means accepting something that I think is not necessarily the best solution. If I always insist on the best solution (I don’t necessarily mean my own) people can feel shut out of the process. Most of my work has been in routine collaborations. A few unusual instances are notable, not because they represent the usual situation but because they reveal my collaborative character.
At Abbott our project was assigned a technician who had a reputation for being unmanageable. If we were to come to the same opinion, she would be fired. I asked her to implement a complex wiring diagram that I had designed for a new machine. She was done in a few days. When I inspected the machine I saw that she had deliberately changed my design. When I asked her why, she replied that her design was better. I asked her to show me. She did. She was right. While I don’t believe that my authority makes me right, I would like to at least be consulted, but I didn’t say this to her. Instead I said, “You are right. It is better. Thank you for taking the initiative”. If she hadn’t been right she would have had Hell to pay, but good work is more important than authority. Even though it was clear that there would be no negative consequences, she never again changed anything that I gave her to do without discussing the issues with me and accepting the collaborative decision. She became an extraordinary asset to the project, solving problems that eluded every one else, including me.
I was hired by Elo because I was so deep into Windows device drivers that I could do things their own experts couldn’t do. They needed me to talk to computers and would not have cared if I had been unable to talk to people, but I told my manager that I liked customers and would welcome any opportunity to work with them on technical problems. A month later he said, “You did say you like working with customers— how about a fire-breathing dragon?” This was a very important and very angry customer. I told him that angry customers are my favorite kind. They are usually angry for a reason, which means there is a real problem to solve, and they have an urgent need, which means that my solution will not be wasted. In our first teleconference with the customer and Elo representatives from around the country, the customer was clearly unhappy and talked quite a bit before anyone else said a word. By the time I was introduced I had already learned a lot about the situation. I said that it appeared to me that the problem was specific to XPe (embedded); that I was an XP device driver expert but had no experience with XPe; and that I was dedicated to fixing his problem but needed his help because he knew more about XPe than I did. For the rest of the week he and I collaborated by email. The problem was difficult but finding its root was only the beginning. The system was a medical product that had already been approved and deployed. There could be no perfect corrective action but we converged on one that would work. We had explored the alternatives and were satisfied that we had not overlooked a better one. At the start of our wrap-up teleconference, the customer immediately began talking. He said, “First, let me say to David that working with you was a great pleasure and I learned a lot from you”. I replied that I had intended to say the same thing to him. He went on to tell the rest of the participants that he could get cheaper product from Elo’s competitors but he could not get that kind of collaboration from them.
The more competent someone is in their own job, the more they like working with me. The ones at the other end of the spectrum don’t like working with me. For example, I had just joined a group that was exploring ways to improve a complex machine. Before anyone told me anything about the machine, I had examined its system design and predicted that it would have control timing problems. The basic architecture was fundamentally flawed and could not be corrected by any sub-system change. In my first meeting with the group, they identified control timing as the most important problem and proposed correcting it by replacing all internal communication with isochronous USB. I told them that isochronous cannot be used for control because it sacrifices guaranteed delivery for timeliness. When they protested, I suggested that they read the USB Implementers Forum description of isochronous, which I was essentially quoting. Despite the fact that someone did this and verified what I had said, the group officially reported “On the issue of isochronous USB, we agree to disagree, but we all agree that David is difficult to work with.”
I was working as a programmer at one company when the software manager asked me to “take over” a project, that is to forcibly take it away from the programmer who had designed and maintained it for more than a year, apparently to the satisfaction of everyone. However, now the program was not meeting specific criteria for an imminent (one week away) customer demonstration and the programmer seemed unwilling to listen to suggestions. I told my boss that I might have to completely redesign the program, which could easily take months. I suggested that an alternative was to try to discover the root of the collaboration impasse while investigating the specific performance problems. A sprint involving the stakeholders would be the best way to do this while at the same time providing me with the background to take over should the impasse remain unresolved. My boss was sure that the impasse would remain but agreed that the sprint would help me. I directed a one-week sprint, at the end of which all major problems had been fixed, the programmer had learned how to collaborate, and the demonstration was back on schedule. After this incident my boss began describing me as uncooperative and not a good programmer.
For much of my career I have been a consultant. When an assignment ended, unless the client had real work for me (some have wanted to keep paying me just to know that I would be available) I would leave. Before making myself available for another job I would study something in depth. Engineers are expected to be continually learning but I have often seen others learning on the job and making an awful mess while not really learning very much. In contrast, to learn C programming I spent three months on K+R, not because I am such a slow learner but because I did every exercise in every way possible on multiple computers, operating systems, and compilers in order to benchmark everything. I spent three months learning C++ in the same way. Obviously I could not have done this on the job, yet all of my work has benefited from my deep understanding of these languages.
I have several times worked as an employee. I have never quit a job but one company folded and others had general or division layoffs, leaving me to choose whether I would immediately get another job or continue my sabbatical habit. I have usually done the latter. The result is that I have unusually broad and deep theoretical knowledge as well as practical experience leveraging this knowledge. I have also produced some useful freeware such as Dataman, a big data analytics tool for flow cytometry, which probably would not have been commercially viable because of the small market. See Dataman screencast
I am aware intellectually that our brain works best if periodically allowed to wander but I focus very intently on the problem at hand and almost can’t let go. Most people seem to be able to escape by taking a walk. For me that is just an opportunity to focus even more intently on my work. Quite by accident I discovered the cure. I had learned to ride the unicycle when I was a teenager but I was never particularly good. I stopped riding when I went to college. Many years later a colleague at Elo had gotten interested in unicycles and I volunteered to demonstrate. I could still ride, as poorly as before. But when I stopped, I realized that it was the first time in years that I was not trying to solve any problem other than staying upright. I am now addicted to extreme riding: backwards, up and down hills, through rutted fields, through impossibly tight obstacle courses.
See me riding on youtube.com/OneFreeBrain/Short Spin