\documentstyle[11pt,a4wide]{article}
\def\upto{\mathbin{\ldotp\ldotp}}
\def\aside#1{\noindent {\it[#1]}}
\def\EG{EuroGam}\def\VXW{VxWorks}
\def\signed#1#2{{\unskip\nobreak\hfil\penalty50
  \hskip2em\hbox{}\nobreak\hfil#1\quad#2
  \parfillskip=0pt \finalhyphendemerits=0 \par}}       % TeXBook, p106

\title{Diagnostic Procedures for EuroGam}
\author{David Brightly and Patrick Coleman-Smith}
\date{Edition 1.0\\January 1991}
\begin{document}
\begin{titlepage}
% a front or cover page for EG docs suitable for inclusion inside
% a titlepage environment in a LaTeX document
% for use with Plain TeX uncomment the \nopagenumbers and the \eject

\hoffset=0in
\hsize=6.25in
%\hoffset=.5in
%\hsize=5.25in
\vsize=10.25in
%---gives left and right margins of 1.5in
\font\small=cmssbx10 scaled \magstep2
\font\medium=cmssbx10 scaled \magstep3
\font\large=cmssbx10 scaled \magstep4
%\nopagenumbers
\hrule height 0pt
\parindent=0pt
%\parskip=0pt
\large
\rightline{EDOC???}
\vskip .5in
\large
EUROGAM PROJECT\par
\vskip 1.5in
\hrule height 2pt
\vskip 20pt
\large
NSF DATA ACQUISITION SYSTEM\par
\vskip .5in
\baselineskip 25pt
Diagnostic Procedures For \EG\par
\vskip 20pt
\hrule height 2pt
\vskip 1in
\medium
Edition 1.0\par
\vskip 5pt
January 1991\par
\vfill
\medium
NSF Software Systems Group\par
\vskip 5pt
UK Science and Engineering Research Council\par
\vskip 5pt
Daresbury Laboratory\par
\vskip .5in
%\eject
\end{titlepage}

\maketitle
%\noindent [This is a first draft of the complete diagnostics paper.
%I'd appreciate your comments. I'd like to arrange a meeting early
%next week just to go over it together before publishing it.
%Let me know if you can't make next Monday or Tuesday (14th and 15th 
%January).
%Author's remarks and asides are in square brackets.
%\signed{D.B.}{Thursday 10 January 1991]}

\section{Introduction}
This paper is the result of discussion between
John Alexander,
Mike Bentley,
Dave Brightly,
Patrick Coleman-Smith, and
Ian Lazarus.
We started with the question ``what do we, i.e., the support
staff, do when an experimenter complains that the \EG\ system is
refusing to acquire data?''
Three main avenues of approach emerged:

Firstly, we should investigate the feasibility of a ``system
confidence test'' of broad coverage.
By this we mean an automatic procedure designed to demonstrate
that the \EG\ system hardware and software can be put in a working state.
This procedure would be at liberty to displace any or all of the
user's current setup. 
It would exploit the Ge and BGO test pulse generators,
the trigger unit driven ``synchro-test'' VXI line, and
relatively lax timing windows to simulate a controlled set of
physics events and check that they are correctly processed
by all system components as far as data storage.
It would require no physical movement of connectors,
no user intervention, and be completed in perhaps 1--5 minutes.
If the confidence test succeeded we would conclude that the system
was in working order and that any problem lay with some aspect of
the user's setup, detectors, target, beam etc.
Were the confidence test to fail we would conclude that indeed some
fault existed and would be obliged to adopt some methodical
procedure for tracking it down.

Our second avenue of approach is to investigate the feasibility of
quasi-automatic fault identification procedures.
The object here is to isolate the fault within a single replaceable
unit by means of a sequence of tests which extends the set of
components known to be working until a fault is observed.
Though we cannot guarantee to locate every exotic fault that might occur
we believe that given the diagnostic mechanisms currently envisaged
for \EG\ modules, together with some further ones which we
have identified, we can provide a useful set of fault-finding tools.
We would hope that these can be made sufficiently intelligible
to experimenters that they can use them to diagnose problems
occurring out of normal office hours without expert assistance.
We expect that many of these tools will evolve from stand-alone
module commissioning test programs.

Our third line of attack is to provide tools for checking a user's
setup.
The \EG\ hardware contains many user-settable parameters which
critically affect data acquisition and we appreciate that a tyro
user may well overlook or miss-define one or more of these.
We therefore propose to investigate the possibility of tools
for inspecting the user's setup (by interrogating the modules themselves),
producing summaries, highlighting potential errors, and otherwise 
aiding the user in getting his setup right.

These ideas are examined in more detail in the subsequent sections.

\section{System confidence testing}
The object here is to demonstrate convincingly that the \EG\ data
acquisition system is in working order, assuming that this {\em is\/}
the case.
In contrast to the tests described in the next section we make
no preliminary checks of basic functions at all.
Instead, we establish a viable setup configuration in the frontend
crates and use the frontend cards' test modes to generate and process
a predetermined sequence of pseudo-events.
We can check for correct operation of the system by inspecting the
stored EbyE data and the resulting singles and sorted histograms.
These checks must be of the form:
\begin{itemize}
\item expect a peak of $n$ counts within the region $[c\upto c']$ of 
      spectrum $s$;
\item expect events with sequence numbers $n_1\upto n_2$ to consist 
      of channels $a_1,\ldots,a_k$ with conversions in the ranges
      $[c_1\upto {c_1}'],\ldots,[c_k\upto {c_k}']$.
\end{itemize}
Any discrepancies can be reported and can be of use to an expert in
diagnosing the problem.

This scheme can be implemented by executing scripts in the user interface
that invoke standard data acquisition functions to do the setting up
(setting timing windows for example) and one or more special tools
that check the resulting data and spectra.
Remember that we are assuming that everything really is working properly
so we can use any of the facilities of the system that we choose.
Any error encountered while setting up (inability to communicate with a
frontend crate, for instance) must terminate the confidence test
prematurely.
Again the error report associated with this failure is valuable
diagnostic input to the expert who then has to sort out the problem.

The scheme will require software support in the Resource Manager
controlling the Trigger Unit to cause the TU to pulse the 
``synchro-test'' VXI line at a given frequency for a given number of times.
Some sort of synchronisation with user interface scripts is also
desirable so that different phases of the confidence test (setting 
different pulser levels, for example) can be properly sequenced.
This might be achieved through the putative exception handling
mechanism.
Alternatively, simple timed delays and status polling of the frontend
crates may be adequate, though this will extend the duration of the
confidence test somewhat (and is inherently unattractive anyway).

It's worth noting what aspects the confidence test does not exercise:
\begin{itemize}
\item system behaviour under realistic conditions of pulse shape,
height, and time distribution;
\item tight timing windows that might be set using a scope;
\item arbitrary user-defined trigger generation logic;
\item arbitrary user-defined EbyE sorting operations;
\item beam on target, detector operation, detector--frontend cabling.
\end{itemize}
Most of these aspects can only be realised with beam on target or with
a radioactive source.
Further confidence tests designed to run under these conditions are
conceivable, but since one has little control over the system inputs
in these circumstances one can do few strict numerical tests on the
resulting outputs.
We note that sources can produce events of multiplicity at most two.


\section{Fault-finding diagnostic tools}
This section examines the problem of identifying a fault once it is
obvious that something is wrong somewhere.
The idea is to perform a sequence of tests on isolated components
until a malfunction appears.
In contrast with the system confidence test we start by assuming that
{\em nothing\/} is working properly.
Only when a component has been shown to be working can we make
use of it in a subsequent test.
Ideally we would like to construct a test sequence that could be
invoked from a user interface with no human intervention (to change
cabling, for example).
Hence we restrict ourselves to what can be achieved with builtin
hardware test features and appropriate software to both set up
the test and check the results.
Procedures that require recabling, attaching probes, exchanging
modules, etc., are the domain of the expert.
We envisage that the software to perform these tests is permanently
loaded in the appropriate crates and invocable by remote procedure
call (RPC).

Our strategy is to gradually build confidence in parts of the system
until a fault is found, using already tested components in exercising
others.
Since the \EG\ system is essentially a pipeline for EbyE data we start
at the downstream end (data storage) and work upstream.
The rest of this section discusses possible test procedures for
each component in turn.

\subsection{Basic crate operation}
Before we can test individual modules we must be confident that
each crate provides some minimal level of functionality.
All crates (VXI and VME) have a processor card (invariably an MVME147 
running \VXW)
which provides ethernet communications with the rest of the system.
To avoid confusion we will refer to this as the ``crate manager'' (CM)
processor. 
Of course, in the case of a VXI crate the CM processor is embedded in
the slot 0 Resource Manager card.
\begin{description}
\item [CM processor] 
If \VXW\ boots to a state capable of responding
to a null RPC we can assume the 147 card and ethernet communications
are satisfactory.

\item [VMEbus] 
We will assume that each crate runs module initialisation
and test code following the \VXW\ boot.
For VXI crates the standard requires that the presence of modules
be sensed and the modules be properly initialised and brought into
operation.
This will certainly exercise the VMEbus and 
if this completes successfully we will assume that the 
VMEbus is operational.
For VME-only crates we would expect to do some elementary module test
and initialisation if only to be confident that the anticipated
modules were indeed present.
Some means of inquiring (by RPC) the success or failure of the
initialisation code is needed. 
We will defer testing of the VMEbus interrupt system until it is required
in a test for a specific module that can raise an interrupt.
%\aside{to what extent can VMEbus faults (eg latched interrupts?) 
%affect normal operation of the 147?}

\item [VXIbus] 
It's difficult to see what can be usefully done here without extra
hardware.
One possibility is to use the RM's capability of inspecting and asserting
designated \EG\ VXI lines to make simple checks for continuity and
held lines.

\item [Power] 
The power rail state for each crate can be checked 
by means of the environment monitoring subsystem.
This should therefore be the first crate checked out!

\end{description}

\subsection{Data Storage}
\begin{description}
\item [Basic operations] 
These can be checked out by simple file 
write/read/compare programs running in the user interface.
\item [Redirection] 
For verifying that upstream components of the EbyE
pipeline are correctly forwarding data it's useful to be able
to direct EbyE data to disc files rather than to tape. 
This eliminates the human intervention required to make sure
tapes are mounted and is much quicker when accessing the small files
diagnostic tests will generate.
\item [Upstream EbyE interface] 
Testing the EbyE data interface into
the storage server requires the selftest feature in the sort unit
described below.
\end{description}

\subsection{Sort Unit (SU)}
\begin{description}
\item [Histogram memory] 
A conventional memory test
\item [Slave processors] 
A conventional processor test for each processor
organised by the crate manager (CM) processor
\item [Interface modules]
Appropriate standalone tests (e.g., of buffer memory) can be
performed by the CM processor.
\item [Selftest feature]  
A means is required of injecting known test
EbyE data at the frontend of the SU. 
One way of doing this is to download a sequence of one or more ersatz
events into the SU and trigger the EbyE pipeline from the SU downstream
to process these events a given number of times as if received from
the Event Builder.
This can be used to exercise the interfaces between the Sort Unit and
Storage Unit.
With a suitable sort program loaded in the SU checks can be made
on its basic sorting capability by inspecting the histograms and
EbyE data files resulting from processing the downloaded events.
\item [Downstream EbyE interface] 
This can be exercised using the SU's
selftest feature or by providing an analogous test for the downstream 
interface alone, using downloaded ersatz events.
\item [Snapshot feature] 
A means of capturing a sample of the EbyE
data entering the SU is valuable.
This might comprise a circular buffer which always holds the last $n$
events or, better, a buffer which captures the first $n$ events after
being explicitly armed and triggered.
Users  can exploit such a feature themselves to check the quality of
their EbyE data in a more direct way than by sorting it since it can
operate in parallel with writing tapes and saves having to inspect
the tape files themselves.
\item [Upstream EbyE interface] 
A test for this requires the Event Builder
selftest feature described below.
\end{description}

\subsection{Event Builder (EB)}
The EB is functionally similar to the Sort Unit and has the corresponding
test requirements:
\begin{description}
\item [Slave processors] As for SU.
\item [Interface modules] As for SU.
\item [Selftest feature] As for SU.
\item [Downstream EbyE interface] As for SU.
\item [Snapshot feature] As for SU.
This feature of the EB is especially important as tests of much of the
upstream hardware rely on inspecting EbyE data arriving at the EB.
A separate implementation within a specific test program within the
EB is desirable.
\item [Upstream EbyE interface] 
A test for this requires the ROCO 
selftest feature described below.
\end{description}

\subsection{Histogrammer}
\begin{description}
\item [Histogramming memory] 
A conventional memory test.
\item [Lookup table] 
A data retention test.
\item [Selftest feature] 
The histogrammer supports a test mode in which
a data word supplied over the VMEbus is processed as if captured from
the DT32bus.
A test exploiting this can be run by the histogrammer's CM processor.
\item [DT32bus interface] 
This can be exercised by means of the 
the DT32bus test outlined below and observing where histogramming memory 
is incremented.
\end{description}

\subsection{DT32bus}
Basic DT32bus operation can be tested by loading known data into a
selected ROCO's FIFO and then enabling the Event Builder. 
The data should appear in the EB and can be inspected
using the snapshot mechanism.
The unselected ROCOs really need to be placed in a ``bypass'' mode
to prevent them from placing data they may have in their FIFOs onto the
bus.
It's the nature of the DT32bus, with signals regenerated at each station,
that fault diagnosis is inherently limited if we do not allow ourselves
the liberty of recabling the bus and exchanging modules.
For example, a failure to receive data at the Event Builder could
be caused by a failure of any of the stations on the bus to propagate
the bus signals correctly.
This behaviour could also suggest a fault in the EB's upstream data
interface.
The Histogramming unit could be used to check independently for bus
data cycles.

\subsection{Readout Controller}
\begin{description}
\item [Selftest feature] 
The ROCO supports a test mode in which data may
be loaded into its FIFO and subsequently transmitted over the DT32bus.
This feature can be used to exercise FIFO emptying and DT32bus interfacing.
The data must be checked when it arrives in the Event Builder, so a
working EB has to be assumed.
\item [VXI Readout] 
Each ROCO drives a readout daisy-chain running
through the set of Ge and BGO (and Trigger) cards in a single crate.
Readout can be exercised by disabling all but one card, loading the FIFO
of the selected card with data, and triggering the ROCO to start the
readout process.
The expected result is a subevent captured in the Event Builder.
Each Ge and BGO card in the daisy-chain can be tried in turn starting
with the card nearest the ROCO.
Possible faults are an empty or corrupt subevent and timeout of the
readout process detected by the ROCO.
Some insight into the latter fault may be had by observing the behaviour
of the ``BLTACK'' VXI line by means of the RM's FALA unit.
\end{description}

\subsection{Trigger Unit (TU)}
Correct operation of the Trigger Unit is required for testing
individual Ge and BGO modules (see below).
\begin{description}
\item [Selftest feature] 
The TU supports a test mode in which
a sumbus input can be simulated by data transferred over the VMEbus.
The event data thus generated can be read out over the VME bus
and inspected by the RM processor.
\item [VXI line outputs]
The VXI lines driven by the TU can (in principle!) be inspected by means
of the Resource Manager FALA facility, and timing data extracted from the 
digitised waveforms.
\item [Scalers]
Scalers within the TU provide evidence that user trigger inputs are being
seen and processed.
\end{description}

\subsection{Ge and BGO cards}
Quite extensive exercising of an individual Ge or BGO card can be envisaged
by disabling its crate's ROCO and using the card's onboard test pulse 
generator in conjunction with the Trigger Unit for
pulsing the ``synchro-test'' VXI line and supplying
suitable Fast Trigger and Validation signals (by setting the TU for
single multiplicity events).
The resulting data can be read from the card's FIFO over the VMEbus.
Hardware Compton suppression can be tested by exercising Ge and BGO
channels in pairs.
More detailed diagnostic tests using the Resource Manager FALA
facility to capture analogue waveforms within modules can be 
envisaged.
However, having isolated a fault to within a single card, it makes
more sense to exchange the card.
Note that confidence in the Trigger Unit and its signal paths to
the Ge and BGO cards can be gained by satisfactory
operation of at least one Ge or BGO card in this test.

We would certainly expect test procedures for these cards to evolve
from standalone commissioning programs, but we recognise that not all
test procedures can be performed with cards in their production
environment.

\subsection{Other detector interfacing}
Interfaces for other detectors and data sources are under consideration.
These include a ``general purpose interface'' (GPI), a NIM--DT32 ADC
interface and a FERA--DT32 interface.
We would expect to see selftest features, analogous to those for
Ge and BGO cards, in these cards as well.
To exploit such features whatever format and mechanics the cards are 
designed in requires a local processor in the role of a networked 
``crate manager'' which can invoke the tests and report the results.

One would prefer not to involve the Trigger Unit (almost certainly in another
crate) in such tests, however.
A test independent of other modules is always more desirable,
provided the necessary timing signals can be generated or
simulated effectively.

\subsection{Other subsystems}
Though not directly part of the EbyE pipeline, other subsystems can
critically affect EbyE data flow:
\begin{description}
\item [High Voltage control]
\aside{Are there any selftest features in the LeCroy HV supplies---anyone?}
\item [Autofill system]
This system necessarily has feedback mechanisms (e.g., N$_2$ exhaust
vent temperature sensors, bias shutdown sensors) which can be used to
monitor proper operation.
\end{description}

\section{Setup consistency checking}
This section addresses the question ``if the system passes the confidence
test and the fault-finding diagnostics fail to find any faults, yet
the user still cannot acquire data to his satisfaction, what then?''
Under these circumstances we must assume that the \EG\ system is in
working order but the user has done something within his competence (!)
to prevent it taking data. 
We will examine ways of diagnosing his mistake.
\begin{description}
\item [Beam problems]
We know of no computer-accessible way of telling whether beam is reaching
the target.
\item [Detectors]
All Ge and BGO detectors have computer-controlled bias supplies.
We envisage a tool which highlights those detectors without bias voltage
within some acceptable working range.
This must take into account the possibility that bias has been shut down
on a ``warm'' detector.
\item [Preamplifiers]
Detector preamplifiers are also interlocked with the detector bias supplies---when
a preamplifier is powered off the interlock prevents bias being applied to
the corresponding detector. 
\item [Cabling]
A cabling fault between detectors and VXI cards cannot be ruled out.
This could be diagnosed by disabling all detectors but one in turn
and observing which channel (if any) the data appears in.
\item [Trigger logic]
Users can determine the logic conditions under which Fast Triggers
and Validations are generated.
Whether they appear or not can be checked by monitoring the statistics
gathered internally by the Trigger Unit.
\item [Ge and BGO card setup]
The data acquisition cards contain several parameters per channel that
critically affect the operation of the cards.
We should implement some sort of check of the user's setup, highlighting
inconsistent  and ``abnormal'' settings.
Wherever possible this should read the user's settings direct from
the Ge and BGO cards.
Settings must also be consistent with those in the Trigger Unit.
\item [Timing considerations]
Using the Resource Manager FALA facility it is possible in principle
to capture critical signal waveforms from VXI lines and from within
data acquisition modules.
Some automatic or semi-automatic check that these are within
specification should be feasible.
\item [Event Building and Sort Processing]
Apart from the essential syntax checking of a user's program there
is little one can do in general.
Possibilities include analysing a program to answer questions such as
``do all paths through the program do something `useful', e.g., 
increment some spectrum?'' and ``which spectra might be incremented
by this program?''
A more general aid might be an offline program emulator.
This would take a sort program and trace its execution (listing
spectrum increments, for example) given a specific input event.
Such a tool might help elucidate sort language semantics for tyro
users or provide a debugging aid for complicated programs.

\end{description}


\end{document}