Robert’s Trouble Shooting Guide

The following is offered for your assistance, without prejudice.

INDEX:
	Safety
	Overall
	Gather the History
	Check It Out
	Closing in on the Problem
	Isolating the Problem
	Effecting a Solution
	Before you Consider the Job Done
	When the Problem Resists Solution

Safety:

Because of the general nature of this article, safety cannot be adequately addressed here. Work carefully, and with due care and attention to what you are doing and to what is going on around you. Apply common sense in thinking and action, for your personal safety, for the safety of others, and for the preservation of equipment, tools and environment.

A system, especially in failure mode, can be dangerous. Although dangers are always present, one existing failure could indicate, or generate, a weakness in the system. An existing failure could also allow the compounding of a second failure, whether that second failure was dependent on the first, or independent of it. Such additional failures could precipitate further damage to the equipment. Also possible is injury to personnel, and damage to other equipment in the vicinity. These consequences must be considered before any further running of the equipment, including test running, is performed.

With guards removed, as may be necessary to troubleshoot, the safety of the guard must be replaced by additional diligence, attention and care.

Overall:

Troubleshooting may turn out to be an iterative process. Always be prepared to go back to a previous step and pursue new ideas. When collecting data, keep this in mind, and collect enough that ‘going back’ means returning to data already in hand, not collecting more. This will save time – your time and that of others.

If the ‘data’ you collect is not internally consistent, you may have collected conclusions rather than data, your ‘data’ may be wrong, or both. Try to resolve any discrepancies before continuing with another step – or make resolving them a goal of the next step.

Gather the History:

Before fixing something, you should know what is wrong. Knowing what has happened can help you look for the problem.

What can anyone tell you?
- What does Management say the problem is?
- What does Maintenance say the problem is?
- What does Production (the user) say the problem is?
Did it ever work? Are there known quirks with the system?
- Some of the pointers in this article may be useful during project development, when one has a system which, as viewed on paper and in simulation, should work, but which is not working in physical form;
- Other special considerations should be given to a system which worked at the bench-top or pilot-plant level, but is now not working at a larger scale.
Who was there when it quit working (or started making strange noises, or otherwise stopped behaving normally)?
- Were there any particular observations made?
What changes have been made since it was working well?
- Were there any well-intentioned modifications which could have gone wrong?
What repair attempts have been made?
- What were the results and findings of these efforts?
- If repairs have been attempted, what remains from those efforts?
What evidence is available from the equipment?
- Particles, or pieces, on the floor;
- Broken seals, and bent parts;
- Damaged components within;
- etc.

Check It Out:

Verify that the problem, as defined, exists.

If the problem is intermittent, you should know that from the start.
If the problem involves a complex set of pre-conditions, then one or more of those pre-conditions is probably part of the cause.
If the problem is based in the user’s unfamiliarity with the proper operation of the equipment, equipment repair cannot resolve the situation.

Obtain first-hand experience with the problem, and reduce it to a simple form. This can help you make sense of what you were told.

You may find yourself re-interpreting what was said as you realize that those who talked to you had jumped to false conclusions, or were abusing technical terms.
What you were told may have been quite true – at one time – but is now obsolete information.
If the problem was generated with a complex set of pre-conditions, find out which are critical to replicating the problem, and which are extraneous.
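
One way to sort critical pre-conditions from extraneous ones is to drop them one at a time and see whether the problem still appears. The following is only a minimal sketch of that idea, not a prescribed method: the function name, the pre-condition names, and the `problem_occurs` callback (which stands in for actually running the equipment) are all hypothetical. It also assumes the problem reproduces reliably; an intermittent problem would need several trials per combination.

```python
from typing import Callable, FrozenSet, Iterable, List


def critical_preconditions(
    preconditions: Iterable[str],
    problem_occurs: Callable[[FrozenSet[str]], bool],
) -> List[str]:
    """Return the pre-conditions that cannot be dropped without losing the problem."""
    remaining = list(preconditions)
    for candidate in list(remaining):
        # Try to reproduce the problem with this one pre-condition left out.
        trial = frozenset(p for p in remaining if p != candidate)
        if problem_occurs(trial):
            # The problem still shows up without it, so it is extraneous.
            remaining.remove(candidate)
    return remaining


# Hypothetical example: the fault really needs only a cold start plus high load.
def simulated_run(conditions: FrozenSet[str]) -> bool:
    return {"cold start", "high load"} <= conditions


reported = ["cold start", "door open", "high load", "night shift"]
print(critical_preconditions(reported, simulated_run))
```

Each call to the callback is one attempt to replicate the problem, so the number of test runs grows only linearly with the number of reported pre-conditions.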

Consider these questions, as you hypothesize where the problem may be:

What else is not working?
What portions of the system are working properly?

There may be a quick path to rectification, as suggested in the answer to the following:

Have you seen this problem before? Have others heard of this assemblage of symptoms? Etc.
- It may be considered likely that the same solution is needed – provided this is not the same system, in which case it seems the ‘solution’ did not work.
- Keep in mind, it may be a different problem with a similar symptom.
Are there signs of:
- Damage?
- Improper Adjustment?
- Tampering?

If there is need to look further, consider these:

What has changed since last the system was working?
What do the indicator lights, gauges, and other devices say? Do you believe these devices?
If there seem to be multiple problems, consider that they are multiple symptoms with a common cause.
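
The search for a common cause behind multiple symptoms can be sketched as a set intersection: list what each misbehaving part depends on, and look for what they all share. This is an illustration only. The component names and the `depends_on` map are invented for the example, and a real system would need its dependencies taken from drawings or schematics.

```python
from typing import Dict, Iterable, Set


def common_upstream(failing: Iterable[str], depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return the items that every failing component depends on."""
    shared = None
    for component in failing:
        deps = set(depends_on.get(component, set()))
        # Keep only the dependencies shared with all components seen so far.
        shared = deps if shared is None else shared & deps
    return shared or set()


# Hypothetical dependency map for three symptoms seen at the same time.
depends_on = {
    "conveyor": {"24V supply", "PLC", "motor drive"},
    "indicator panel": {"24V supply", "PLC"},
    "door interlock": {"24V supply"},
}
print(common_upstream(["conveyor", "indicator panel", "door interlock"], depends_on))
```

An empty result does not prove the problems are unrelated; it may only mean the dependency map is incomplete.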

Closing in on the Problem:

From what you have been told, what is the likely nature of the problem? Try to define two points: “Everything up to here is working”; and “Beyond here, things would function if the inputs were right.” The input of simulated signals may help in defining this latter point.

Select a spot between the “OK-to-here” point and the “OK-from-here” point and determine, if possible, whether the process is functioning up to, and/or beyond, this new point. This should allow one of the earlier points (“OK-to-here” or “OK-from-here”) to be moved closer to the other. The spot for testing could be chosen based on the convenience with which the test can be done, the convenience of repairing or replacing a component, or a suspicion as to where the problem is. Failing any other logic, the new point should be roughly ‘half way’ between the other two. (‘Half way’ could be based on: the count of components modifying, or otherwise changing, the signal; the distance the signal travels; or something else.)
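
The ‘half way’ rule is, in effect, a binary search. The sketch below shows it for the simplest case only: a single, unbranched chain of stages with exactly one fault, a good signal going in, and a bad signal coming out of the last stage. The stage names and the `signal_ok_after` callback (which stands in for a measurement at a test point) are hypothetical.

```python
from typing import Callable, Sequence


def narrow_fault(stages: Sequence[str], signal_ok_after: Callable[[int], bool]) -> str:
    """Return the first stage whose output is bad, probing roughly half way each time."""
    good_upto = -1               # output of this stage is known good (-1 means the input)
    bad_from = len(stages) - 1   # output of this stage is known bad
    while bad_from - good_upto > 1:
        probe = (good_upto + bad_from) // 2
        if signal_ok_after(probe):
            good_upto = probe    # move the "OK-to-here" point downstream
        else:
            bad_from = probe     # move the known-bad point upstream
    return stages[bad_from]


# Hypothetical chain in which the filter has failed.
chain = ["sensor", "amplifier", "filter", "converter", "display"]
print(narrow_fault(chain, lambda i: i < chain.index("filter")))
```

With five stages this takes two or three measurements rather than up to four done in sequence. Where one test point is far easier to reach than another, convenience may reasonably override the midpoint.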

If the suspect signal splits, and the failure is seen only at the end of one branch, then the system is probably OK up to the split. If none of the branches beyond the split is working, there are three possibilities: a single failure upstream of the split, which may be considered the most likely; multiple simultaneous failures downstream of the split, as could be caused by the likes of a voltage spike or pressure surge; or, in a bad-case situation, a surge that damaged both the signal source and equipment downstream of the split. Consider the possibilities, their likelihood, and the testing options before choosing where and how to look next; and keep in mind that one may be working on a hunch, at best. The option not chosen may be hiding the solution.

Reiterate the foregoing while practical.

Sometimes finding the problem can be done by actually implicating a specific component or module. At other times – especially when dealing with a system in development (a system that has never worked) – there can be a subtle difference between ‘finding and resolving a problem’, and ‘resolving a problem, thereby identifying it so it can be avoided in the future’. In this case, if applying a solution to a hypothesized problem resolves it, then one has likely identified the problem. Use this approach with prudence when working on a system which has previously been fully and properly commissioned and functioning.

Isolating the Problem:

Whether to perform a simple test on a component that is likely to be good, or a complicated diagnostic test on a component that is likely to have failed, is a judgement decision. How simple versus complicated? How unlikely versus likely? How strongly do the indicators point to the component involved in the complicated test? What else would you learn in doing the simple test?
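
One possible way to put numbers on this judgement, offered as an assumption rather than as part of this guide, is to rank each candidate check by its estimated likelihood of finding the fault per minute of effort. The names and figures below are invented, and the estimates remain your own guesses; the sketch only shows how a quick test on an unlikely component can still deserve to go first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    likelihood: float    # your estimate (0..1) that this component is the culprit
    test_minutes: float  # your estimate of the effort needed to test it


def rank_checks(candidates: List[Candidate]) -> List[Candidate]:
    """Order checks by likelihood per minute of effort, best value first."""
    return sorted(candidates, key=lambda c: c.likelihood / c.test_minutes, reverse=True)


# Hypothetical estimates for three suspects.
suspects = [
    Candidate("controller board", likelihood=0.6, test_minutes=60),
    Candidate("cable", likelihood=0.3, test_minutes=10),
    Candidate("fuse", likelihood=0.1, test_minutes=1),
]
for c in rank_checks(suspects):
    print(c.name)
```

The ranking ignores what else a test might teach you, which the questions above suggest should also be weighed.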

Can you swap components in the suspect area with believed-good components from a functioning system, or from available spares? (Note that there may be special settings or adjustments which will have to be implemented on a replacement part – and undone, if the part is to be returned to its previous location, or at a minimum documented if the part is to be returned to storage.)

Does the replacement component make the system work, or does it leave the system in the same failure mode? If the same failure mode is exhibited, can you verify that the replacement part is not defective?

Does the suspect component work where the replacement was taken from, or does it replicate the failure mode in that system?

When components are returned to their original locations, does the problem move back? (This is known as A-B-A testing, and can help identify problems that are simply from loose connections and/or improper adjustment.)

If swapping components introduces a new failure mode in either system: It is time to slow down and think!

If a previously working system exhibits a failure mode when components are swapped back into their original positions: It is time to slow way down and have a long think!
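
The swap-test questions above can be summarised as a small decision table. The sketch below is one reading of them, with hypothetical parameter names; it covers only the outcomes discussed here, and returns ‘inconclusive’ for anything else rather than guessing.

```python
def interpret_swap(
    fails_with_replacement: bool,
    donor_fails_with_suspect: bool,
    fails_after_swap_back: bool,
) -> str:
    """Suggest a reading of an A-B-A swap test on a suspect component."""
    if fails_with_replacement:
        # Same failure with a believed-good part installed.
        return "fault stayed put: look elsewhere, or verify the replacement is good"
    if not fails_after_swap_back:
        # The original part works once it has been removed and refitted.
        return "fault vanished: suspect a loose connection or improper adjustment"
    if donor_fails_with_suspect:
        # The failure went with the part and came back with it.
        return "fault follows the part: the suspect component is implicated"
    return "inconclusive: slow down and think"


print(interpret_swap(False, True, True))
print(interpret_swap(False, False, False))
```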

Effecting a Solution:

Resist the temptation to make adjustments before knowing that an adjustment is needed. This is especially true where the problem could be as simple as an electronic component needing a reset.

Resist the temptation to make adjustments before knowing where an adjustment is needed. If there is an alignment problem between two components, determine which should be moved to minimize the necessity of any subsequent realignments between the balance of the system and the adjusted component.

If you understand what caused the problem, you can probably understand what the problem is, and how it should be addressed. If you figure there is no way the failure could have happened, you may not truly know what has failed.

Is the economic/expedient route to repair, to replace, or to troubleshoot further? Should you instead make a ‘work-around’?

Even if you are sure of what adjustments are to be made, consider generating enough information to allow an “undo.” This information can also be used to judge the magnitude of the change being made, which is perhaps useful in other ways.
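
As a minimal sketch of such “undo” information, one could record the settings before and after an adjustment and keep the differences. The setting names and values below are invented for illustration, and numeric settings are assumed so that the magnitude of each change can be computed.

```python
from typing import Dict, Tuple


def change_log(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, Tuple[float, float, float]]:
    """Map each changed setting to (old value, new value, size of change)."""
    return {
        name: (before[name], after[name], after[name] - before[name])
        for name in before
        if name in after and after[name] != before[name]
    }


def undo(current: Dict[str, float], log: Dict[str, Tuple[float, float, float]]) -> Dict[str, float]:
    """Return the settings with every logged change put back to its old value."""
    restored = dict(current)
    for name, (old, _new, _delta) in log.items():
        restored[name] = old
    return restored


# Hypothetical readings taken before and after an adjustment.
before = {"belt_tension_N": 120.0, "sensor_gap_mm": 1.5}
after = {"belt_tension_N": 135.0, "sensor_gap_mm": 1.5}
log = change_log(before, after)
print(log)
print(undo(after, log) == before)
```

On paper, the same record is simply three columns: as found, as left, and the difference.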

Ensure the repairs make the system functional.

Before you Consider the Job Done:

Remove tools, undo special settings, close the machine, replace guards, return borrowed tools, etc.

If you have cured the symptom but not the cause, what plans should be made to find and cure the problem? If the solution was a ‘work-around’, is it to be permanent?

If the problem was the result of abuse, is re-training (of operator, maintenance, or other crews) required? Should the manual be revised?

If the solution was a ‘work-around’, is re-training required?

Re-ensure, after replacing all guards and covers, that all operations are functional. Report to the customer that the equipment is available for use. (At the risk of embarrassing yourself, the steps of this paragraph may, in some situations, be interchanged.)

If troubleshooting turned into a ghost-hunting exercise, and no significant repairs were made, then make enough notes so that you, or someone else, can pick up from where you left off when [‘when’ not ‘if’] the ghost re-appears.

Document and share your experience, so others may know what areas are prone to failure.

Especially if you implemented a work-around: Leave documentation with the system so others may operate and maintain what is now non-standard equipment; Share with colleagues, and file documentation, so others may re-use the work-around you developed.

Consider the utility of further examinations, which would come from a “root cause failure analysis.”

When the Problem Resists Solution:

On a rare occasion, you may encounter a problem which seems to resist sensible logic; when the simplest solution is not a solution. One or more of the following may help:

If everybody else has been looking, without success, for a mechanical source of the problem, and finding nothing of note – Look to the electrical side.
If everybody else has been looking, without success, for an electrical source of the problem, and finding nothing of note – Look to the mechanical side.
In an analogue system, remember the digital nature of analogue signals. (Quantized steps, and saturation levels.)
In a digital system, remember the analogue nature of digital signals. (Rise and fall times.)
- Back in the days when PC-to-printer connections used parallel communication (IEEE 1284, or a progenitor), I had a combination of computer, cable, and printer, in which the character pair ‘n ’ (‘n’ followed by a space [0110 1110; 0010 0000]) would occasionally be printed as ‘n!’ [0110 1110; 0010 0001]. The problem needed all three components present, and happened irregularly.
- Suspecting signal interference at the analogue level, I looked at the bit pattern of the characters involved and noted their similarities and differences. On challenging the system with another, similar, character set – and getting the homologous error, more frequently (‘~@’ [0111 1110; 0100 0000] printing as ‘~A’ [0111 1110; 0100 0001]) – I was confident, without resorting to an oscilloscope to check signal timing, of what was involved.
- Soldering a small capacitor across the signal and return traces of the least-significant-bit (LSB) data line resolved the problem.
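
The bit-pattern comparison in this anecdote can be reproduced with an exclusive-OR of the intended and printed characters, which leaves a 1 only where the two differ. This is a small illustrative sketch, assuming 8-bit ASCII codes; the function name is my own.

```python
from typing import List


def differing_bits(sent: str, printed: str) -> List[int]:
    """Return the bit positions (0 = least significant) where two characters differ."""
    diff = ord(sent) ^ ord(printed)
    return [bit for bit in range(8) if (diff >> bit) & 1]


# The two corrupted characters from the anecdote: space -> '!' and '@' -> 'A'.
for sent, printed in [(" ", "!"), ("@", "A")]:
    print(f"{ord(sent):08b} -> {ord(printed):08b}  differing bits: {differing_bits(sent, printed)}")
```

Both errors come down to bit 0 alone, which is what points at a single data line rather than at the printer’s character handling.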

Copyright

This material is Copyright (© 1998, through 2016), Robert W. C. Stevens. Reproduction, with this copyright notice intact, is permitted – but sharing the URL would save a tree, and probably make more sense.

The latest version of this page may be accessed at
http://www.wendygamble.com/RwcS/Guides/TroubleShooting.html

Pleased To Be Of Service, RwcS