If an agent becomes inaccessible from the Master Domain Manager or a Domain Manager, the manager will queue messages in its pobox directory until connectivity is restored. If an agent is unavailable for a long time, its message file can fill up and hit the default 10MB maximum size. When this happens, batchman on the manager in question will fall over, and no jobs will be executed until it’s restarted. Without automated monitoring in place, this can be a difficult condition to spot: only batchman and jobs on the manager are affected, so JSC, TDWC and Webadmin will continue to function in some areas.
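As a rough illustration of the kind of automated check that would catch this early, the sketch below lists message files that have grown past a byte threshold. The function name large_msg_files and the 8000000 byte threshold in the usage note are assumptions made for illustration, not anything shipped with TWS.

```shell
#!/bin/sh
# List *.msg files in a directory whose size is at or above a byte threshold.
# Usage: large_msg_files DIR THRESHOLD_BYTES
# Prints one "path size" pair per matching file.
large_msg_files() {
    dir=$1
    threshold=$2
    for f in "$dir"/*.msg; do
        [ -f "$f" ] || continue              # glob matched nothing
        size=$(wc -c < "$f" | tr -d ' ')     # portable byte count
        if [ "$size" -ge "$threshold" ]; then
            echo "$f $size"
        fi
    done
}
```

Run from cron as something like large_msg_files /opt/IBM/TWA/TWS/pobox 8000000, any output could be mailed to the operators before the 10MB limit is reached.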
If you notice workstations have become unlinked, log on to the manager, go into conman and type ‘v’.
%v
TWS for UNIX (SOLARIS)/CONMAN 8.2 (1.36.2.31)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2001
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Schedule (Exp) 02/18/11 (#4387) on TWSMASTER. Batchman Down. Limit: 10, Fence: 0, Audit Level: 1
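If you want to script this check, the last line of the banner is the part that matters. The sketch below classifies a status line by its Batchman state; the function name batchman_state is hypothetical, and the "Batchman LIVES" wording for a healthy manager is an assumption based on the LIVES status mentioned later in this post.

```shell
#!/bin/sh
# Classify a conman status line by the Batchman state it reports.
# Prints "up", "down" or "unknown".
batchman_state() {
    case $1 in
        *"Batchman LIVES"*) echo up ;;       # assumed wording for a running batchman
        *"Batchman Down"*)  echo down ;;     # wording taken from the banner above
        *)                  echo unknown ;;
    esac
}
```

Feed it the status line however you capture it from conman on your system; a "down" result outside the Jnextday window is the cue to start the checks below.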
If batchman is down at a time that isn’t around Jnextday then it’s likely fallen over due to a full filesystem, or a message file filling up. Do not try and restart it yet.
Firstly, check the TWS merge log to try and identify why batchman went down:
06:55:59 17.02.2011|MAILMAN:+ +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
06:55:59 17.02.2011|MAILMAN:+ AWSBCV015E Error writing to PO Box for TWSFTA, Error
06:55:59 17.02.2011|MAILMAN:+ AWSDEC003I End of file on events file.
06:55:59 17.02.2011|MAILMAN:+ +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
06:56:11 17.02.2011|BATCHMAN:#J0/Operator command: QUIT
06:56:11 17.02.2011|BATCHMAN:#J3629/Operator command: STOP
06:56:11 17.02.2011|JOBMAN:AWSBDW015I Received command Qt for run number 2509 from cpu MDMPRE.
06:56:11 17.02.2011|JOBMAN:AWSBDW051I Received quit message.
06:56:11 17.02.2011|BATCHMAN:BATCHMAN stopping now.
06:56:11 17.02.2011|JOBMAN:Terminating – cpu usage 0
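The AWSBCV015E line names the workstation whose PO box could not be written. A small filter like the one below can pull those names out of a long log; the function name full_pobox_agents is hypothetical, and the pattern assumes the message text is worded exactly as in the extract above.

```shell
#!/bin/sh
# Read merge log lines on stdin and print each workstation named in an
# AWSBCV015E "Error writing to PO Box" message, once per workstation.
full_pobox_agents() {
    sed -n 's/.*AWSBCV015E Error writing to PO Box for \([^ ,]*\).*/\1/p' | sort -u
}
```

Redirect the merge log for the day in question into the function; each name printed corresponds to a .msg file in the pobox directory worth inspecting.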
The log shows that an end of file condition has been reached on the events file, meaning the message file has filled up, in this instance for the agent on TWSFTA. Batchman won’t restart while the file is in this state. You can simply delete the offending message file, but you will lose all the messages and updates queued since the agent went offline, so the status of the batch will be difficult to determine.
A less destructive method is to try to compact the message file and increase its maximum size. If the procedure works, no messages will be lost and batchman can be restarted.
ls -l /opt/IBM/TWA/TWS/pobox
-rw------- 1 maestro tivoli 9999602 Feb 17 06:55 TWSFTA.msg
The offending file is showing as hitting the 10MB limit.
The evtsize command will show you the internal state of the message file:-
$ /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.msg
TWS for UNIX (SOLARIS)/EVTSIZE 8.2 (1.2.2.6)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2003
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
AWSDEK703I Queue size current 9999554, maximum 10000000 bytes (read 48, write 9999602)
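To make the AWSDEK703I line easier to act on, the sketch below extracts the current and maximum queue sizes and works out how full the file is. The function name queue_usage is hypothetical, and the pattern assumes the message is worded as in the output above.

```shell
#!/bin/sh
# Read evtsize -show output on stdin and print "current maximum percent"
# for each AWSDEK703I line found. Percent is rounded down.
queue_usage() {
    sed -n 's/.*AWSDEK703I Queue size current \([0-9][0-9]*\), maximum \([0-9][0-9]*\) bytes.*/\1 \2/p' |
    while read -r cur max; do
        [ "$max" -gt 0 ] || continue         # avoid dividing by zero
        echo "$cur $max $((cur * 100 / max))"
    done
}
```

For the output shown here, 9999554 of 10000000 bytes works out at 99 percent, which confirms the file is effectively full.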
Compact the file and increase its maximum size by running:-
/opt/IBM/TWA/TWS/bin/evtsize -compact /opt/IBM/TWA/TWS/pobox/TWSFTA.msg 20000000
This will set the maximum size of the message file to 20MB. Running the evtsize -show command again will show that the new setting has been applied.
$ /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.msg
TWS for UNIX (SOLARIS)/EVTSIZE 8.2 (1.2.2.6)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2003
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
AWSDEK703I Queue size current 9999554, maximum 20000000 bytes (read 48, write 9999602)
Now go into conman and issue an unlink @;noask, followed by a stop and a shutdown;wait. When the prompt returns, check that all the maestro TWS processes have stopped. Go into TWSHOME and issue a ./StartUp, then go back into conman and issue a start. Batchman should go into a LIVES status and remain up.
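The restart sequence can be written down as a script so the order is not left to memory. The sketch below only prints the steps unless RUN=1 is set. It assumes conman accepts each command as a single quoted argument and that TWSHOME is /opt/IBM/TWA/TWS (inferred from the evtsize path used earlier); the function names step and restart_sequence are hypothetical. It does not perform the process check, which should still be done by hand before StartUp.

```shell
#!/bin/sh
# Restart sequence for the manager, in dry-run mode by default.
# Set RUN=1 to execute the commands instead of printing them.
TWSHOME=${TWSHOME:-/opt/IBM/TWA/TWS}         # assumed install path

step() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

restart_sequence() {
    step conman "unlink @;noask"
    step conman "stop"
    step conman "shutdown;wait"
    # Manual check belongs here: confirm no maestro TWS processes remain.
    step "$TWSHOME/StartUp"
    step conman "start"
}
```

Calling restart_sequence with RUN unset prints the five steps in order, which is a cheap way to review them before running anything against a live manager.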
If the offending agent is back on the network and communicating with its manager, you should see the message file size decreasing:-
$ /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.m>
TWS for UNIX (SOLARIS)/EVTSIZE 8.2 (1.2.2.6)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2003
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
AWSDEK703I Queue size current 578000, maximum 20000000 bytes (read 48, write 578048)
$ /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.msg
TWS for UNIX (SOLARIS)/EVTSIZE 8.2 (1.2.2.6)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2003
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
AWSDEK703I Queue size current 55056, maximum 20000000 bytes (read 1134288, write 1189344)
$ /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.msg
TWS for UNIX (SOLARIS)/EVTSIZE 8.2 (1.2.2.6)
Licensed Materials Property of IBM
5698-WKB
(C) Copyright IBM Corp 1998,2003
US Government User Restricted Rights
Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
AWSDEK703I Queue size current 0, maximum 20000000 bytes (read 48, write 48)
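Rather than rerunning evtsize by hand, the drain can be watched with a small polling loop. The sketch below repeatedly runs whatever command you give it, reads the "Queue size current" figure from its output, and stops once that figure reaches 0. The function name wait_until_drained and its arguments are hypothetical.

```shell
#!/bin/sh
# Poll a command that prints evtsize -show output until the queue is empty.
# Usage: wait_until_drained INTERVAL_SECONDS MAX_POLLS COMMAND [ARGS...]
# Returns 0 once the current queue size is 0, or 1 if MAX_POLLS is reached.
wait_until_drained() {
    interval=$1
    max_polls=$2
    shift 2
    n=0
    while [ "$n" -lt "$max_polls" ]; do
        cur=$("$@" | sed -n 's/.*Queue size current \([0-9][0-9]*\),.*/\1/p')
        echo "poll $n: current=${cur:-unknown}"
        [ "$cur" = 0 ] && return 0
        n=$((n + 1))
        sleep "$interval"
    done
    return 1
}
```

For example, wait_until_drained 60 30 /opt/IBM/TWA/TWS/bin/evtsize -show /opt/IBM/TWA/TWS/pobox/TWSFTA.msg would check once a minute for up to half an hour.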
The actual file size should now be down to the default size of 48 bytes:
ls -l /opt/IBM/TWA/TWS/pobox
-rw------- 1 maestro tivoli 48 Feb 17 10:03 TWSFTA.msg
Mark Delaney SYSTEMSMANAGED Ltd
Thanks a bunch Mark… Saved me battling with IBM docs and PMR for an answer!