This is a guide for shift-taking during DC2, to ensure the proper functioning of the production system and all the sub-systems and servers on which it depends. The US ATLAS Tier1 Center and the iVDGL iGOC operations center have developed an operational service model to facilitate this. Please consult the BNL Tier1/iGOC Operations Service Guide (u/p: usatlas, usatlas) to familiarize yourself with problem reporting and escalation procedures.
This section describes the model for the BNL Tier1 center's response to problems reported by the DC2 shift or by the iGOC. In what follows, reporters can be either DC2 production team members or iGOC operators. The on-call person is one of Dantong, Xin, Jason, or Wensheng (local ATLAS DC2 staff).
The on-call person during the week should carry the pager, or set up page forwarding to a cell phone that they carry.
When an on-call person receives a phone call, they should take down the caller's phone number and any other contact information.
Diagnose and fix the problem. In the meantime, keep the original reporter regularly updated on progress. The update frequency can be every half hour or every hour, depending on your judgment; this helps the reporter adjust their work plan.
When you fix the problem, it is very IMPORTANT to notify the reporter so that the DC2 team can resume production. You also need to summarize the problem and your solution in an email to the gts-discuss list, cc'ing the iGOC and the four of us. If you cannot fix the problem for any reason, notify the next reachable person in the list, providing them with the reporter's contact phone number and a clear problem description. You are then off duty; the new person becomes the primary contact and takes over the problem completely.
When the next on-call person takes over the problem, they should first call back the original reporter, identify themselves as the new primary contact and responsible person, and then perform the duties listed above.
If none of us can fix the problem ourselves (for example, a network problem or an NFS file system crash), the primary contact needs to notify the responsible person, get status updates from the person doing the recovery, and relay them to the original reporter. When the problem is resolved, follow the same notification and summary procedure as above.
If the problem is related to RCF/USATLAS, call the RCF operator at 344-5480. If the problem happens between 12:00 AM and 9:00 AM, call Dantong, who will take it over and filter the calls. If the problem is related to BNL, call the ITD help desk at 344-5522 (24/7 support).
|Grid3 GridCat||URL||ATLAS Jobs||URL|
|Grid3 Ganglia||URL||ATLAS I/O||URL|
|BNL GridCat||URL||Prod DB||Capone-view|
|BNL Ganglia||URL||Prod DB||Standard, full queries|
|U Mich PBS||||||
What to do when bugs are discovered.
The DC2 Shift Calendar is used to schedule coverage of DC2 operations on Grid3. You can modify the calendar (sign up for shifts) by pressing the "Add/Edit Events" tab, pressing "cancel" at the "Select User Name" prompt, and filling in the form. A password is required for the changes to take effect (the usual ATLAS password is used).
Immediate response to any alarms or conditions, especially those concerning the core, critical services for DC2. Many times this will involve facilitating communication among service experts, iGOC staff, and DC2 submitters.
Keeping the production queues saturated at all times under normal conditions, especially the priority-queued sites (FNAL, PDSF, Caltech) and the core workhorse sites in US ATLAS (BNL, UTA, BU, IU, UC).
Check usatlas1 Globus GASS caches of the heavily used sites.
Survey the $DATA areas on heavily used sites and do cleanup if required. Our interim policy, until we can think of something better, is to:
% get-remote-jobinfo
Usage: get-remote-jobinfo <jobID> <full_path_to_jobs_directory>
% get-workdir-space
Usage: get-workdir-space <all|pool_handle>
% striptmp
This script can be used to remove from the tmp directory all files older than a specified number of days. Usage: striptmp 3
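The striptmp-style cleanup step can be sketched as a small shell function. This is a minimal sketch, not the real striptmp script: the function name, argument order, and use of `find -mtime` are assumptions.

```shell
# strip_tmp DIR DAYS: remove regular files under DIR whose last
# modification is more than DAYS days ago (a striptmp-like sketch).
strip_tmp() {
    dir=$1
    days=$2
    # -mtime +N matches files modified more than N*24 hours ago
    find "$dir" -type f -mtime +"$days" -print -exec rm -f {} \;
}
```

Usage would mirror the real tool, e.g. `strip_tmp /scratch/tmp 3` to delete files older than three days while printing what was removed.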
Query the production database to find out who is submitting jobs:

select count(*), executor from jobexecution where executortype='capone' and jobstatus='submitted' group by executor order by 1 desc
Please include the result of this query in every shift summary, so that dead jobs do not build up.
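To make the query easy to rerun each shift, it can be wrapped in a small helper that prints it for piping into whatever SQL client reaches the production database. The helper name and the client invocation in the comment are assumptions; the actual client, host, and credentials depend on the production database setup.

```shell
# Print the dead-job query so it can be piped into a SQL client.
# (Helper name is hypothetical; the query itself is from the guide.)
prod_submitter_query() {
    cat <<'EOF'
select count(*), executor
from jobexecution
where executortype='capone' and jobstatus='submitted'
group by executor
order by 1 desc;
EOF
}

# Assumed usage with a generic client (host/user are placeholders):
#   prod_submitter_query | <sql-client> -h <proddb-host> -u <user> -p
```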
Write a summary for the shift, including details such as:
Production statistics for the previous 12 hours.
Inventory of problems and incidents, including detailed information regarding understanding of failures and resolution.
Notices of any planned service outages for the next shifts.
Monitor jobs, pause/resume, for the common submit hosts (see below)
Nurcan/Mark will update the software on these 3 hosts when necessary.

Accessing the common submit host at BNL:
then > ssh -l sm atlasprod4.usatlas.bnl.gov (Wensheng will give you the password).
Start of shift (these instructions are specific to UTA, but similar ones apply at BNL). Open a session on atlas005.uta.edu (the same applies for the other 2 hosts at UTA):
./capone status | tail -20
If everything looks OK with Capone, open another session and run ./launch_executor; turn on 'print'.
If OK, open another session and run ./launch_supervisor; turn on 'print'.
All current jobs will be recovered; you will see each job number scroll on the supervisor screen. After recovery, you should see getStatus, and then new jobs will be submitted.
Periodically check ./capone status | tail or condor_q.
PLEASE DO NOT CHANGE windmill.xml or capone.ini. Use 'stop' or 'pause' in supervisor window, if there are problems. If a CE in capone.ini is going to be down for a long period of time, only then change the CElist and do ./capone reload.
Type ./capone status summary and put the result in the shift summary. Note: the output will say Capone is not running; this is incorrect and can be ignored.
Do not try to start or restart Capone from this host; send email to Kaushik if that becomes necessary. ./capone check does not work for now; use ./capone status and check the log files for diagnostics.
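The periodic checks above can be collected into one snapshot function that appends a timestamped record to a shift log, so problems can later be reconstructed for the shift summary. This is a sketch under assumptions: it presumes it runs from the Capone directory on the submit host, and the suggested 30-minute interval and log path are not prescribed anywhere in the guide.

```shell
# Append a timestamped Capone/Condor status snapshot to a shift log.
# Assumes the current directory is the Capone directory on the submit host.
snapshot_status() {
    log=$1
    {
        echo "=== $(date) ==="
        ./capone status 2>/dev/null | tail -20
        condor_q 2>/dev/null | tail -5
    } >> "$log"
}

# Assumed usage (interval is a suggestion, use your own judgment):
#   while :; do snapshot_status "$HOME/capone-watch.log"; sleep 1800; done
```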
|BNL||atlasgrid02.usatlas.bnl.gov||Globus RLS Server||Critical||BNL gridcat|
|BNL||atlasgrid02.usatlas.bnl.gov||DQ server||Critical||BNL gridcat|
|BNL||aftpexp01.bnl.gov||Storage Elements||Critical||BNL gridcat|
|BNL||aftpexp02.bnl.gov||Storage Elements||Critical||BNL gridcat|
|BNL||db1.usatlas.bnl.gov||VDC Server||Critical||BNL gridcat|
|CERN||prdb01||Production database||Critical||PHP URL|
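A quick first check on the critical BNL hosts in the table above can be scripted as a reachability sweep. This is only a sketch: it tests ICMP reachability, which does not confirm that the service itself (RLS, DQ, VDC, the storage elements) is healthy, and the `ping` flags assume a Linux host.

```shell
# check_host HOST: report whether HOST answers a single ping.
# Reachability only; does not verify the service running on the host.
check_host() {
    if ping -c 1 -W 2 "$1" > /dev/null 2>&1; then
        echo "$1: reachable"
    else
        echo "$1: NOT reachable -- investigate"
    fi
}

# Sweep the critical BNL hosts from the table above:
#   for h in atlasgrid02.usatlas.bnl.gov aftpexp01.bnl.gov \
#            aftpexp02.bnl.gov db1.usatlas.bnl.gov; do
#       check_host "$h"
#   done
```

A host reported as unreachable should be escalated per the procedures at the top of this guide; a reachable host may still have a broken service, so follow up with the service-specific monitors.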