USATLAS Data Challenge 2 Shift Guide


BNL Operations Center: M-F 8AM-11AM CDT, 631-344-5480
iGOC contact information: 24x7, igoc@ivdgl.org, 317-278-9699
 
Last updated: 9/19/04

Introduction

This is a guide for taking shifts during DC2 to ensure the proper functioning of the production system and of all the subsystems and servers on which it depends. The US ATLAS Tier1 Center and the iVDGL iGOC operations center have developed an operational service model to facilitate this. Please consult the BNL Tier1/iGOC Operations Service Guide (u/p: usatlas, usatlas) to familiarize yourself with the problem reporting and escalation procedures.

BNL Tier1 Response Model

This section describes how the BNL Tier1 center responds to problems reported by the DC2 shift or by the iGOC. Below, a reporter can be either a DC2 production team member or an iGOC operator. The on-call person is one of Dantong, Xin, Jason, or Wensheng (local ATLAS DC2 staff).

  1. While on call during the week, carry your pager with you, or set up page forwarding to a cell phone that you carry with you.

  2. When you receive a phone call, take down the caller's phone number and any other contact information.

  3. Diagnose and fix the problem. In the meantime, keep the original reporter informed of your progress. How often you send updates (every half hour, once an hour) depends on your judgment. This helps the reporter adjust their work plan.

  4. When you fix the problem, it is very IMPORTANT to notify the reporter so that the DC2 team can resume production. Also summarize the problem and your solution in an email to the gts-discuss list, cc'ing the iGOC and the four of us. If you cannot fix the problem for any reason, notify the next reachable person in the list, and give them the reporter's contact phone number and a clear description of the problem. You are then off duty, and the new person becomes the primary contact and takes over the problem completely.

  5. When you take over a problem as the next on-call person, first call back the original reporter to let them know that you are now the primary contact and responsible person, then perform the duties listed in items 3 and 4, handing the problem on again as described there if necessary.

  6. If the problem is one that none of us can fix, such as a network problem or an NFS file system crash, the primary contact needs to notify the responsible person, get status updates from whoever is doing the recovery, and relay them to the original reporter. When the problem is resolved, follow item 4.
                                                                                                                 
    If the problem is related to RCF/USATLAS, call the RCF operator at 344-5480. If the problem occurs between 12:00AM and 9:00AM, call Dantong, so that he can take it over and filter the calls. If the problem is related to BNL, call the ITD help desk at 344-5522, which provides 24/7 support.
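
The hand-over rule in item 4 can be sketched in a few lines of Python. This is only an illustration of the procedure, not a tool used by the Tier1 center. The order of the on-call list, the Ticket structure, and the wrap-around to the start of the list are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# The four on-call staff named in this guide. The ordering is an assumption.
ON_CALL = ["Dantong", "Xin", "Jason", "Wensheng"]


@dataclass
class Ticket:
    """Hypothetical record of a reported problem."""
    reporter: str
    reporter_phone: str
    description: str
    primary_contact: str
    history: list = field(default_factory=list)


def hand_over(ticket, reachable):
    """Pass the ticket to the next reachable on-call person (item 4).

    Returns the new primary contact, or None if nobody else is reachable,
    in which case the ticket is left unchanged.
    """
    start = ON_CALL.index(ticket.primary_contact)
    # Walk the list once, starting just after the current contact.
    for offset in range(1, len(ON_CALL)):
        candidate = ON_CALL[(start + offset) % len(ON_CALL)]
        if candidate in reachable:
            ticket.history.append(ticket.primary_contact)
            ticket.primary_contact = candidate
            return candidate
    return None
```

The new primary contact receives the whole ticket, which carries the reporter's phone number and the problem description, as item 4 requires.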


Monitors

Monitors for DC2 include both grid-centric and job-centric monitors.
 

Grid                    Jobs
Grid3 Monalisa URL      ACDC URL
Grid3 GridCat URL       ATLAS Jobs URL
Grid3 Ganglia URL       ATLAS I/O URL
BNL GridCat URL         Prod DB Capone-view
BNL Ganglia URL         Prod DB Standard, full queries

PDSF URL
NorduGrid URL
ATLAS Total URL
U Mich PBS URL
BU PBS URL
UC Condor URL


Bug Reports

What to do when bugs are discovered.

 

Shift Calendar

The DC2 Shift Calendar is used to schedule coverage of DC2 operations on Grid3. To modify the calendar (sign up for shifts), press the "Add/Edit Events" tab, press "cancel" at the "Select User Name" prompt, and fill in the form. A password is required for the changes to take effect (the usual ATLAS password is used).


Shift Duties

Shift duties include:

Monitoring Common Submit Hosts

There are 3 common submit hosts at UTA:
    The directory structure: /atlas-grid/dc2submit/, with wx permissions for the group.
    The subdirectories:
    Each common host usually has different CEs, so that jobs can be submitted at a slow rate to every site at the same time. Only the shift person changes CEs, and only occasionally, in case of an emergency at a site. The shift person exits the executor/supervisor windows at the end of the shift and puts the capone status in the shift summary, which is mailed to GTS.

    Nurcan/Mark will update the software on these 3 hosts when necessary.

Accessing the common submit host at BNL:

    Log in to atlasgw.bnl.gov, then run:

    ssh -l sm atlasprod4.usatlas.bnl.gov

    Wensheng will give you the password.
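
The two-step login can also be written as a single command line. The sketch below only builds that command as an argument list and does not run it. The -t and -l options are standard OpenSSH options, but whether the gateway allows chaining logins this way is an assumption, and the function name is made up for the example.

```python
def two_hop_ssh_command(gateway, target, target_user):
    """Build an ssh command that logs in to `gateway`, then to `target`.

    -t asks ssh to allocate a terminal on the gateway, so that the inner
    ssh can still prompt for the password interactively.
    """
    inner = ["ssh", "-l", target_user, target]
    return ["ssh", "-t", gateway] + inner


# Host and account names as given in this guide.
command = two_hop_ssh_command(
    "atlasgw.bnl.gov", "atlasprod4.usatlas.bnl.gov", "sm"
)
print(" ".join(command))
```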


Start of shift (these steps are specific to UTA, but similar steps apply at BNL):
    Open a session on atlas005.uta.edu (the same applies to the other 2 hosts at UTA).

    cd /atlas-grid/dc2submit/capone/Capone-only/capone

    source /atlas-grid/dc2submit/gce/setup.sh

    ./capone status | tail -20

    If everything looks OK with capone, open another session and run ./launch_executor. Turn on 'print'.

    If that is OK, open another session and run ./launch_supervisor. Turn on 'print'.

    All current jobs will be recovered: you will see each job number scroll by on the supervisor screen. After recovery, you should see getStatus, and then new jobs will be submitted.

During your shift:
    Periodically check the supervisor window for errors.

    Periodically check the output of ./capone status | tail or condor_q.

    PLEASE DO NOT CHANGE windmill.xml or capone.ini. If there are problems, use 'stop' or 'pause' in the supervisor window. The only exception: if a CE listed in capone.ini is going to be down for a long period of time, change the CElist and run ./capone reload.
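
When the supervisor output has been saved to a file or copied from the window, the periodic error check can be partly automated. The sketch below is a rough aid, not part of Capone or Windmill. This guide does not describe the format of the supervisor output, so the keywords are guesses, and the sample text in the usage example is invented.

```python
def find_suspect_lines(text, keywords=("error", "failed", "exception")):
    """Return (line number, line) pairs that contain any keyword.

    The match is case-insensitive. The default keywords are assumptions;
    adjust them to whatever the supervisor actually prints.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.strip()))
    return hits


# Invented sample text, NOT real supervisor output.
sample = "getStatus ok\nERROR: transfer failed\njob 12 submitted\n"
for number, line in find_suspect_lines(sample):
    print(number, line)
```

A hit only means the line deserves a look; the shifter still decides whether to 'pause' or 'stop' the supervisor.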

End of shift:
    Type 'stop' in the supervisor window. Type 'status' and put the result in the shift summary, then 'exit' the session when it is safe to do so.

    Type ./capone status summary and put the result in the shift summary. Note that this will report that capone is not running; that message is incorrect.
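
The shift summary combines the two outputs collected above. A minimal sketch of assembling it as an email message with Python's standard library follows. The subject line, the layout of the body, and the function name are assumptions, and the recipient address must be supplied by the shifter, since this guide does not give the GTS list address. The sketch only builds the message; it does not send it.

```python
from email.message import EmailMessage


def build_shift_summary(shifter, host, supervisor_status, capone_summary, to_addr):
    """Assemble a shift summary email from the two end-of-shift outputs."""
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"DC2 shift summary: {host} ({shifter})"
    body = "\n".join([
        "Supervisor 'status' output:",
        supervisor_status.strip(),
        "",
        "./capone status summary output:",
        capone_summary.strip(),
        "",
        "Note: the summary may report that capone is not running;",
        "that message is incorrect.",
    ])
    msg.set_content(body)
    return msg
```

Pasting the two outputs in unchanged keeps the summary faithful to what the shifter saw on screen.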

If anything goes wrong:
    'pause' the supervisor, if possible. Then send email to GTS.

    Do not try to start or restart capone from this host. Send email to Kaushik if that becomes necessary. ./capone check does not work for now. Use ./capone status and check the log files for diagnostics.


Core Services

This is an excerpt from the Tier 1/iGOC operations service guide.

Inventory and Descriptions


Site  Host names                   Services             Criticality  Monitor
BNL   atlasgrid02.usatlas.bnl.gov  Globus RLS Server    Critical     BNL gridcat
BNL   atlasgrid02.usatlas.bnl.gov  DQ server            Critical     BNL gridcat
BNL   aftpexp01.bnl.gov            Storage Elements     Critical     BNL gridcat
BNL   aftpexp02.bnl.gov            Storage Elements     Critical     BNL gridcat
BNL   db1.usatlas.bnl.gov          VDC Server           Critical     BNL gridcat
BNL   atlasgrid01.usatlas.bnl.gov  Gatekeeper           High         BNL gridcat
BNL   spider.usatlas.bnl.gov       Gatekeeper           High         BNL gridcat
UTA   atlas000.uta.edu             Jabber               Critical
CERN  prdb01                       Production database  Critical     PHP URL
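
For quick lookups during a shift, the inventory above can be kept as plain data. The sketch below copies the site, host, service, and criticality columns from the table. The function name is made up for the example, and nothing here contacts the hosts.

```python
# (site, host, service, criticality), copied from the inventory table.
CORE_SERVICES = [
    ("BNL", "atlasgrid02.usatlas.bnl.gov", "Globus RLS Server", "Critical"),
    ("BNL", "atlasgrid02.usatlas.bnl.gov", "DQ server", "Critical"),
    ("BNL", "aftpexp01.bnl.gov", "Storage Elements", "Critical"),
    ("BNL", "aftpexp02.bnl.gov", "Storage Elements", "Critical"),
    ("BNL", "db1.usatlas.bnl.gov", "VDC Server", "Critical"),
    ("BNL", "atlasgrid01.usatlas.bnl.gov", "Gatekeeper", "High"),
    ("BNL", "spider.usatlas.bnl.gov", "Gatekeeper", "High"),
    ("UTA", "atlas000.uta.edu", "Jabber", "Critical"),
    ("CERN", "prdb01", "Production database", "Critical"),
]


def hosts_by_criticality(level):
    """Return the distinct hosts with the given criticality, in table order."""
    hosts = []
    for _site, host, _service, criticality in CORE_SERVICES:
        if criticality == level and host not in hosts:
            hosts.append(host)
    return hosts


print(hosts_by_criticality("High"))
```

A host that runs several services, such as atlasgrid02.usatlas.bnl.gov, is listed only once.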


References

  1. How to Run Jobs on Grid3 with Windmill and Capone
  2. US ATLAS Grid Tools and Services
  3. US ATLAS Grid homepage