Windmill, Capone, and GCE: How-To Execute Jobs on Grid3


Instructions valid for (updated 12/06/04; please send corrections to rwg at hep dot uchicago dot edu):

Windmill 0.9.15 (homepage)

Capone 0.6.11 (homepage)

GCE 0.5.43 (http://griddev.uchicago.edu/swhome/atgce/)

ATLAS release 8.0.8



Install and test GCE-Client

1) Working area
 
 
% cd /grid/data2a/users/rwg/


2) Install Pacman
 

 
% mkdir Pacman-latest
% cd Pacman-latest
% wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.0100.tar.gz
% tar -xvzf pacman-3.0100.tar.gz
% cd pacman-3.0100
% source ./setup.csh 
/grid/data2a/users/rwg/Pacman-latest/pacman-3.0100
Your Python is version 1.5.2.  Building Python 2.2.3 for Pacman:
Untarring...
Building..
Cleaning up...
Python [2.2.3] has been built.
Source setup.csh(sh) one more time and you're ready to use Pacman 3.
% source ./setup.csh
/grid/data2a/users/rwg/Pacman-latest/pacman-3.0100



NOTE (06-Dec-2004): Pacman 3 still has problems installing VDT-1.2.x.

UNTIL FURTHER NOTICE, USE Pacman version 2.

2.1) Install Pacman version 2.126 in the same way as above.
 
 
 
% wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-2.126.tar.gz
% (etc)
 

3) Install GCE-Client

 
% pacman -get GCL:GCE-Client
(takes about 20-40 minutes depending on network)
The amount of storage needed for this package is about 560 Mbytes.


    afterwards:
 

 
[rwg@tier2-02 ~/gce-client]$ stripsetup 
[rwg@tier2-02 ~/gce-client]$ source setup.csh
[rwg@tier2-02 ~/gce-client]$ vdt-version
You have installed the complete VDT version 1.2.1:
    Virtual Data System 1.2.13
    ClassAds 0.9.5
    Condor 6.6.6
    EDG CRL Update 1.2.5
    EDG Make Gridmap 2.1.0
    Fault Tolerant Shell (ftsh) 2.0.5
    Globus 2.4.3 plus patches
    GLUE Information providers, (CVS version 1.79, 4-April-2004)
    GLUE Schema 1.1, extended version 1
    GPT 3.1
    GSI-Enabled OpenSSH 3.4
    Java SDK 1.4.2_05
    KX509 2031111
    Logrotate 3.3
    Monalisa 1.2.12
    MyProxy 1.11
    Netlogger 2.2
    PyGlobus 1.0
    PyGlobusURLCopy 1.1.2.11
    RLS 2.1.5
    UberFTP 1.3
[rwg@tier2-02 ~/gce-client]$ vds-version
1.2.13
[rwg@tier2-02 ~/gce-client]$ gce-version
++++++++++++++++++++++++++++++++++++++++++++
GCE Client 0.5.43 installed at  Fri Nov 12 13:40:20 CDT 2004
++++++++++++++++++++++++++++++++++++++++++++

Install Capone

Create a separate directory, cd to it, and run the following (note: you must use Pacman3 for Capone-only,
and you must set up GCE-Client first):
 
 
[rwg@tier2-02 ~/Capone-only]$ pacman -get GCL:Capone-only
Do you want to add [GCL] to [trusted.caches]? (y or n): y
Package [Capone-only] found in [GCL]...
Downloading [capone-0.6.11.tar.gz] from [grid.uchicago.edu]...
    874/874 k downloaded...
Untarring [capone-0.6.11.tar.gz]...
[rwg@tier2-02 ~/Capone-only]$ 

Install Windmill

Create another separate directory for Windmill and run the following (Pacman3 is required):
 
 
[rwg@tier2-02 ~/windmill]$ pacman -get UTA:Windmill
Do you want to trust the cache: [UTA]? (y or n): y
Package [Windmill] found in cache [http://heppc12.uta.edu/pacman/]...
Do you want to trust the cache: [BU-ATLAS]? (y or n): y
Package [DC2-Base] found in cache [http://atlas.bu.edu/caches/]...
Package [DC2-Registry] found in cache [http://atlas.bu.edu/caches/]...
Package [Atlas-Snapshots] found in cache [http://atlas.bu.edu/caches/]...
Downloading [windmill-0.9.15.tar.gz] from [www-hep.uta.edu]...
       14/14 Megs downloaded...                             
Untarring [windmill-0.9.15.tar.gz]...
    4 packages in the installation...
    4 nodes in the dependency tree...
[rwg@tier2-02 ~/windmill]$ 

Configuration

1) Configure Capone (make sure you've first set up GCE-Client)
 
% ln -fs $GCE_LOCATION lib/gce-client
 
Then edit etc/capone.ini (excerpts shown):
...
jobDir: /home/atlas/rwg/dc2
...
 
#Web Services configuration (host/port)
[ws]
port: 8043
host: tier2-02.uchicago.edu
 
########################################## 
#CPE configuration
##########################################
[cpe]
# 1 to Register the output files to RLS (0 not to)
regOutput: 1
 
#
#UChicago rls://grid01.uchicago.edu/            # use for development
#rliURL: rls://grid01.uchicago.edu 
#lrcURL: rls://grid01.uchicago.edu
#
#BNL      rls://atlasgrid02.usatlas.bnl.gov     # use for production
rliURL: rls://atlasgrid02.usatlas.bnl.gov 
lrcURL: rls://atlasgrid02.usatlas.bnl.gov
 
###########################################
#scheduler
###########################################
[scheduler]
#Possible scheduling policies: WRR RR WRC Wrandom random override
# 'WRR' Weighted RoundRobin
# 'RR' RoundRobin from the list of available Sites
# 'WRC' WeightedRandom with Consumption (order is random but the share is the same as WRR)
# 'Wrandom' Weighted random
# 'random' (default) Randomly select one CE
# 'override' always selects the defaultCE
policy: WRR
# Used only when 'override'
defaultCE: UC_ATLAS_Tier2
#To limit the CEs to choose from:
####################examples of CE lists and weights
#CEs: ANL_HEP ANL_Jazz BNL_ATLAS BU_ATLAS_Tier2 CalTech_Grid3 CalTech_PG FNAL_CMS FNAL_SDSS ISI IU_ATLAS_Tier2 IU_iuatlas JHopkins KNU PDSF UC_Grid3 UCSanDiego UCSanDiego_PG UFlorida_Grid3 UFlorida_PG UM_ATLAS UNM_HPC UTA_DPCC
#weightCEs: 1 0 30 86 24 132 277 0 35 208 4 0 50 449 3 0 592 40 82 29 426 218
#ces: ANL_HEP BNL_ATLAS BU_ATLAS_Tier2 CalTech_Grid3 CalTech_PG FNAL_CMS ISI IU_ATLAS_Tier2 PDSF UC_Grid3 UCSanDiego_PG UFlorida_Grid3
####################real lists:
CEs: BNL_ATLAS_BAK BU_ATLAS_Tier2 UC_ATLAS_Tier2 IU_ATLAS_Tier2 UTA_DPCC
weightCEs: 176 86 32 208 218
stageout: override
#double slash to patch current bug in stageout.bash
# Used when 'override' or when no hint from supervisor is available/possible
defaultSE: gsiftp://aftpexp.bnl.gov//usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://aftpexp01.bnl.gov//usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://aftpexp02.bnl.gov//usatlas/data01/prod/dc2/captest/
#1 To have a scheduling only log file
log: 1
logFile: var/scheduler.log
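
To see what share of jobs each site should receive under a weighted policy, the weights can be normalized. The following is a minimal Python sketch of that arithmetic using the "real lists" CEs and weightCEs values above. It is not Capone's scheduler code, and the function name ce_shares is hypothetical.

```python
def ce_shares(ces, weights):
    """Return {CE name: fraction of jobs} from space-separated CE and weight lists."""
    names = ces.split()
    values = [float(w) for w in weights.split()]
    if len(names) != len(values):
        raise ValueError("CEs and weightCEs must have the same number of entries")
    total = sum(values)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in zip(names, values)}


if __name__ == "__main__":
    # Values copied from the CEs / weightCEs lines of the sample capone.ini.
    shares = ce_shares(
        "BNL_ATLAS_BAK BU_ATLAS_Tier2 UC_ATLAS_Tier2 IU_ATLAS_Tier2 UTA_DPCC",
        "176 86 32 208 218",
    )
    # Print the sites from largest to smallest share.
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:16s} {share:6.1%}")
```

The sample weights sum to 720, so UTA_DPCC is assigned 218/720 (about 30%) of the jobs and UC_ATLAS_Tier2 32/720 (about 4%). The length check also catches a CEs list and a weightCEs list that have drifted out of step after editing.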
 

Storage element policy settings:

 
#Possible SE scheduling policies: RR random override
# 'RR' RoundRobin from the list of available Sites
# 'random' (default) Randomly select one SE
# 'override' selects always the defaultSE
# 'hint' uses the SE hint - not implemented
stageout: RR
#######fixed?double slash to patch current bug in stageout.bash
# Default, used when 'override' or when no hint from supervisor is available/possible
defaultSE: gsiftp://aftpexp.bnl.gov/usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://grid02.uchicago.edu/grid/data2a/ATLAS_SE/DC2/captest/
# SE list (space separated) aftpexp01.bnl.gov (VDT 1.1.14) aftpexp02.bnl.gov (VDT 1.1.13) aftpexp.bnl.gov
SEs: gsiftp://tier2-01.uchicago.edu/share/data2/atlas_SE/dc2/captest/ gsiftp://gridftp.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/ gsiftp://aftpexp01.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/ gsiftp://aftpexp02.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/ gsiftp://grid02.uchicago.edu/grid/data2a/ATLAS_SE/DC2/captest/

 

2) Configure GCE to use the MySQL VDC:

Edit <gce-install-dir>/GCE-Client/vds/etc/properties (Pacman3 install) or <gce-install-dir>/vds/etc/properties (Pacman2 install) as follows (example shown for production):
 
vds.replica.mode             rls
vds.rls.url                  rls://atlasgrid02.usatlas.bnl.gov
vds.tc.file                  /home/rwg/gce-client/gce-client/etc/tc.data
vds.pool.mode                xml
vds.pool.file                /home/rwg/gce-client/gce-client/etc/pool.config.xml
vds.home.localstatedir       /home/rwg/gce-client/gce-client/var
##################################
#To use the database, uncomment the following lines and comment out the vds.db.file.store line
vds.db.vdc.schema=ChunkSchema
vds.db.ptc.schema=InvocationSchema
vds.db.driver=MySQL
##################BNL production VDC
vds.db.driver.url=jdbc:mysql://db1.usatlas.bnl.gov/gce
vds.db.driver.user=gce_admin
vds.db.driver.password=gce_admin
##################UChicago development VDC
#vds.db.driver.url=jdbc:mysql://griddev.uchicago.edu/gce
#vds.db.driver.user=gce_admin
#vds.db.driver.password=gceadmin
#vds.db.file.store            /home/rwg/gce-client/gce-client/var/vds.db
##################################
vds.scheduler.remote.queues  UNM_HPC=usatlas,BNL_ATLAS=cas3, BNL_ATLAS_BAK=cas3
vds.transfer.mode            multiple
vds.exitcode.mode            all
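
Because the properties file mixes "key value" and "key=value" lines and keeps alternative VDC settings as comments, it is easy to leave the wrong database active. The following is a minimal Python sketch that reports which settings are in effect. It assumes a simplified reading of the properties format (no escapes or line continuations), and the function name parse_properties is hypothetical, not part of VDS or GCE.

```python
import re

# Excerpt of the sample properties file shown above (passwords omitted).
SAMPLE_PROPERTIES = """\
vds.replica.mode             rls
vds.rls.url                  rls://atlasgrid02.usatlas.bnl.gov
vds.db.driver=MySQL
##################BNL production VDC
vds.db.driver.url=jdbc:mysql://db1.usatlas.bnl.gov/gce
##################UChicago development VDC
#vds.db.driver.url=jdbc:mysql://griddev.uchicago.edu/gce
vds.scheduler.remote.queues  UNM_HPC=usatlas,BNL_ATLAS=cas3, BNL_ATLAS_BAK=cas3
"""


def parse_properties(text):
    """Parse uncommented 'key value' or 'key=value' lines into a dict."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out settings
        # Split once at the first separator: '=' or ':' (with optional
        # surrounding blanks), or a run of whitespace.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


if __name__ == "__main__":
    props = parse_properties(SAMPLE_PROPERTIES)
    print("RLS server:", props.get("vds.rls.url"))
    print("VDC URL:   ", props.get("vds.db.driver.url", "(not set; file store in use?)"))
```

To check a real installation, read the text from the properties file instead of SAMPLE_PROPERTIES and confirm that the RLS server and VDC URL both point at the same (production or development) setup.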
 

2a) If you selected the local VDC option, install the ATLAS transformations into your local VDC.  (DC2 production managers: if you selected the production VDC at BNL, skip this step.  Note: Xin Zhao is the official manager of the production VDC; contact him if you need a new transformation installed.  If you attempt to execute this step anyway, expect to see Java stack trace errors.  Don't worry: no harm is done to the database and nothing is overwritten.)

Notes: Make sure you've set up gce-client.  You might get a Java object heap error (shown below, with fix).
 
[rwg@tier2-02 ~]$ install-tr atlas
Error occurred during initialization of VM
Could not reserve enough space for object heap
[install-tr]: vdlt2vdlx failed

FIX:

[rwg@tier2-02 ~]$ setenv VDS_JAVA_HEAPMAX 1024

OLD EXAMPLE:

grid02> install-tr atlas
2004.12.06 09:46:18.835 CST: [app] parsing "atlas.vdx"
2004.12.06 09:46:19.411 CST: [app] Adding evgenx
2004.12.06 09:46:19.424 CST: [app] Adding g4digitx
2004.12.06 09:46:19.450 CST: [app] Adding g4simx
2004.12.06 09:46:19.476 CST: [app] Adding g4simxM
2004.12.06 09:46:19.498 CST: [app] Adding pileup
2004.12.06 09:46:19.502 CST: [app] Adding dd
2004.12.06 09:46:19.505 CST: [app] Adding ddm
2004.12.06 09:46:19.513 CST: [app] Adding user_exe
2004.12.06 09:46:19.517 CST: [app] modified 8 definitions
2004.12.06 09:46:19.518 CST: [app] rejected 0 definitions
 

3) Configure Windmill.  See sample windmill.xml.

Edit the file <Windmill-install-dir>/Windmill/windmill-0.9.15/data/windmill.capone and save it with the name windmill.xml. Choose capone as the executor type, 8.0.5.1 as the ATLAS transformation version, and Pythia event generation (evgen) as the implementation. (Note: for DC2 production we no longer use the 'uses' and 'implementation' lines; you need to delete them.)
 
<windmill>
<!--
exetype - choose executor type
             eg. rocinante/capone/lexor/nordugrid/pbs/testdrive/testgrid...
   grid - optional, specify local grid LCG, NORDUGRID, GRID3
   uses - optional, to request specific transformation or package
   implementation - optional, to request specific job type
   currentstate - optional, use only if requested by production manager
-->
<exetype>capone</exetype>
<uses>8.0.5.1</uses>
<grid>GRID3</grid>
<implementation>dc2.evgen.pythia</implementation>

Set the fakejobs flag to false (this will submit jobs to the grid):
 
<!--
fakejobs (true=1, false=0)
-->
<fakejobs>0</fakejobs>
 

Set the maximum number of jobs to be sent to the executor at one time (maxjobs).  This should be set to something modest (like 2, or at most 10), since the overhead on submission leads to high loads on the submit host.  maxsent is the maximum number of jobs that Windmill will send to Capone without notification of acceptance; it should not be necessary to change this parameter.  maxstarted is the maximum number of jobs that can be in execution at any given time.  For long-running jobs (jobs with long execution times compared to the time required to submit), this should be set to the maximum desired, or to the maximum that can be reliably managed by your submit host.

 
<!--
   maxjobs - maximum number of jobs in each block to send for execution
   maxsent - maximum total number of jobs waiting for execution
   maxstarted - maximum total number of jobs currently in execution
   -->
   <maxjobs>2</maxjobs>
   <maxsent>100</maxsent>
   <maxstarted>100</maxstarted>

The other important parameters which control the submission rate (for a fixed response to the numJobsWanted query, which is what Capone currently implements) are the polling parameters:
 
<!--
   jobPolling - initial interval in secs to poll for jobsWanted
   infoPolling - interval in secs to ask for infoExecutor
   statusPolling - initial interval in secs to poll for jobStatus
   -->
   <jobPolling>30</jobPolling>
   <infoPolling>100</infoPolling>
   <statusPolling>600</statusPolling>
   <verifyPolling>600</verifyPolling>

The interval between supervisor queries to the executor for numJobsWanted is:

maxjobs * jobPolling = seconds between numJobsWanted queries

Job submission rate (jobs per second): min(maxjobs, numJobsWanted) / (maxjobs * jobPolling)

To increase this rate, reduce jobPolling.
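
As a worked example of the two formulas above, here is a minimal Python sketch. maxjobs=2 and jobPolling=30 come from the sample windmill.xml; the numJobsWanted reply of 10 is an assumed value chosen for illustration, and the function names are hypothetical.

```python
def query_interval_seconds(maxjobs, job_polling):
    """Seconds between successive numJobsWanted queries."""
    return maxjobs * job_polling


def jobs_per_minute(maxjobs, job_polling, num_jobs_wanted):
    """Submission rate from the formula above, converted to jobs per minute."""
    sent_per_query = min(maxjobs, num_jobs_wanted)
    return 60.0 * sent_per_query / query_interval_seconds(maxjobs, job_polling)


if __name__ == "__main__":
    # Compare the sample jobPolling value with a halved one.
    for job_polling in (30, 15):
        interval = query_interval_seconds(2, job_polling)
        rate = jobs_per_minute(2, job_polling, num_jobs_wanted=10)
        print(f"jobPolling={job_polling}: query every {interval} s, {rate} jobs/minute")
```

With the sample values this gives one query every 60 seconds and 2 jobs per minute; halving jobPolling to 15 doubles the rate to 4 jobs per minute.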

IMPORTANT: Set the Jabber account information with a unique <resource> tag (in this example, rwg1) for every new session.  Otherwise, Windmill will confuse jobs from two users running off the same submit host.  If you start a new instance or session, Windmill will try to give you jobs from a previous session if you don't change this.  Right now, recovery of past failures is not supported by Capone.
 

 
<!--
supervisor - jabber account information
-->
<supervisor>
<name>supervisor</name>
<pass>insider</pass>
<resource>rwg1</resource>
</supervisor>
<!--
executor - jabber account information
-->
<executor>
<name>executor</name>
<pass>insider</pass>
<resource>rwg1</resource>
</executor>

Choose the database source of jobs (oracle = DC2 databases, production and development; xxxxx is the corresponding password):
 
  <!--
   choose database type (true=1, false=0)
   -->
   <fakedb>0</fakedb>
   <oracle>1</oracle>
   <oraconnection>atlas_prodsys/xxxxx/@atlas</oraconnection>
   <mysql>0</mysql>

Modify the executor.wsdl file to point to the host where you will be running Capone (the host and port must match the [ws] section of capone.ini).
The file is <Windmill-install-dir>/Windmill/windmill-0.9.15/capone/executor.wsdl.
 
<service name="executorService">
   <documentation>WSDL file for executorService</documentation>
   <port name="executorPort" binding="tns:executorBinding">
      <SOAP:address location="http://tier2-02.uchicago.edu:8043"/>
   </port>
</service>
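
Since the address in executor.wsdl must agree with the [ws] host and port in capone.ini, a quick consistency check can catch a mismatch before anything is started. The following is a minimal Python sketch run on excerpts of the two sample files. The function names are hypothetical, and a complete capone.ini may need extra care if it contains wrapped lines that Python's configparser does not accept.

```python
import configparser
import re
from urllib.parse import urlparse

# Excerpts of the sample capone.ini and executor.wsdl shown in this document.
SAMPLE_INI = """\
[ws]
port: 8043
host: tier2-02.uchicago.edu
"""

SAMPLE_WSDL = """\
<service name="executorService">
   <port name="executorPort" binding="tns:executorBinding">
      <SOAP:address location="http://tier2-02.uchicago.edu:8043"/>
   </port>
</service>
"""


def ws_endpoint(ini_text):
    """Return (host, port) from the [ws] section of capone.ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get("ws", "host").strip().lower(), parser.getint("ws", "port")


def wsdl_endpoint(wsdl_text):
    """Return (host, port) from the SOAP:address location in WSDL text."""
    # A regular expression is used instead of an XML parser because the
    # excerpt does not declare the SOAP namespace prefix.
    match = re.search(r'<SOAP:address\s+location="([^"]+)"', wsdl_text)
    if match is None:
        raise ValueError("no SOAP:address location found")
    url = urlparse(match.group(1))
    return url.hostname, url.port


def endpoints_match(ini_text, wsdl_text):
    """True when capone.ini and executor.wsdl name the same host and port."""
    return ws_endpoint(ini_text) == wsdl_endpoint(wsdl_text)


if __name__ == "__main__":
    print("capone.ini   :", ws_endpoint(SAMPLE_INI))
    print("executor.wsdl:", wsdl_endpoint(SAMPLE_WSDL))
    print("match        :", endpoints_match(SAMPLE_INI, SAMPLE_WSDL))
```

To check a real installation, read the text of your own capone.ini and executor.wsdl in place of the two sample strings.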

Get Ready to Run

You need to start three processes:
  1. one is for the supervisor
  2. one is for the jabber web service proxy
  3. one is for the executor web service (capone)
These can be on three separate machines.  Capone must run on the machine corresponding to the one selected in capone.ini and executor.wsdl files, as specified above.   For simplicity, one can do all this from a single host with three separate windows.

0) Temporary fix to avoid a Java stack heap limitation (hopefully fixed in next VDT version):
 

 
% setenv VDS_JAVA_HEAPMAX 1024
 

1) Setup your grid environment 

Go to your GCE-Client install directory, and do %source setup.csh.   Create a grid proxy certificate valid for two days.
 
%grid-proxy-init -valid 48:00
Your identity: /DC=org/DC=doegrids/OU=People/CN=Robert W. Gardner Jr. 669916
Enter GRID pass phrase for this identity:
Creating proxy ............................................ Done
Your proxy is valid until: Fri May 14 17:12:03 2004
%
 

2) Start condor.  You might first check if you have one running already, eg:  %ps -ef | grep rwg.
 

 
% condor_master

This sets up the local Condor-G queue, which is the queue to Grid3 sites.  To turn the Condor queue off when finished:
 
% condor_off -master

3) Start the Capone web service (this must be the first process started).  This example shows how to start capone as a background process:
 
 
 
[rwg@tier2-02 capone]$ ./capone bgstart
Starting Capone
Pid:[rwg@tier2-02 capone]$ 
[rwg@tier2-02 capone]$ 
[rwg@tier2-02 capone]$ ./capone check
Capone running, PIDs:7024
7026
7027

4) Start the Jabber proxy process (this connects the executor web service to the Jabber message switch):
 
 
%<install-dir>/Windmill/windmill-0.9.15/launch_executor

5) Finally, start the supervisor process itself:
 
 
%<install-dir>/Windmill/windmill-0.9.15/launch_supervisor
{Type "print" in the window to enable debugging}

Monitoring Jobs

Things should start happening automatically: the supervisor pulls jobs from the production database and feeds them to Capone, which submits them to Grid3 using Chimera/Pegasus/Condor-G.  You should see lots of messages in each of the windows, most of them self-explanatory.
 
%condor_q
 
This gives you information about the jobs in your queue. You can get globus specific information by doing:
 
%condor_q -globus

You should see a plot of the number of running jobs by VO.  The jobs submitted under Capone will appear as "usatlas1" in a bar chart.  If you hover the mouse over the bar, you'll get a summary of current processing.  You can drill down for more information by clicking on the bar.

In the left menu window, look for jobs by VO and select the ATLAS subfolder under "VO JOBS".  Clicking on "Real Time" gives the number of running ATLAS jobs by site.

This is a Java client program which queries the MonALISA database at the iGOC and produces various useful plots.  The program can be downloaded using Java Web Start.

Troubleshooting (this needs to be greatly expanded)

Jobs pending

Check the Capone status file:
<capone-install>/Capone-only/capone/var/cpe<capone ID>.status

eg:
<capone-install>/Capone-only/capone/var/cpeCPE_8320_.status
 

Sometimes there are very long lines in this file.  In that case, do:
%cut -c1-80 cpeCPE_8320_.status
 

The job status, from the Capone point of view, is indicated there.

Complete Capone logging information, by job, is located in this directory:
<capone-install-dir>/Capone-only/capone/var/jobs

But Capone only knows about jobs from what it can receive from condor_q commands.  If jobs are in an "I" (idle) state, then you'll have to dig deeper into various condor logfiles.

Authentication problems

Test GRAM authentication to a site gatekeeper, eg:

%globusrun -a -r atlas.bu.edu
 GRAM Authentication test successful

Upgrading your GCE-Client installation without Pacman

This is recommended for rapid updates which don't require a re-installation of the VDT on your submit host.  Note -- if you have a shared file system between multiple submit hosts, you must do a separate Pacman install for each submit host.  (We will fix this in future releases of GCE).
% wget http://grid.uchicago.edu/caches/gcl/tarballs/gce-client/gce-client-0.4.3.tar.gz


% tar -xzf gce-client-0.4.3.tar.gz

Upgrading your Capone without Pacman

Go to your capone directory, then do:
 
[rwg@tier2-04 capone]$ make update
Check the Makefile or type make info after the update to know your new version.
rm -f ../capone-current.tar.*
rm -f ../capone-current.tar.*
cd ..; /usr/bin/wget http://grid.uchicago.edu/caches/gcl//tarballs/capone/capone-current.tar.gz
--02:32:24--  http://grid.uchicago.edu/caches/gcl/tarballs/capone/capone-current.tar.gz
           => `capone-current.tar.gz'
Resolving grid.uchicago.edu... done.
Connecting to grid.uchicago.edu[128.135.102.67]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3,647,554 [application/x-tar]
100%[==============================================================>] 3,647,554     11.26M/s    ETA 00:00
02:32:24 (11.26 MB/s) - `capone-current.tar.gz' saved [3647554/3647554]
mv etc/capone.ini etc/capone.ini.preupdate
mv lib/gce-client lib/gce-client.preupdate
cd ..; /bin/tar -xzf capone-current.tar.gz
mv etc/capone.ini.preupdate etc/capone.ini
rm lib/gce-client
mv lib/gce-client.preupdate lib/gce-client

[rwg@tier2-04 capone]$ make info
Capone version 0.4.12, tag v0_4_12
Expected: WM 0.8, GCE-Client 0.5.1

Upgrading pool.config and tc.data

To update pool.config and tc.data, issue 'make updatesites' from the gce-client directory.  This gets the latest pool.config and tc.data templates from CVS and applies the changes needed to make them local.  The script that does this is gce-client/install-siteconfig.sh.
 


R. Gardner (rwg at uchicago dot edu)
Copyright © 2004  [University of Chicago]. All rights reserved.
12/08/04