Windmill, Capone, and GCE: How-To Execute Jobs on Grid3
Install and test GCE-Client
1) Working area
%cd /grid/data2a/users/rwg/
2) Install Pacman
% mkdir Pacman-latest
% cd Pacman-latest
% wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.0100.tar.gz
% tar -xvzf pacman-3.0100.tar.gz
% cd pacman-3.0100
% source ./setup.csh
/grid/data2a/users/rwg/Pacman-latest/pacman-3.0100
Your Python is version 1.5.2. Building Python 2.2.3 for Pacman:
Untarring...
Building..
Cleaning up...
Python [2.2.3] has been built.
Source setup.csh(sh) one more time and you're ready to use Pacman 3.
% source ./setup.csh
/grid/data3a/users/rwg/Pacman-latest/pacman-3.0100
NOTE (06-Dec-2004): Pacman3 is still having problems installing VDT-1.2.x. UNTIL FURTHER NOTICE, USE Pacman version 2.
2.1) Install Pacman version 2.126 in a fashion similar to the above.
% wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-2.126.tar.gz
% (etc)
3) Install GCE-Client
% pacman -get GCL:GCE-Client
(takes about 20-40 minutes depending on network)
The amount of storage needed for this package is about 560 Mbytes.
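Because the install needs roughly 560 MB, it can be worth checking the free space in the working area before running pacman. The following is a minimal Python sketch of such a check. It is not part of Pacman or GCE-Client; the helper name has_room and the use of the current directory as a stand-in for the working area are assumptions made for illustration.

```python
import shutil

REQUIRED_MB = 560  # approximate GCE-Client footprint quoted above


def has_room(path, required_mb=REQUIRED_MB):
    """Return True if the filesystem holding `path` has at least required_mb free."""
    free_mb = shutil.disk_usage(path).free / (1024 * 1024)
    return free_mb >= required_mb


if __name__ == "__main__":
    # "." stands in for your working area, e.g. /grid/data2a/users/rwg/
    print("enough space" if has_room(".") else "not enough space")
```

Run it from the directory where you plan to install; the threshold is only the figure quoted above, so leave some margin.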
Afterwards, set up the environment and check the installed versions:
[rwg@tier2-02 ~/gce-client]$ stripsetup
[rwg@tier2-02 ~/gce-client]$ source setup.csh
[rwg@tier2-02 ~/gce-client]$ vdt-version
You have installed the complete VDT version 1.2.1:
Virtual Data System 1.2.13
ClassAds 0.9.5
Condor 6.6.6
EDG CRL Update 1.2.5
EDG Make Gridmap 2.1.0
Fault Tolerant Shell (ftsh) 2.0.5
Globus 2.4.3 plus patches
GLUE Information providers, (CVS version 1.79, 4-April-2004)
GLUE Schema 1.1, extended version 1
GPT 3.1
GSI-Enabled OpenSSH 3.4
Java SDK 1.4.2_05
KX509 2031111
Logrotate 3.3
Monalisa 1.2.12
MyProxy 1.11
Netlogger 2.2
PyGlobus 1.0
PyGlobusURLCopy 1.1.2.11
RLS 2.1.5
UberFTP 1.3
[rwg@tier2-02 ~/gce-client]$ vds-version
1.2.13
[rwg@tier2-02 ~/gce-client]$ gce-version
++++++++++++++++++++++++++++++++++++++++++++
GCE Client 0.5.43 installed at Fri Nov 12 13:40:20 CDT 2004
++++++++++++++++++++++++++++++++++++++++++++
Install Capone
Create a separate directory, cd to it, and run the command below. (Note: you must use Pacman3 for Capone-only, and you must set up GCE-Client first.)
[rwg@tier2-02 ~/Capone-only]$ pacman -get GCL:Capone-only
Do you want to add [GCL] to [trusted.caches]? (y or n): y
Package [Capone-only] found in [GCL]...
Downloading [capone-0.6.11.tar.gz] from [grid.uchicago.edu]...
874/874 k downloaded...
Untarring [capone-0.6.11.tar.gz]...
[rwg@tier2-02 ~/Capone-only]$
Install Windmill
Create another separate directory for Windmill and run the following (this must also be done with Pacman3):
[rwg@tier2-02 ~/windmill]$ pacman -get UTA:Windmill
Do you want to trust the cache: [UTA]? (y or n): y
Package [Windmill] found in cache [http://heppc12.uta.edu/pacman/]...
Do you want to trust the cache: [BU-ATLAS]? (y or n): y
Package [DC2-Base] found in cache [http://atlas.bu.edu/caches/]...
Package [DC2-Registry] found in cache [http://atlas.bu.edu/caches/]...
Package [Atlas-Snapshots] found in cache [http://atlas.bu.edu/caches/]...
Downloading [windmill-0.9.15.tar.gz] from [www-hep.uta.edu]...
14/14 Megs downloaded...
Untarring [windmill-0.9.15.tar.gz]...
4 packages in the installation...
4 nodes in the dependency tree...
[rwg@tier2-02 ~/windmill]$
Configuration
1) Configure Capone (make sure you've set up GCE-Client first)
- Tell Capone where to find GCE-Client. In <capone-install-dir>/Capone-only/capone, do:
% ln -fs $GCE_LOCATION lib/gce-client
- Go to <capone-install-dir>/Capone-only/capone/etc and edit capone.ini (see the parameters and comments therein).
- I changed the job output directory (THIS IS HIGHLY RECOMMENDED!):
...
jobDir: /home/atlas/rwg/dc2
...
- I also changed the Capone web service host and port so as not to interfere with other users. The same host and port must also be set in the executor WSDL file in Windmill (see below).
#Web Services configuration (host/port)
[ws]
port: 8043
host: tier2-02.uchicago.edu
- In the CPE configuration, register output files in RLS:
##########################################
#CPE configuration
##########################################
[cpe]
#1 to Register the output files to RLS (0 not to)
regOutput: 1
- Select RLS servers for query (rliURL) and registration (lrcURL):
#
#UChicago rls://grid01.uchicago.edu/ # use for development
#rliURL: rls://grid01.uchicago.edu
#lrcURL: rls://grid01.uchicago.edu
#
#BNL rls://atlasgrid02.usatlas.bnl.gov # use for production
rliURL: rls://atlasgrid02.usatlas.bnl.gov
lrcURL: rls://atlasgrid02.usatlas.bnl.gov
- Important [scheduler] configuration settings:
  - Scheduling algorithm: the DC2 recommendation is weighted round robin (WRR).
  - Compute elements (CEs) and the default storage element (defaultSE) that you want Capone to select. Examples are shown.
###########################################
#scheduler
###########################################
[scheduler]
#Possible scheduling policies: WRR RR WRC Wrandom random override
# 'WRR' WeightedRR
# 'RR' RoundRobin from the list of available Sites
# 'WRC' WeightedRandom with Consumption (order is random but the share is the same of WRR)
# 'Wrandom' Weighted random
# 'random' (default) Randomly select one CE
# 'override' selects always the defaultCE
policy: WRR
# Used only when 'override'
defaultCE: UC_ATLAS_Tier2
#To limit the CEs to choose from:
####################examples of CE lists and weights
#CEs: ANL_HEP ANL_Jazz BNL_ATLAS BU_ATLAS_Tier2 CalTech_Grid3 CalTech_PG FNAL_CMS FNAL_SDSS ISI IU_ATLAS_Tier2 IU_iuatlas JHopkins KNU PDSF UC_Grid3 UCSanDiego UCSanDiego_PG UFlorida_Grid3 UFlorida_PG UM_ATLAS UNM_HPC UTA_DPCC
#weightCEs: 1 0 30 86 24 132 277 0 35 208 4 0 50 449 3 0 592 40 82 29 426 218
#ces: ANL_HEP BNL_ATLAS BU_ATLAS_Tier2 CalTech_Grid3 CalTech_PG FNAL_CMS ISI IU_ATLAS_Tier2 PDSF UC_Grid3 UCSanDiego_PG UFlorida_Grid3
####################real lists:
CEs: BNL_ATLAS_BAK BU_ATLAS_Tier2 UC_ATLAS_Tier2 IU_ATLAS_Tier2 UTA_DPCC
weightCEs: 176 86 32 208 218
stageout: override
#double slash to patch current bug in stageout.bash
# Used when 'override' or when no hint from supervisor is available/possible
defaultSE: gsiftp://aftpexp.bnl.gov//usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://aftpexp01.bnl.gov//usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://aftpexp02.bnl.gov//usatlas/data01/prod/dc2/captest/
#1 To have a scheduling only log file
log: 1
logFile: var/scheduler.log
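To illustrate what a weighted round-robin policy does with the CEs and weightCEs lists above, here is a small Python sketch. It uses the well-known "smooth" weighted round-robin scheme as a stand-in; Capone's actual WRR implementation is not shown in this document and may differ, and the function name is hypothetical.

```python
from collections import Counter
from itertools import islice

# Values copied from the "real lists" in the [scheduler] section above.
CES = ["BNL_ATLAS_BAK", "BU_ATLAS_Tier2", "UC_ATLAS_Tier2", "IU_ATLAS_Tier2", "UTA_DPCC"]
WEIGHTS = [176, 86, 32, 208, 218]


def weighted_round_robin(names, weights):
    """Yield names forever, each in proportion to its weight.

    Illustrative smooth weighted round robin; NOT Capone's own code.
    """
    total = sum(weights)
    current = [0] * len(names)
    while True:
        # Every site earns credit equal to its weight on each round.
        for i, weight in enumerate(weights):
            current[i] += weight
        # The site with the most credit is chosen and pays back the total.
        best = max(range(len(names)), key=lambda i: current[i])
        current[best] -= total
        yield names[best]


if __name__ == "__main__":
    picks = Counter(islice(weighted_round_robin(CES, WEIGHTS), sum(WEIGHTS)))
    for name, weight in zip(CES, WEIGHTS):
        print(name, "picked", picks[name], "times for weight", weight)
```

Over one full cycle (as many picks as the sum of the weights), each site is chosen as many times as its weight, which is the share the weightCEs line is meant to express.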
Storage element policy settings:
#Possible SE scheduling policies: RR random override
# 'RR' RoundRobin from the list of available Sites
# 'random' (default) Randomly select one SE
# 'override' selects always the defaultSE
# 'hint' uses the SE hint - not implemented
stageout: RR
#######fixed?double slash to patch current bug in stageout.bash
# Default, used when 'override' or when no hint from supervisor is available/possible
defaultSE: gsiftp://aftpexp.bnl.gov/usatlas/data01/prod/dc2/captest/
#defaultSE: gsiftp://grid02.uchicago.edu/grid/data2a/ATLAS_SE/DC2/captest/
# SE list (space separated) aftpexp01.bnl.gov (VDT 1.1.14) aftpexp02.bnl.gov (VDT 1.1.13) aftpexp.bnl.gov
SEs: gsiftp://tier2-01.uchicago.edu/share/data2/atlas_SE/dc2/captest/
gsiftp://gridftp.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/
gsiftp://aftpexp01.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/
gsiftp://aftpexp02.usatlas.bnl.gov/usatlas/data01/prod/dc2/captest/
gsiftp://grid02.uchicago.edu/grid/data2a/ATLAS_SE/DC2/captest/
2) Configure GCE to use the MySQL VDC:
Edit <gce-install-dir>/GCE-Client/vds/etc/properties (Pacman3 install) or <gce-install-dir>/vds/etc/properties (Pacman2 install) as follows (the example shown is for production):
vds.replica.mode rls
vds.rls.url rls://atlasgrid02.usatlas.bnl.gov
vds.tc.file /home/rwg/gce-client/gce-client/etc/tc.data
vds.pool.mode xml
vds.pool.file /home/rwg/gce-client/gce-client/etc/pool.config.xml
vds.home.localstatedir /home/rwg/gce-client/gce-client/var
##################################
#To use the database uncomment the following lines and comment the vds.db.file.store
vds.db.vdc.schema=ChunkSchema
vds.db.ptc.schema=InvocationSchema
vds.db.driver=MySQL
##################BNL production VDC
vds.db.driver.url=jdbc:mysql://db1.usatlas.bnl.gov/gce
vds.db.driver.user=gce_admin
vds.db.driver.password=gce_admin
##################UChicago development VDC
#vds.db.driver.url=jdbc:mysql://griddev.uchicago.edu/gce
#vds.db.driver.user=gce_admin
#vds.db.driver.password=gceadmin
#vds.db.file.store /home/rwg/gce-client/gce-client/var/vds.db
##################################
vds.scheduler.remote.queues UNM_HPC=usatlas,BNL_ATLAS=cas3, BNL_ATLAS_BAK=cas3
vds.transfer.mode multiple
vds.exitcode.mode all
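Because the properties file mixes "key value" and "key=value" lines, and keeps the alternative VDC commented out, it is easy to lose track of which settings are active. The sketch below is a small, hypothetical Python helper that lists the uncommented settings. It is not part of VDS, and its parsing rule (split on the first "=" or whitespace) is an assumption about the file format based only on the excerpt above.

```python
import re

# A shortened copy of the excerpt above, used as sample input.
SAMPLE = """\
vds.replica.mode rls
vds.rls.url rls://atlasgrid02.usatlas.bnl.gov
vds.db.driver=MySQL
vds.db.driver.url=jdbc:mysql://db1.usatlas.bnl.gov/gce
#vds.db.driver.url=jdbc:mysql://griddev.uchicago.edu/gce
#vds.db.file.store /home/rwg/gce-client/gce-client/var/vds.db
vds.scheduler.remote.queues UNM_HPC=usatlas,BNL_ATLAS=cas3, BNL_ATLAS_BAK=cas3
"""

# Key, then the first "=" or whitespace as separator, then the rest as the value.
LINE = re.compile(r"^([^\s=]+)\s*[=\s]\s*(.*)$")


def active_settings(text):
    """Return a dict of the uncommented key/value pairs in a properties text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out alternatives
        match = LINE.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


if __name__ == "__main__":
    for key, value in active_settings(SAMPLE).items():
        print(key, "->", value)
```

With the sample input, only the BNL production database URL shows up as active, and vds.db.file.store is absent because it is commented out.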
2a) If you selected the local VDC option, install the ATLAS transformations into your local VDC. (DC2 production managers: if you selected the production VDC at BNL, skip this step. Xin Zhao is the official manager of the production VDC; contact him if you need a new transformation installed. If you attempt this step against the production VDC anyway, expect to see Java stack trace errors. Don't worry: no harm is done to the database and nothing is overwritten.)
Notes: Make sure you've set up gce-client. You might get a Java object heap error (shown below, with fix).
[rwg@tier2-02 ~]$ install-tr atlas
Error occurred during initialization of VM
Could not reserve enough space for object heap
[install-tr]: vdlt2vdlx failed
FIX:
[rwg@tier2-02 ~]$ setenv VDS_JAVA_HEAPMAX 1024
OLD EXAMPLE:
grid02> install-tr atlas
2004.12.06 09:46:18.835 CST: [app] parsing "atlas.vdx"
2004.12.06 09:46:19.411 CST: [app] Adding evgenx
2004.12.06 09:46:19.424 CST: [app] Adding g4digitx
2004.12.06 09:46:19.450 CST: [app] Adding g4simx
2004.12.06 09:46:19.476 CST: [app] Adding g4simxM
2004.12.06 09:46:19.498 CST: [app] Adding pileup
2004.12.06 09:46:19.502 CST: [app] Adding dd
2004.12.06 09:46:19.505 CST: [app] Adding ddm
2004.12.06 09:46:19.513 CST: [app] Adding user_exe
2004.12.06 09:46:19.517 CST: [app] modified 8 definitions
2004.12.06 09:46:19.518 CST: [app] rejected 0 definitions
3) Configure Windmill. See sample windmill.xml.
Edit the file <Windmill-install-dir>/Windmill/windmill-0.9.15/data/windmill.capone and save it with the name windmill.xml. Choose capone as the executor type, 8.0.5.1 as the ATLAS transformation version, and Pythia event generation (evgen). (Note: for DC2 production we no longer use the 'uses' and 'implementation' lines; you need to delete them.)
<windmill>
<!--
exetype - choose executor type
eg. rocinante/capone/lexor/nordugrid/pbs/testdrive/testgrid...
grid - optional, specify local grid LCG, NORDUGRID, GRID3
uses - optional, to request specific transformation or package
implementation - optional, to request specific job type
currentstate - optional, use only if requested by production manager
-->
<exetype>capone</exetype>
<uses>8.0.5.1</uses>
<grid>GRID3</grid>
<implementation>dc2.evgen.pythia</implementation>
Set the fakejobs flag to false (this will submit jobs to the grid):
<!--
fakejobs (true=1, false=0)
-->
<fakejobs>0</fakejobs>
Set the maximum number of jobs to be sent to the executor at one time (maxjobs). This should be set to something modest (like 2, or at most 10), since the overhead of submission leads to high loads on the submit host. maxsent is the maximum number of jobs that Windmill will send to Capone without notification of acceptance; it should not be necessary to change this parameter. maxstarted is the maximum number of jobs that can be in execution at any given time. For long-running jobs (jobs with long execution times compared to the time required to submit them), this should be set to the maximum desired, or to the maximum that can be reliably managed by your submit host.
<!--
maxjobs - maximum number of jobs in each block to send for execution
maxsent - maximum total number of jobs waiting for execution
maxstarted - maximum total number of jobs currently in execution
-->
<maxjobs>2</maxjobs>
<maxsent>100</maxsent>
<maxstarted>100</maxstarted>
The other important parameters that control the submission rate (for a fixed response to the numJobsWanted query, which is what Capone currently implements) are the polling parameters:
<!--
jobPolling - initial interval in secs to poll for jobsWanted
infoPolling - interval in secs to ask for infoExecutor
statusPolling - initial interval in secs to poll for jobStatus
-->
<jobPolling>30</jobPolling>
<infoPolling>100</infoPolling>
<statusPolling>600</statusPolling>
<verifyPolling>600</verifyPolling>
The interval at which the supervisor queries the executor for numJobsWanted is:
maxjobs * jobPolling = # seconds between numJobsWanted queries
Job submission rate: min(maxjobs, numJobsWanted) / (maxjobs * jobPolling)
To increase this, reduce jobPolling.
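As a worked example of the two formulas above, the following Python sketch plugs in the sample values from windmill.xml (maxjobs=2, jobPolling=30). The function names are hypothetical, and the value of numJobsWanted used here (10) is an assumption chosen only for illustration.

```python
def query_interval(maxjobs, job_polling):
    """Seconds between numJobsWanted queries: maxjobs * jobPolling."""
    return maxjobs * job_polling


def submission_rate(maxjobs, job_polling, num_jobs_wanted):
    """Jobs per second: min(maxjobs, numJobsWanted) / (maxjobs * jobPolling)."""
    return min(maxjobs, num_jobs_wanted) / query_interval(maxjobs, job_polling)


if __name__ == "__main__":
    maxjobs, job_polling, wanted = 2, 30, 10  # wanted=10 is an assumed value
    print("query interval (s):", query_interval(maxjobs, job_polling))
    print("jobs per minute:", 60 * submission_rate(maxjobs, job_polling, wanted))
    # Halving jobPolling doubles the rate for the same maxjobs.
    print("jobs per minute with jobPolling=15:", 60 * submission_rate(maxjobs, 15, wanted))
```

With these inputs the supervisor queries every 60 seconds and submits at most 2 jobs per minute; setting jobPolling to 15 raises that to 4 jobs per minute.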
IMPORTANT: Set the Jabber account information with a unique <resource> tag (in this example, rwg1) for every new session. Otherwise, Windmill will confuse jobs from two users running off the same submit host, and if you start a new instance or session, Windmill will try to give you jobs from a previous session. Right now, recovery of past failures is not supported by Capone.
<!--
supervisor - jabber account information
-->
<supervisor>
<name>supervisor</name>
<pass>insider</pass>
<resource>rwg1</resource>
</supervisor>
<!--
executor - jabber account information
-->
<executor>
<name>executor</name>
<pass>insider</pass>
<resource>rwg1</resource>
</executor>
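One way to keep the <resource> tag unique per session is to derive it from the user name and the session start time rather than incrementing a suffix by hand. The sketch below is only a suggestion: the function name and the tag format are assumptions, and nothing in Windmill requires this scheme. Use the same generated value in both the supervisor and executor blocks, as in the example above.

```python
import getpass
import time


def session_resource(user=None, now=None):
    """Build a resource tag such as 'rwg20041208153000' (user + start time).

    `user` defaults to the login name and `now` to the current time.
    Two sessions started by the same user within the same second would collide.
    """
    user = user or getpass.getuser()
    stamp = time.strftime("%Y%m%d%H%M%S", time.localtime(now))
    return f"{user}{stamp}"


if __name__ == "__main__":
    print("<resource>%s</resource>" % session_resource())
```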
Choose the database source for jobs (oracle = the DC2 databases, production and development; here xxxxx stands for the corresponding password):
<!--
choose database type (true=1, false=0)
-->
<fakedb>0</fakedb>
<oracle>1</oracle>
<oraconnection>atlas_prodsys/xxxxx/@atlas</oraconnection>
<mysql>0</mysql>
Modify the executor.wsdl file to point to the host where you will be running Capone (this must match capone.ini). The file is <Windmill-install-dir>/Windmill/windmill-0.9.15/capone/executor.wsdl.
<service name="executorService">
<documentation>WSDL file for executorService</documentation>
<port name="executorPort" binding="tns:executorBinding">
<SOAP:address location="http://tier2-02.uchicago.edu:8043"/>
</port>
</service>
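Since the host and port in executor.wsdl must match the [ws] section of capone.ini, a quick consistency check can catch a mismatch before anything is started. The Python sketch below compares the two using the excerpts shown in this document as sample input. The function names are hypothetical, and a full capone.ini may need more careful parsing than the standard configparser module provides; this only demonstrates the idea on the [ws] fragment.

```python
import configparser
import re

# Sample inputs copied from the excerpts in this document.
CAPONE_INI = """\
[ws]
port: 8043
host: tier2-02.uchicago.edu
"""

EXECUTOR_WSDL = """\
<service name="executorService">
<port name="executorPort" binding="tns:executorBinding">
<SOAP:address location="http://tier2-02.uchicago.edu:8043"/>
</port>
</service>
"""


def ini_endpoint(ini_text):
    """Return (host, port) from the [ws] section of a capone.ini fragment."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get("ws", "host"), parser.getint("ws", "port")


def wsdl_endpoint(wsdl_text):
    """Return (host, port) from the SOAP:address location in the WSDL text."""
    match = re.search(r'<SOAP:address\s+location="http://([^:/"]+):(\d+)', wsdl_text)
    if match is None:
        raise ValueError("no SOAP:address location with host:port found")
    return match.group(1), int(match.group(2))


def endpoints_match(ini_text, wsdl_text):
    """True when capone.ini and executor.wsdl name the same host and port."""
    return ini_endpoint(ini_text) == wsdl_endpoint(wsdl_text)


if __name__ == "__main__":
    print("match" if endpoints_match(CAPONE_INI, EXECUTOR_WSDL) else "MISMATCH")
```

To use it on a real installation, read the two files into strings and pass them to endpoints_match in place of the sample constants.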
Get Ready to Run
You need to start three processes:
- one for the supervisor
- one for the Jabber web service proxy
- one for the executor web service (Capone)
These can be on three separate machines. Capone must run on the machine selected in the capone.ini and executor.wsdl files, as specified above. For simplicity, you can do all of this from a single host with three separate windows.
0) Temporary fix to avoid a Java heap limitation (hopefully fixed in the next VDT version):
% setenv VDS_JAVA_HEAPMAX 1024
1) Set up your grid environment
Go to your GCE-Client install directory and do %source setup.csh. Create a grid proxy certificate valid for two days:
%grid-proxy-init -valid 48:00
Your identity: /DC=org/DC=doegrids/OU=People/CN=Robert W. Gardner Jr. 669916
Enter GRID pass phrase for this identity:
Creating proxy ............................................ Done
Your proxy is valid until: Fri May 14 17:12:03 2004
%
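When choosing the -valid value, it helps to check that the proxy will outlive your longest expected jobs. The Python sketch below computes the expiry time from a start time and an hours:minutes string like the one passed to grid-proxy-init above. The function names are hypothetical, and the start time in the example is an assumption (48 hours before the expiry shown in the sample output).

```python
from datetime import datetime, timedelta


def proxy_expiry(start, valid):
    """Return the expiry time for a proxy created at `start` with -valid 'H:M'."""
    hours, minutes = (int(part) for part in valid.split(":"))
    return start + timedelta(hours=hours, minutes=minutes)


def covers(start, valid, job_hours):
    """True if a proxy created at `start` is still valid job_hours later."""
    return proxy_expiry(start, valid) >= start + timedelta(hours=job_hours)


if __name__ == "__main__":
    created = datetime(2004, 5, 12, 17, 12, 3)  # assumed creation time
    print("valid until:", proxy_expiry(created, "48:00").strftime("%a %b %d %H:%M:%S %Y"))
    print("covers a 30-hour job:", covers(created, "48:00", 30))
```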
2) Start Condor. You might first check whether you already have one running, e.g.: %ps -ef | grep rwg.
This sets up the local Condor-G queue, which is the queue to Grid3 sites. To turn the Condor queue off when finished:
3) Start the Capone web service (this must be the first of the three processes to be started). This example shows how to start Capone as a background process:
[rwg@tier2-02 capone]$ ./capone bgstart
Starting Capone
Pid:[rwg@tier2-02 capone]$
[rwg@tier2-02 capone]$
[rwg@tier2-02 capone]$ ./capone check
Capone running, PIDs:7024
7026
7027
4) Start the Jabber proxy process (this connects the executor web service
to the Jabber message switch):
%<install-dir>/Windmill/windmill-0.9.15/launch_executor
5) Finally, start the supervisor process itself:
%<install-dir>/Windmill/windmill-0.9.15/launch_supervisor
{Type "print" in the window to enable debugging}
Monitoring Jobs
Things should start happening automatically: the supervisor pulls jobs from the production database and feeds them to Capone, which submits them to Grid3 using Chimera/Pegasus/Condor-G. You should see lots of messages in each of the windows, most of which are fairly self-explanatory.
- The first thing to do is look at the local Condor queue on the Capone submit host, for example with condor_q. This gives you information about the jobs in your queue. You can get Globus-specific information with condor_q -globus.
- Look at the production database:
You should see a plot of the number of
running jobs by VO. The jobs submitted under Capone will appear
as "usatlas1" in a bar chart. If you hover the mouse over the
bar, you'll get a summary of current processing. You can drill down
for more information by clicking on the bar.
Go to the left menu window, and look for
jobs by VO, select the ATLAS subfolder in "VO JOBS". Clicking on
"Real Time" you'll get the number of running ATLAS jobs by site.
This is a Java client program which queries
the MonALISA database at the iGOC and produces various useful plots.
The program can be downloaded using Java Web Start. Click here to download.
Troubleshooting (this needs to be greatly expanded)
Jobs pending
- %condor_q -globus reports that jobs are pending for an unreasonably long time.
- Capone-level system checks:
  - A periodic status summary is printed, but you can always look at the Capone status file for the most up-to-date information. This file is
    <capone-install>/Capone-only/capone/var/cpe<capone ID>.status
    e.g.:
    <capone-install>/Capone-only/capone/var/cpeCPE_8320_.status
    Sometimes there are very long lines in this file. In that case, do:
    %cut -c1-80 cpeCPE_8320_.status
    The job status, from the Capone point of view, is indicated there. Complete Capone logging information, by job, is located in this directory:
    <capone-install-dir>/Capone-only/capone/var/jobs
    But Capone only knows about jobs through what it receives from condor_q commands. If jobs are in an "I" (idle) state, then you'll have to dig deeper into the various Condor log files.
  - Capone log file: send this to Marco when you have problems. It is in
    <capone-install-dir>/Capone-only/capone/var/capone.log
    All jobs append to this file, so you may want to delete it from time to time.
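The cut -c1-80 command used above for the status file can also be reproduced in Python, which is handy if you want to post-process the status lines further. This is a minimal sketch with a hypothetical function name; it works on a string that you have already read from the status file.

```python
def truncate_lines(text, width=80):
    """Return the lines of `text`, each cut to at most `width` characters.

    Mirrors `cut -c1-80` for plain single-byte text.
    """
    return [line[:width] for line in text.splitlines()]


if __name__ == "__main__":
    sample = "x" * 200 + "\nshort line"  # stand-in for the status file contents
    for line in truncate_lines(sample):
        print(line)
```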
Authentication problems
- Of course you need to be in the gridmapfile on the compute site. From your submit host, make sure you can do something like:
%globusrun -a -r atlas.bu.edu
GRAM Authentication test successful
- Examine the log and error files in the run-time directory you specified in the Capone configuration.
Upgrading your GCE-Client installation without Pacman
This is recommended for rapid updates which don't require a re-installation
of the VDT on your submit host. Note -- if you have a shared file
system between multiple submit hosts, you must do a separate Pacman install
for each submit host. (We will fix this in future releases of GCE).
- Old instructions:
  - Go to your GCE client installation directory; source the setup.
  - cd to GCE-Client; rename the gce-client subdirectory.
  - Get the new gce client package directly from the pacman cache and untar:
    % tar -xzf gce-client-0.4.3.tar.gz
- New instructions:
  - Go to your GCE client installation directory; source the setup.
  - Do: %make update
  - cd ../gce-client
  - Now complete the install of the new gce-client: cd to the newly created gce-client directory.
    % install-gce-client.sh
    *Note* this overwrites the Chimera configuration file in $VDS_HOME/etc/properties. The previous version is copied to properties.sav. Most likely you'll want to revert to your saved version.
Upgrading your Capone without Pacman
- cd to your Capone install directory, then cd to <capone-install>/Capone-only/capone.
- Check that ./etc/capone.ini exists.
- Check that the ./lib/gce-client symlink exists.
Then do:
[rwg@tier2-04 capone]$ make update
Check the Makefile or type make info after the update to know your new version.
rm -f ../capone-current.tar.*
cd ..; /usr/bin/wget http://grid.uchicago.edu/caches/gcl//tarballs/capone/capone-current.tar.gz
--02:32:24-- http://grid.uchicago.edu/caches/gcl/tarballs/capone/capone-current.tar.gz
=> `capone-current.tar.gz'
Resolving grid.uchicago.edu... done.
Connecting to grid.uchicago.edu[128.135.102.67]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3,647,554 [application/x-tar]
100%[==============================================================>] 3,647,554 11.26M/s ETA 00:00
02:32:24 (11.26 MB/s) - `capone-current.tar.gz' saved [3647554/3647554]
mv etc/capone.ini etc/capone.ini.preupdate
mv lib/gce-client lib/gce-client.preupdate
cd ..; /bin/tar -xzf capone-current.tar.gz
mv etc/capone.ini.preupdate etc/capone.ini
rm lib/gce-client
mv lib/gce-client.preupdate lib/gce-client
[rwg@tier2-04 capone]$ make info
Capone version 0.4.12, tag v0_4_12
Expected: WM 0.8, GCE-Client 0.5.1
Upgrading pool.config and tc.data
To update pool.config and tc.data, you can issue 'make updatesites' from the gce-client directory. This will get the latest pool.config and tc.data templates from CVS and apply the changes needed to make them local. The script is gce-client/install-siteconfig.sh
R. Gardner (rwg at uchicago dot edu)
Copyright © 2004 [University of Chicago]. All rights reserved.
12/08/04