How to check results of the completed job:
1. Go to SE (BNL) and "cd" to the corresponding directory.
2. Do "ls -la | grep '.suffix'".
3. Identify any missing "regions" (partitions).
4. Go to the CE and "cd" to the corresponding "working directory".
5. Check whether the 3 corresponding output files are there.
6. Take a look at the ".log" file, if it exists.
7. Go to the Submit Host and "cd" to the "Client/run/'jobID' " directory.
8. Analyze the corresponding log files there.
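
Steps 2 and 3 can be scripted. The sketch below is only an illustration: it assumes that each output file name ends with an underscore, a zero-padded partition number and the suffix (for example "simulx_00011.suffix"), which is a hypothetical naming pattern, and the function name missing_partitions is made up for this note. Adjust the regular expression to the real file names.

```python
import re


def missing_partitions(filenames, expected, suffix=".suffix"):
    """Return the sorted partition numbers in `expected` that have no file.

    Assumption (not from the notes): a file name ends with
    '_<digits><suffix>', e.g. 'simulx_00011.suffix'.
    """
    pattern = re.compile(r"_(\d+)" + re.escape(suffix) + r"$")
    found = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            # int() drops the zero padding, so 00011 and 0011 compare equal
            found.add(int(match.group(1)))
    return sorted(set(expected) - found)


# Example use on the SE, after "cd" to the output directory:
#   import os
#   print(missing_partitions(os.listdir("."), range(1, 21)))
```

The expected range has to come from the job definition; the script cannot guess how many partitions were submitted.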
Things to develop and improve:
1. Stop the gencmydag when an authentication error occurs.
2. Does a new entry override the old one? For example, does
simulx_00011 (_xin_test1) override simulx_00011 (_yuri_test2), i.e. the
same short name but different long names for the two derivations?
3. How is the GASS cache used in Chimera?
4. Run an md5sum check before registration starts and after the file
transfer is done, so that RLS registration is skipped if the file
transfer fails.
5. Transformation management: manage transformations on the Compute
Element (updating the catalog if necessary), or keep them on the submit host?
6. .chimerarc prevents multiple submit packages from running from the
same host
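
Item 4 could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the GCE implementation: both copies of the file are assumed to be readable as local paths (in a real transfer the two checksums would be computed on different hosts), and the "register" argument is a hypothetical callback standing in for the real RLS registration step, not an RLS API.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_if_intact(source_path, copied_path, register):
    """Call register(copied_path) only when both checksums agree.

    `register` is a hypothetical callback for the registration step.
    Returns True if registration was attempted, False if it was skipped.
    """
    if md5_of(source_path) != md5_of(copied_path):
        return False  # transfer failed or file corrupted: do not register
    register(copied_path)
    return True
```

Returning False instead of raising lets the caller decide whether to retry the transfer or mark the partition as failed.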
GCE issues learned during SC2003:
- Understand the failures encountered during SC2003 demo runs
- Please see Yuri's list.
- One more thing from my test: condor_dagman processes hang
around forever on the submit host even though the jobs are all gone (done
or failed). I talked with Peter during SC2003; he asked me to extend
the length of the condor log files so that we can trace back what
happened. That has been done. I need to follow up on it with Peter.
- Some condor issues learned
- changes to condor_config file, like
- MAX_SCHEDD_LOG
- port number for negotiator/collector (maybe we don't even
need it)
- ......
- GridMonitor switches (BTW, Jens has a problem using GridMonitor
with Chimera, so we should hold off until the VDT people fix it. In VDT
1.1.13 ??)
- GCE
- use one big DAG to start multiple sub-DAG jobs instead of
starting one DAG per partition. Dan has done this for BTeV. This way
we can save local resources on the submit host and submit more jobs at
one time.
- some changes I had to make in GCE-Client in order to run top
samples. Basically they include: using the "skip" parameter; dealing
with the 4-digit partition numbers of the top sample; reducing the sleep
interval in gencmydag and genmydax.
- Jerry's changes to save monitoring information to the log file,
and to change the memory size parameter.
- use MySQL DB as VDC
- incorporate genpoolconfig and gentcdata into the chain; use $TMP or
$TMP_WN for the working directory
- Xin's simple subjob script, which uses a weighted round-robin
mechanism and works with the cookbook database
- new transformation for evgen
- quality assurance of output files
- CVS repository
- new name for GCE?
- ?? : should we put all the above changes into GCE, or re-design GCE
to avoid some of the long-standing pains, e.g. not hardwiring the file
name convention in the scripts, and running the whole chain
(evgen-simulx-reconx) at one time?
- RLS
- Need to upgrade to RLS 2.0.9 soon.
- Need to learn how to use the rls registration scripts.
- Remote install of GCE-Server
- Package is ready to use: "GCL:GCE-install"
- GCE on LCG
- new extra transformation to stage in input data to the SE via the
LCG ftp server, to be used by the worker node later on
- ......
- Requests to VDS developers
- add transfer verification mechanism
- add third-party transfer (Pegasus)
- ......
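
For reference, the weighted round-robin mechanism mentioned under GCE (Xin's subjob script) can be sketched as follows. This is a generic illustration of the technique (the "smooth" variant), not the actual script, and it does not talk to the cookbook database. The site names and weights are invented.

```python
def weighted_round_robin(weights, count):
    """Return `count` site names chosen by smooth weighted round-robin.

    `weights` maps a site name to a positive integer weight. Over any
    window of sum(weights) picks, each site is chosen exactly as many
    times as its weight, and heavy sites are spread out rather than
    picked in one long run.
    """
    current = {site: 0 for site in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(count):
        # Every site earns credit in proportion to its weight...
        for site, weight in weights.items():
            current[site] += weight
        # ...and the site with the most credit gets the next job.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


# Hypothetical example: siteA should receive three jobs for every one at siteB.
# weighted_round_robin({"siteA": 3, "siteB": 1}, 8)
```

In practice the weights would presumably reflect the capacity of each site; where they come from is outside the scope of this sketch.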