======================== Status in form of WBS =============================
1 Production Management
1.1 Meetings and Management
Meeting is now every other Friday from 10:30 to 12:00 in FCC2B
1.2 Provide input to Hardware Procurements - such that procurement is appropriate to the products of Production Management
1.2.1 Input for Farms
Status: Will be done after the Run II prototype is evaluated (mid-February) and before the start of the 1st phase of Run II Farms hardware procurement (latest date for input is May 1). The CDF/CD/D0 results using the prototype system are vital for being able to provide input to the procurement. Issues of data flow are the biggest concern because of their impact on local disk on worker nodes, network access to Mass Storage, and how the batch system will be used. By the end of January, CDF and D0 will provide updated CPU requirements and new scheduling information for their production ramp-up. If PCs are used, industrial shelves rather than racks look quite cost effective and promising. While Linux is still very promising, it has not been easy to get PCs into a production state. Getting E871 efficiency on PC farms above 80%, having accurate Linux performance tools, and understanding the NFS client/server issues would (in my own mind) clinch PC/Linux viability for Run II Farms.
1.2.2 Input for Operations Central Server (e.g. home for centralized accounting, central systems monitoring, central control for backups, and central operations access to other systems. Existing farmx would go away.)
Status: We (DCS Dept, CSS Group and FCS Group) have discussed this and believe that we are close to starting the deployment in February. The functions served by this host are:
a) Central collection point for Fermi Unix Accounting
b) Central "launch" point for operator scripts such as requesting tape drive cleaning, starting xoper using current OCS, and the FNALU LSF display
c) Xfalive monitoring
d) Home for the OCS database server that handles system backup hosts (dcdsv0 currently handles this)
e) Central backups of OCS database backups (hppc department server currently handles this)
f) Central patrol collection point for system monitoring (new function)
g) Central point for the "new" OCS operator screen (e.g. potentially Oracle client of fncdug)
h) Boot host for operator X-terminals
i) Backup boot host for other Fermilab X-terminals
j) Central syslog server (swatch)
k) Potentially the central server for key-password (e.g. root) software such as escrow
The physical attributes would be:
a) fnsf224, a Challenge S, would be quite suitable.
b) rename it to fncops
c) Upgrade the system disk to IRIX 6.5.2
d) Add two 9 GB external disk drives
e) Add an Eliant tape drive for backups
f) Locate it in FCC1 on house power
g) Give it 24x7 support (CSS would handle the core system; FCS would handle the FCS special software and coordination with DCS)
h) Limit access strictly to fnal
1.3 Evaluate Software Components for Data Delivery and Characterize them
Status: So far, CDF/D0 have not shown strong interest in working on this.
1.3.1 Nile User Interface
1.3.2 RFIO
1.4 FARMS
1.4.1 Farms Batch System (FBS - extension of LSF for Reconstruction and MonteCarlo Farms)
Status: Prototype (minus scratch disk space allocation) is ready and in use by E871 on PCs running Linux. E871's maximum efficiency is only around 50% when they are actively running. We are working with them to get this up to an acceptable level. We suspect/know that there are some FBS deficiencies for CDF/D0 as well, but want CDF/D0 to start using the prototype in earnest before we re-design it on our own. Options are:
a) Enhance current FBS architecture
b) Use LSF exclusively if we can get the LSF license cost down and verify LSF's performance on a mock-up of a 300 node cluster
c) Strip LSF out of the FBS architecture and write our own scheduler.
Expect CDF/D0 evaluation to be done by February 19th. Scratch disk allocation in particular is an area that we need to architect once we understand what CDF/D0 need. We would like this allocation scheme to replace what we currently use on the FNALU batch system as well.
Note, monitoring software under Linux apparently requires the Linux 2.2 kernel under Redhat to work properly.
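The efficiency figures quoted in this report (around 50% for E871 today, above 80% as the target under 1.2.1) are not defined here. As a minimal sketch, assuming efficiency means CPU time consumed divided by the CPU time available on the allocated processors, the comparison could be computed as below. The function names and the sample numbers are hypothetical and are not taken from FBS or LSF.

```python
def farm_efficiency(cpu_seconds, wall_seconds, n_cpus):
    """Assumed definition: CPU time used / (wall-clock time * CPUs allocated)."""
    if wall_seconds <= 0 or n_cpus <= 0:
        raise ValueError("wall_seconds and n_cpus must be positive")
    return cpu_seconds / (wall_seconds * n_cpus)


def meets_target(efficiency, target=0.80):
    """True when the measured efficiency reaches the target (80% by default)."""
    return efficiency >= target


if __name__ == "__main__":
    # Illustrative inputs only: 14 CPUs busy for half of a one-hour window.
    eff = farm_efficiency(cpu_seconds=25200, wall_seconds=3600, n_cpus=14)
    print(f"efficiency = {eff:.0%}, meets 80% target: {meets_target(eff)}")
```

If efficiency is instead measured per job or against data delivered, only `farm_efficiency` would need to change.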
1.4.1.1 Provide (extended) batch system for scheduling and controlling jobs
1.4.1.2 Provide Scratch Disk Allocation
1.4.1.3 Provide Processor Allocation
1.4.1.4 Provide ability to track job history & system usage
1.4.1.5 Provide software for monitoring of system
1.4.1.6 Provide (extended) batch system for scheduling and controlling jobs
1.4.2 Provide development & test system
Status: A 14 worker node prototype was delivered to CDF and D0 in November.
It was about 5 weeks later than desired:
a) 4 weeks due to SCSI vs EIDE disk, panel lights and cooling problems with the 18 systems delivered.
b) 1 week due to system administration problems and staff availability.
1.4.2.1 Procurement and Delivery
1.4.2.2 Installation of OS and Products
1.4.2.3 Ongoing Operation and Support during development
1.5 Batch system
1.5.1 Purchase, Maintain and Support LSF
Status: Some current key points are:
a) We have been evaluating LSF 3.2, which is the first LSF release that supports Linux. So far, there are no known problems, though the licensing changed, which is always a nuisance.
b) We held a meeting on December 15th with Platform Computing.
Overall, the meeting was quite a productive exchange of ideas. Platform Computing seems willing to negotiate the cost to some degree, but don't expect anything like what CERN got, because Platform claims not to have made any profit on it.
c) Currently, there is no known reason that LSF will not be the commercial batch package of choice for Run II.
d) In December, a Run II Batch Software Working Group was formed to document requirements/features for farms and analysis systems alike.
e) By mid-March (when the Run II Batch Software Working Group report is complete) we should have a good idea as to how many licenses would be required for the farms as well as CDF/D0 analysis systems.
1.6 Construct Extensions for Data Center Services
1.6.1 Operators Interfaces to TapeDrives
Status: This includes totally revamping OCS so that it not only serves Run II, but also any other systems currently using OCS in FCC (FT'97, FT'99, FNALU, ACPMAPS, System backups, etc.) The plan is to:
a) Replace the existing DBM database with ORACLE now that we have a site-wide ORACLE license.
(Note, this means that other institutions that want/need OCS to use higher level CDF/D0 software would have to deal with the ORACLE issues)
b) Support Linux
c) The concept of using tape drives in a networked fashion would go away except for viewing statistics and other reports. This would make the software easier to install and more stable.
d) Existing user features would be quite similar, except for the concept of tape drive groups, which is flexible to the point that almost no one understands it.
e) Direct interface to the drive (e.g. to gather statistics) would go through FTT. Thus, OCS would be functionally closely coupled to FTT in this regard.
f) Statistics on tape drive use could be gathered into a central database.
g) Statistics on Run II tapes managed by enstore and other software could be provided. Enstore and such other software would use the COMMON interface that OCS provides.
h) Better integration between the existing tapes database and OCS. (OCS will NOT replace the existing tapes database)
i) Better centralized control for operations (e.g. right now they need to have ~10 little independent screens where one would be far better.)
j) Interface to deal with robotics from the operator's perspective, e.g. loading a stacker or the once-a-day Central MSS loading/removal of tapes.
Hope to start on this in mid-February.
1.6.2 Centralized Accounting
Status: The special software needed to provide centralized accounting for Linux is complete and integrated with the SGI, AIX, SunOS and OSF1 reports. Note, this means that Linux clients need a portion of this special software, and we are working to get it incorporated into the Fermilab Redhat release. We plan to revamp the existing central accounting tools so that they are more easily managed, are packaged as a FUE product, and provide centralized/graphic reports.
1.6.3 Management Reporting Tools
Status: (I believe this to be redundant with item 1.6.4)
1.6.4 Centralized System Management Tools
Status: We are working with the patrol product (from DESY/SLAC?) to be used for system status and some automatic system recovery. We want to combine this with some of the xfalive features, which would be web based. We have a local sysmon product (e.g. like xcpsmon) that will run under Linux and is packaged with the farms batch software. This is Tk based rather than Motif. As mentioned before, it needs the Linux 2.2 kernel for proper results.
1.6.4.1 Central Reporting Screen
1.6.4.2 Recommended Compliance Interface for subsystems
1.7 Documentation, Operational Delivery
Status: No status beyond the fact that the documentation for the Run II Prototype Farms Batch System has been progressing quite well. (I estimate it is about 75% done, but keep in mind this is for the prototype and may not carry through to the real Run II.)
1.7.1 D0 Mock Data Challenge 1
1.7.2 D0 Mock Data Challenge 2
1.7.3 CDF Mock Data Challenge 1999
1.7.4 Preparation for Operations - acceptance testing
================ Current and Project Effort from FCS Group =================
Enclosed is the CURRENT EFFORT my group has been putting towards the CD categories. They are a little different than what one sees reported in the division reports because folks have been putting most of the time they have spent working with E871 using the Run II batch prototype under Run II rather than Fixed Target. I've asked them in future reports to put ANY E871 related work under Fixed Target.
I have also enclosed the ESTIMATED FUTURE EFFORT my group will be putting towards CD categories. I believe these estimates are rather optimistic towards what we can spend on Run II.

Current Percent Effort

Effort Category             TJ   GS   MS   JF   TL   IM   MB    FTE
-------------------------------------------------------------------
ACPMAPS Mnt & Oper          15    5   10        20   20         .70
ACPMAPS Dev                            5             20         .25
Fixed Target Mnt & Oper     30   55   15   20   10             1.30
Fixed Target Dev            20        10   30   20   20   35   1.35
Run II Dev                  20   25   25   35   35   25   50   2.15
Dept Adm & Mgt                        20                        .20
General                     15   15   15   15   15   15   15   1.05

Estimated Future Percent Effort

Effort Category             TJ   GS   MS   JF   TL   IM   MB    FTE
-------------------------------------------------------------------
ACPMAPS Mnt & Oper          15   15    5        15   15         .65
ACPMAPS Dev                            5   10   10   20         .45
Fixed Target Mnt & Oper     30   50   10    5   15        15   1.25
Run II Dev                  40   20   45   70   45   50   70   3.40
Dept Adm & Mgt                        20                        .20
General                     15   15   15   15   15   15   15   1.05
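
As a cross-check on the effort figures, the FTE column appears to be the sum of each category's percentages divided by 100, and each person's percentages appear to total 100. The sketch below, written under that assumption, recomputes both for the Estimated Future Percent Effort figures. The helper names `fte` and `person_totals` are illustrative only.

```python
# Percent effort per person, copied from the estimated future effort figures.
# None marks a blank cell.
PEOPLE = ["TJ", "GS", "MS", "JF", "TL", "IM", "MB"]

FUTURE_EFFORT = {
    "ACPMAPS Mnt & Oper":      [15, 15, 5, None, 15, 15, None],
    "ACPMAPS Dev":             [None, None, 5, 10, 10, 20, None],
    "Fixed Target Mnt & Oper": [30, 50, 10, 5, 15, None, 15],
    "Run II Dev":              [40, 20, 45, 70, 45, 50, 70],
    "Dept Adm & Mgt":          [None, None, 20, None, None, None, None],
    "General":                 [15, 15, 15, 15, 15, 15, 15],
}


def fte(row):
    """Assumed rule: FTE = sum of the row's percentages / 100."""
    return sum(p for p in row if p is not None) / 100


def person_totals(table):
    """Total percent effort per person across all categories."""
    totals = dict.fromkeys(PEOPLE, 0)
    for row in table.values():
        for person, pct in zip(PEOPLE, row):
            totals[person] += pct or 0
    return totals


if __name__ == "__main__":
    for category, row in FUTURE_EFFORT.items():
        print(f"{category:<24} {fte(row):5.2f}")
    print(person_totals(FUTURE_EFFORT))
```

The same two functions can be applied to the current effort figures by swapping in that table's rows.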