Charge to Data Management Needs Assessment Group:
------------------------------------------------

Data management includes the hardware and software required to support all aspects of CDF and D0 computing, beginning from the moment that raw data becomes available at the downstream end of the DA to the moment at which it is delivered to the physicist for analysis.

The charge to this working group is:

1) to specify in very general terms the requirements -- data volumes, access speeds, latency requirements, storage for temporary data, CPU requirements, and permanent storage -- for all phases of the analysis. (These have been written down many times, but this is an opportunity to get a single consistent list and for the two collaborations to understand any differences between them);
2) to create a glossary of the terms used by the two collaborations to describe the datasets, processes, activities, and existing systems in this area, to facilitate communication. Of course, any agreement on standardization of terms is most welcome;
3) to specify the kind of software needed at each stage to catalog the data flow, to track all the datasets, and to access the storage; and
4) to the extent possible, to state the human resource needs (or, more likely, the practical limits of the resources available) to manage the system.

Among the issues that need to be considered are:

1) The need to 'fall back' and redo various aspects of the processing. How far back does one have to be able to go? What are the requirements on redoing the step (how fast, how much effort)?
2) What data must be accessible on disk? What data must be accessible robotically? At what rate (peak, average) does the data need to be accessed?
3) Where in the processing will streaming occur? Who are the customers for the streamed sets? How often will the streamed sets be accessed, and do they need to be retained indefinitely? If so, on what kind of storage?
4) Can the data be recorded directly to robotic storage over the network? What precautions must be taken to ensure that no data is lost due to network, robot, and software downtime?
5) What measures need to be taken to make sure that problems in the downstream portions of the processing pipeline do not hold up the upstream portions?
6) How will DST data be accessed and reduced? What is the access model (independent of the implementation of the data model) for analysis (highly central, highly distributed, workgroup)? How much 'private' storage is required for each physicist/analyst under the chosen model? How much data storage? How much IO bandwidth is needed per analyst?
7) What data (type, quantities) needs to be delivered to remote users? If the network alone will not suffice as the delivery medium, what other media are acceptable/required?
8) What software is required by the user to access the data at each level of storage?

Many other issues will undoubtedly occur to you and should be added to the list.

Because this is an area where there have been (economically) unrealistic expectations in the past, the working group should at least consider whether stated requirements have a chance of being translated into affordable systems, based on its understanding of current and projected costs.

We would expect one report or a series of reports defining the issues and requirements. A goal would be to have a preliminary report in two months.

Last revised February 4, 1996