HECURA 2007 Mid-Year Review Notes

Day 1: 02/14/2007

Gary Grider from Los Alamos National Lab Introduction

Government decided to fund R&D in HPC, particularly in filesystems.
Fill gaps, avoid duplicate efforts. 
Universities need more data and access to resources to conduct research.
Funded 19 proposals, plus four more.  23/62 proposals have been funded, 4/3 or better at full price.  Over 13million $ total.  3 year projects.  Focus on gap areas.
INCITE program helped provide computing resources to universities.  Reviewing standards for POSIX I?O API to add functionality for clusters and pNFS standard.  Getting more involved in education curricula to focus HEC needs. 

Categories of needed focus:  Metadata, Measurement and Understanding, QOS, Security, Next gen. I/O architectures, FS related communications and protocols, Management and RAS, Archive, Assisting Research.
List of funded proposals.

NSF CPA call this year is another way to fund filesystems and I/O, expecting about 6-8 to be funded, winners will be announced at a later date.
2 filesystems & I/O were funded through SCIDAC2.

PetaScale Data Storage Institute, tutorials at supercomputing.  5 year activity, Garth Gibson lead PI.

Scientific data management center for enabling technologies (CET) about 15 million, PI Arie Shoshani.

DOD NSA ACS Program, fund R&D projects for NSA needs, there is a fs and I?O component to this program. 2-300,000 this year for prototyping, more next year.  3rd year cherry pick.  At that point if there are good projects they will be funded for product development.  Gary Grider chairing for DOD.  Sabbatical’s for Maryland, industrial or university people can apply.  BAA’s and Grants also, funding directly through NSA.  Pre-product research more than publications.

We want to coordinate and sponsor R&D and enable more/better R&D.

DATA/Code/Information Availability, LANL made available data for the last 9 years.  Size, installation, memory, decommission date (if applicable), etc. And failure data.  Interrupts that any machine experienced and what LANL thought caused it.  Free data available to anyone.  There’s also usage data.  Submitted job, how many procs requested, how much time, etc.  Other people started provided data to, LLNL paper for bluegene, HP labs, library of congress, Pittsburgh supercomputing center, etc.  Efforts to put out traces for codes, LLNL, UofChicago did some, Sandia also.  Efforts for CFDR (Computer Failure Data Repository).  Vendors are reticent to put out this data.

Classroom efforts: Institute at LANL, SciDAC PDSI.


Scott Brandt from UC Santa Cruz

“End-to-end performance management for large, distributed storage”
Progress Report

Work in real time systems, work to combine the 2 areas of research into one project.

Project Management:
Started October 2006
Budget: $1million for 3 years.

Key Personnel:
Collaboration with IBM
3 students.

Improve end-to-end performance management in large clustered storage. 
From client, through server, to disk.  Control performance, guarantee/ensure service level, isolate traffic.
4 Stages in the I/O path:
Disk traffic, storage/server cache, flow control across network, management of client cache.

Goals for each stage:
Disk Traffic – universal I/O scheduler, any application with any time requirement can be guaranteed time.  Isolation between different application sessions.
Server cache management - integrate cache management with I/O scheduling.  Shape low-level traffic using cache.  Allocate cache to match performance needs.  Results: investigating problem.
Network flow control - Goals, manage data flow across client-server network connection, storage flow control separate from transport flow control, about ensuring endpoint resource availability.
Client Cache management –

Research Plan, structured around 4 stages in I/O path.  Disk traffic, server cache, network, client cache.

Research Progress:
Utilization as an efficient guaranteeable Metric of Disk – paper submitted to USENIX
Utilization-based disk scheduling
Storage systems serve mixed workloads with performance requirements:
General purpose, responsiveness, multimedia, soft deadlines, background, non-interference, Transactions, high performance.
Goals: guaranteed performance, isolation, disk performance
Utilization captures application behavior (random vs sequential)

Timely Disk I/O: requests are non-preemptible, execution times are stateful, execution times are partially non-deterministic, best, average, and worst case execution times can vary by orders of magnitude.  0.1 to 50 milliseconds.

Achieving guarantees: guaranteed performance, isolation from other workloads.  Usual approaches, physical or temporal partitioning, costly.  Overprovisioning, costly and only partially effective.  Reserving bandwidth or I/O rate, limits reservable performance.

Limitations of bandwidth reservation.  Lacks knowledge of application workloads, I/O patterns, most use worst-case assumptions of disk request times, result, very low reservable bandwidth.  Everything else is managed as slack, unreserved resources.

Utilization as a common metric: Solution, use utilization for resource reservation and request scheduling.  Easy reservable, accountable, guaranteeable.  Universally commersurable, independent of application workload or behavior.  Efficient and effective, leveraging application efficiencies.

Flow of I/Os through the scheduler diagram.
If application doesn’t use time slice then it’s slack and you give it to someone else.
Controlling performance via throughput, random stream low straight line, lots of seeking.  Sequential line higher up and more varied.
Controlling performance via throughput II diagram, Efficiency of Utilization and throughput diagram.  It is a real disk that they built.  2 logical block address that are far away is equivalent to random I/O, if they are together it is sequential.

Throughput is not robust., high sensitivity to errors.  Utilization conclusions.  Many applications require I/O performance guarantees.  Throughput is natural, but requires worst-case assumptions that give very low reservable throughput and poor control.  Utilization is more effective, natural way to express performance requirements, captures application behavior, easily reservable, easily accountable.
Good results so far.

Efficient QoS-aware Disk Scheduling.  (in preparation) – part 2.
Want to make arbitrary reservations, need 50% of disk for next x seconds, etc.
Goals: mixed hard, soft, and non-real time workloads.  …..
Background: universal CPU scheduling.  Computer systems growing in complexity, want different timelines, ….
Resource allocation and dispatching, how much and when.  Most schedulers use one thing that blurs the line.  If you control separately you have more control, this is what RBED does.  Rate, what % of cpu.  Period, for how long.  Can dynamically adjust.
Fahrrad: RAD based I/O scheduling.  Utiization-based reservation, with deadlines, requests put in queues, Micro-deadlines assigned based on target rate and worst-case assumptions.  Requests released to Disk Scheduling Set (DSS) based on miro-deadline, requests scheduled for service from DSS, Micro-deadlines updated based on actual service times.
Fahrrad architecture diagram.  Deadlines assigned to each request, di = di+1 +WCET / U
Release to DSS diagram
A few details: Empty slots, must accommodate varying request patterns – can’t assume all requests arrive early ….
EDF – earliest deadline first DSS scheduling.  Issue requests by deadline, increases sequentially …
Hard real time requests were causing other stuff not to fill the area, shortening length of queue and hurting performance.  Slot swapping, can take different amounts from various streams as long as everyone gets their time.  All done in device driver so there are only  few requests on disk at a time.  Fahrrad Conclusions, Universal I/O scheduler that provides…Excellent isolation between processes, good performance.  Next steps, performance tuning and parameter study.
End-to-end performance management, disk scheduling well in hand, utilization-based disk scheduling, new universal I/O scheduler.  Client cache research started.  Networking and server cache future work – this is where distributed applications come in.  How does a distributed application issue a single request?  Will look at in next few years.  Thinking about putting together with CPU QoS.

Is utilization a reasonable way for application designers to specify needs?
Definitions in terms of throughput need to be converted to utilization anyway in order to determine if the need can be realistically met

What about when an application doesn't use its time?
That becomes slack that will be reallocated to someone else

How much do these scheduling algorithms depend on the underlying disk?
Not at all.  The only thing that matters is that a seek takes a certain amount of time.

How is it natural to specify utilization?
When you say I want 10mb/sec, you can't say that in the abstract. You have to know whether that is random or sequential and then it is easy to convert into a utilization.

Random or sequential on disk or from application?
On disk.  But you can profile to determine it. This is useful for sequential workloads when the worst case actually is much worse than actual needs.

Mapping application requirements to disk utilization is well known.  Products have done it for 10 years. The question is: when you do that, do you have enough control to make better guarantees.

Part 2. Similarities to CPU scheduling work done by UCSC
Isn't context switching radically different?
Yes, now it is seeking and we can't control it as we can in CPU scheduling.

What do you mean by efficient?
By efficient, we mean that the disk is working in an efficient way.  For example, if two workloads are sequential, we don't want to interleave requests in such a way that we would require a seek after every request because then it would be a random workload to the disk.

What if you have an application with low throughput and low latency requirements?  How do you give it the small latency?
The deadline is separated from the rate.  You specify both requirements.

The tradeoff of having a longer period is that you have less control.

Where does application specification occur for parallel apps?
As we move to disk level, beyond single server, we will look at that.

Adaptive based on application behavior (rather than fixed requirement specifications)?
This is done for CPU scheduling so this is likely to be future work for disk scheduling.



Dr. Hong Jiang from University of Nebraska

“SAM2 Toolkit: scalable and adaptive metadata management for high-end computing”
Status Report

Research plan:
Develop multi-variable forecasting model, file name mapping, locality aware metadata grouping schemes

Grant Management:
Subcontract was established in Dec. 2006 for Jun Wang at UCF.  Equipment funded from Sun ($64k)

Implemented a prototype of our HBA distributed metadata management scheme (Cluster04, ICPP06).  Preliminary results promising.  Graphs showing average latency and aggregate throughput.
Extension of HBA scheme, DCBF, for name space management through distributed and counting BFs that exploit metadata access locality with MDS grouping (work in progress).  R-tree with Bloom Filters (RBF) A new storage Structure for space efficient queries for multidimensional metadata in OSS (Work in Progress FAST07).  A test framework has been set up to gather information about metadata access behaviors in dCache.
Improving metadata reliability through popularity and locality based reconstruction schemes (work in progress; preliminary work on file data (read most) to be presented at FAST07).  Extending our Nexus metadata grouping scheme (CCGrid06) by using data mining techniques (wip).  Trying to identify appropriate interdisciplinary use of statistical models to analyze the metadata access patterns of both local file systems and cluster-based file systems.

Impediments to Success: 
One major one is we have limited long-term I/O traces of large scientific applications.  Long-term meaning length of run.  Still looking for representative cluster based file systems (unsure if PVFS is sufficiently representative) and real-world applications (we have Flash IO, CMS tier 2 site in UNL).  Hasn’t done object based system – wouldn’t matter for metadata?  The activity is similar.  I/O traces won’t give you this information, but it would give opens. 

What do you mean by long term?
Jobs that run for days or weeks.

Publishing and Presentations:
CCGrid06, ICPP06, FAST07, IEEE.

funding came late, missed 06 recruiting season.  Have 41/2, trying to get 2 more.


Day 2: 02/15/2007

John Chandy, University of Connecticut

“Active Storage Networks”
Progress Report

Motivated by active disks – only has local view of the data
Network has a global view all data

ASN Application operations – offload work to the network
Reduction operations, database queries, scientific applications.
Transformational operations – sorting, scientific applications, stream-based.
File System caching

Most of work in Redundancy optimizations; RAID 5 operations in switch.  Update parity, read from both.  Can eliminate some traffic by doing calculation in the switch.  Implement raid in FS, can do data splitting and striping from within switch as well.

Network Structure – bus, crossbar, butterfly, Benes
Switch implementation: each node compute node; FPGA. 
Research Goals

Worked on this for past 3 months.  Started initial design on switch.
Apply RAID 5 mechanism at target level for object-based storage file systems. 
Decided to use Lustre for this work – take pages that are coming from Linux VFS, generate parity, and split over OSD.  Lustre doesn’t do this today, they are talking internally about it but haven’t. 
On drawback is that Lustre is very tied to Linux VFS.  There is a  mismatch between Lustre’s large objects and Linux 4K page size. 
Small file issues and locking and consistency of data.  One object, 5 pages long, with current VFS you have to wait until all pages are ready to generate first parity, wastes lots of time and space.  Have to reorganize how Lustre orders the data.
What if the data your waiting on to compute parity never comes?
Have to pass it from VFS. 
Why hold it all? 
Lustre won’t send pages in the order needed, it sends one page and then the next. 
You could just accumulate the parity, i.e. carrying parity through and calculate xor between available blocks, then you no longer need all blocks.
The problem is in terms of figuring out how much data is coming down – could do on the fly. 
If you don’t get enough data what do you do? 
That’s why they need information about how long the write is – VFS splits it into pages.  Initially thought XORS in the ASN.  Still doing evaluation of what should be in ASN and what should stay in client.  If they didn’t give a full stripe it would still be the same problem – the ASN would split it up. 
ASN would have to have almost a nearly full blown client?
Wouldn’t have to be full blown, the ones their looking at have Linux – it’s almost like a proxy.  FPGA lower cost lots of implementation in hardware so don’t need as fast of a processor.
Do you update the object’s parity location? 
At file creation time, you specify the parity. File is on x object and then you just add one more for parity. 
Enabling scalable rebuild and other things that RAID can’t do.  Per file optimization, you can pick raid level for different needs.  No error cost path analysis has been done yet.  So far the RAID implementation is almost done – getting all the paths covered is tricky. 
Could be combining from multiple clients – that’s the plan.  Set some kind of block size on which to calculate parity, know offset in file is x – will have to pass that down.  Really is kind of proxy.  If you have multiple clients doing parity calculations they would have to do read update. 
Metadata caching and optimizations: the idea of effects for small file operations.  Most of cpu load in lookups, hardware is very good at string matching not even using embedded processor. Higher CPU load from many small file operations compared to few large files

NS2 Simulations – NS2 doesn’t have switch module, started work on one.  Haven’t done any routing but preliminary results are from very simple model.  Final piece is the actual hardware implementation.  Xilinx XUP2VPBoard doesn’t have gigabit Ethernet, students adapting SATA port to do Ethernet. 

Need applications to characterize that would exercise RAID. Need large sets of data, immutable to change. Could use synthetic that is close to real application.  Reduction for sure.  Applications are better than traces since you could modify the code.
Lustre has a huge learning curve. Considering using PVFS since it is in user space; easier to work in.

Seven students working on the project. 

Go back to network diagram. If we think of it as a raid controller with cache, suppose there is a fault in the network and there is data in the buffer cache. The nodes hold cache what do you do to mitigate failure? 
It's all write data so must be pushed through. Perhaps use mirroring for fault tolerance.
Research wasn’t planning to look at fault mitigation but could think about it. Clients hang onto partial pages until it’s been calculated. 

Does Lustre perform write through or write back? 
Thinking it is write through. 

Lots of opportunities for speed up, using (small) memory on nodes. Cache management and policies will be challenge. 

Cluster should have been there in November, still waiting for it. 
How amenable is CFS to taking back some of information and integrating it? 
Talked to Peter Brahm about it and he seems to be amenable. 

Does the network need to be fully connected?
A high performance switch provides it. Interior of network is rich in terms of connectivity. That’s what it had to be to be comparable to HPC switch. 

How is data routed through network?
Depends on traffic that’s already there.  Based on contention.  Parity is computed at each stage, you’re building a tree out from the compute node and building parity as you go through, then eventually route back to the appropriate storage node.  Structure provides a lot of opportunities – is it more than needed – need to think about. 
Request for graph analysis.
Each node of graph would be fpga, would like to see weight, delta, to see what highly congested network would look like. 8x8 graph, distance graph (edge between any two nodes), show how the latency depends on the route to the nodes


Kai Shen, Univeristy of Rochester

“Concurrent I/O Management for cluster-based I/O storage”
Progress Report

Goal: Develop techniques to analyze performance problems and identify causes. Focus on concurrent I/O and target commodity hardware and open source systems (pvfs2, Lustre); need access to code to find problems. 

Research progress in the following areas:
1. Systematic performance problem analysis
2. OS support for concurrent I/O

Computer servers running MPI, storage servers run PFS. So, there are multiple layers to analyze. System has deep I/O stack. 
Wanted realistic applications with broad range of coverage. Started with about 10 applications and ended up using 4; ior_mpiio, mpi-tile-io, NPB3.2IO-MPI, and Mandelbrot-par. 
Example of IO Read Throughput
ior_mpiio test on purple – smaller block size (256 KB) didn’t perform well, but is good for large block size.  Granularity of application, stripe size is always the same. 
Does this test write to one file or one file per process?
Number of files depends on the application. 

I/O Trace Collection
There are a fixed set of events that are traced at every level.
Small application I/O degraded performance. 

Results of Trace Analysis:
- anticipatory I/O scheduling
Queues on storage server should be long enough – storage system can’t take all the requests at once.  Problem is multiple small contiguous requests from one process, I/O scheduler only sees one request at a time.  Multiple remote mpi processes.

Isn’t that just bad code? 
Yes, many codes do that. 
Problem is choice of scheduler?  The technique handles it but with parallel I/O the scheme doesn’t work that well anymore. 
- slow return of reads that hit cache is an issue in pvfs
To fix it simply you would have multiple threads as proxies of remote processes.  You could change the anticipatory scheduler but not trying to.  Didn’t put mmap model in PVFS2. 
- client write calls are slow because pvfs has server side buffer
I/O hits cache but takes a long time to return. Long delay in write call returns at the client side.  PVFS has no client-side buffering. 
Question: What is meant by long? 
Answer: About 100 microseconds. 
Put in some rough fixes, threads for each additional I/O to force concurrency, more aggressive prefetching technique in OS, using anticipatory scheduler (works well up to 2.6.17).  Worked for about ½ a year.  Need to collect more events and get more automatic way to tell if they are important.
Concurrent sequential I/O.  Looking at enhancing realism of disk simulator so students can look at I/O and performance issues. 
Is event tracing at filesystem level or in MPI? 
In filesystem now, but could do either. Choice of tracing layers depend on system architecture. 
How do you make the layers for trace collection generic?
These layers are specific to pvfs.
What is reusable?
The small set of events we are tracing are really general. The process is what’s really new; the discovery for how anyone can do this.  Right now we are tracing as many events as it can find, and then we’ll look at which ones to focus on. Current thread of research is to trace many events in the traces and compare between correct systems and wrong systems -> this should show source of problems and what should be traced to indicate a problem.

Small project - 1 PI, 2 students. 

The interesting part is the techniques that can be applied to other operating systems, not just PVFS.  The process of determining problems is more interesting than the specific results.


Day 3: 02/16/2007

Greg Ganger for Priya Narasimhan, Carnegie Mellon University

“Toward automated problem analysis of large-scale storage”
Status report

Automate problem diagnosis and use to predict failures. Identify which part of the system has a problem and what the root cause is.

Determining root cause from problems is difficult:
Can have multiple causes with a single manifestation
Can have single causes for a multiple manifestation

Need different kinds of tools for different problems. 

When a request arrives at storage system there are breadcrumbs spread throughout the system and when the request hits the crumbs they record things.  They also help figure out the demand on various resources. 
This is not free; have to justify instrumentation in terms of cost, increases more than linearly with scale, there’s overhead for having them there. 

Create request flow graph and then use machine learning to get common usage.  Cluster requests and look at when things were good vs. when they were bad. 
Input is simple tables where machine learning techniques are easy to apply
Systems - complicated constructs that aren't easy to map into input to ml techniques
Example usages:
Sample usage of resource X periodically and mine for anomalies in the time series data
Watch for changes in measured characteristics
Cluster the requests -- how do they change between no problems and problems

Early Exploratory Probes:

  1. Fingerpointing and predicting failures

Find out which one went bad first and try to predict what will go bad in the future.  Injecting faults and using instrumentation – periodic sampling of resource information from the system - /proc, libpcap, etc.  The statistics usually change in the same manner effectively giving most failure types a signature. 
Some experience using replicated servers – inject faults (/thread leaks, packet losses, babbling, race conditions)
Example: Inject a memory leak in primary server; when that starts, free memory goes down over time until it crashes
             Machine learning tools show when the two cases diverge; compare server with the fault induced to the expected situation when there is no memory leak)
Fault manifestations "travel" across nodes
             All servers, primary and backup, may slow down
             If you look at the wrong metric, all nodes look the same (ex: packets/sec with a memory leak)
Collecting certain metrics can make the problem obvious?
Yes, if you capture at all points, you can find the problem, but we are looking at only certain points and sampling to reduce overhead of instrumentation
 Need information inside components and not just between
See tech report and a publication in SysML 2007 on this work.

  1. Fingerpointing with logged error information

Some early experience with Emulab (Network Emulation Testbed). Lots of errors reported (80 emails/day). Want to automatically analyze them and prioritize to fix, but logs must be structured to mine them.   
Can cluster them based on which part of code and which set of machines are affected
See publication in WORLDS06 (Kasick and Narasimhan)

  1. Performance Debugging

Method: assist user with performance debugging by diagnosing slowdowns. 
First try: fake problem where we assume all cache requests are misses
Second: real problems from CVS logs
Convert request traces into directed graphs of requests with elapsed time on edges.  Cluster sets of requests to find common flow path and anomalies and look at what groups are doing.  Compare cluster results before and after a problem is detected. Clusters that only appear in one or the other are potential sources of the problem. Can see when requests are taking longer in one case or another
This is still early work
Example: The OS for Linux kernel 2.6 is doing something slower that 2.4 did. 

Long-range goal: system that automatically predicts failures and fixes problems before they happen. This will require more instrumentation (start with a lot, then reduce to the smallest amount that is useful)

Machine learning research is needed as well to make them work (and be efficient). Working with ML people (Alice Zheng is a postdoc)

How do you tag requests? 
When it crosses the network or moves to another thread it is tagged. 
How will you handle false positive under normal workload changes?
There isn’t anything definitive to avoid false positives to account for workload.  If there are changes in workload and changes in system, see if they match
How large are the request flow graphs?
The request flow graph size vary a lot, a cache hit has 3 or 4.  100 edges is one of the bigger ones.  A lot of it depends on how big the system is configured – our test system is small. Will scale as the number of nodes working on the request scales
How will this early research move to your longer-term goals?
How do you know when it "works" or is done?
We know instrumentation and statistical tools will be involved. It is still early but #1 and #3 seem to be promising so far. It is hard to quantify goals for the progress of the project.  Need to get data on common errors to focus failure injections.

Are your Machine Learning tools and techniques code available?
k-mean clustering tool is online


Greg Ganger, Carnegie Mellon University

“Performance Insulation and Predictability for Shared Cluster Storage”
Status Report

Moving toward shared storage but there is the issue of interference hurting storage system performance:
 - For a disk,  context switches = seek are bad.
 - Cache is space shared component. Cache is not shared fairly; time-sharing when one is processing twice as fast.
 - Coordinate timing across nodes in cluster

From the Argon work presented earlier (at FAST07)

For single server, working with known techniques, but many problems remain with multiple servers, such as data striped over servers or work on network issues, i.e. David Nagle work on network (incast) problems and flow control

Look at network side and disk side (orthogonal issues)

Can R-value be used to help guide placement?
How can we exploit slack while still providing insulation?
Merge quanta when it's ok
How does R-value change when workloads change?
             Detect and adjust
Timescale -- workload changes very quickly (maybe timescale is too coarse)
Application sensitive to latency within certain bound.  If the system can help you predict you may be able to see if you can mix workloads.  No workloads share disk during quanta which has potential to waster disk time.  Want to mix them together by using the slack, use R value to determine if you can mix them.

QoS control system?
Efficiency is nice, but can we guarantee amount of work to be done

Goal: Control and predict shared storage

Funding: This project is only 1/3 funded by HECURA. 

This is very similar to what Scott Brandt at UCSC is doing. Could they collaborate?
Scott started at the top, real time tools need insulation, not already there. Richard Golding collaborating with both

To get to active disks, need to share and secure properly. Need to control another consumer of the resource (that disk).
How are the reservation demands made?
1000 node job, every node requests an amount of service. Seems to work better when consumers are asking for different amounts
Is each core processor making its own requests or does the job make a single request?
We are thinking one big job is one reservation; need to intelligently make that reservation
You could use a group requests so you can handle the single job reservation; group open. Pass job association along so you know nodes doing the open are related. They you can guarantee service for the job.


HEC FSIO Website designed and hosted by Los Alamos National Laboratory.
Email Contacts: Project Leaders :: Webmaster