Over the last thirteen years, my research has covered a wide range of topics related to the design and management of large, dependable computer systems. Throughout my career, I have been striving to solve real-world problems by applying theoretic and algorithmic principles of distributed computing in the design and development of actual systems.
For my PhD work, I focused on the design of novel process replication protocols for critical software systems that need to be highly available. Prior art in the field required that service clients be members of the same broadcast group as the process replicas offering the service. As a result, such approaches had serious scalability problems with the achieved performance and the number of supported clients. My thesis proposed protocols that do not require the clients to be members of the replica group, while providing the same reliability and ordering properties as traditional approaches (e.g., ISIS). My research involved both theoretical work for the design and correctness proofs of the proposed protocols, as well as the experimental evaluation of protocol prototypes in a state-of-the-art middleware platform, Regis (a research system developed in the Distributed Software Engineering group at Imperial College). My SRDS-1995 paper describes the key idea of my thesis. The 1999 IEEE Transactions on Software Engineering paper provides more details about the theoretical aspects of the work with a detailed enumeration of design options. From a practical perspective, the incorporation of the protocols in Regis followed the novel idea of configurable protocol stacks. The approach (an extension of Ritchie's original concept of "streams") allowed for dynamic instantiation (at binding time) of a protocol stack consisting of just the necessary lightweight protocol modules according to high-level specifications, an idea that was concurrently explored by the Horus project at Cornell. The obvious advantage of this approach is better performance: one does not have to pay for functionality that is not required for a certain communication session. Another main advantage is the ability to argue for the correctness of relatively small and simple protocol modules, as opposed to huge monolithic stacks.
The latter problem motivated me to look into more rigorous approaches for arguing about the correctness of large systems that implement complex distributed algorithms. Making a formal correctness argument requires some verification technique that involves exhaustive exploration of the system's possible states. All such approaches have an inherent problem: any system of realistic size would result in an enormous number of states to explore ("state explosion"). I, thus, looked into the application of a new and promising approach that was very applicable to the modular protocol stacks I was designing for my thesis: Compositional Reachability Analysis. Some initial ideas on that topic were published in my EDOC-2000 paper (a case study using transactional workflow systems).
Since I joined HP Labs in May 2000, my research focus shifted further towards the design and management of large distributed systems, mostly enterprise-class storage systems. Initially, my focus was on cluster file systems aimed at meeting very high throughput goals as is typical in commercial data centers. The objective was to meet the requirements for scalability while being compatible with standard interfaces, namely NFS and CIFS. One of the main challenges in that context was how to maintain highly available (and thus distributed) and at the same time consistent metadata in the system. My SRDS-2001 paper discusses an efficient protocol for ensuring the consistency of a distributed namespace in the presence of multiple clients initiating concurrent namespace updates without centralized coordination.
The next step in this line of research was to explore an even bolder scenario: a wide-area distributed file system (designed in a completely decentralized way, a.k.a. peer-to-peer) that allows users distributed across the globe to share data (read/write sharing) using a traditional file system interface usable by any existing application. Even the feasibility of such a system was questionable when we started this work. Note that existing p2p file sharing systems allow read sharing and file addition/removal but not traditional read/write sharing with in-place writing. The resulting system, Pangaea, was based on two fundamental principles: 1) an aggressive data replication protocol that created a full replica of a file at each location where it was accessed from; 2) a novel optimistic consistency protocol that propagated updates in a lazy fashion that was aware of the heterogeneous nature of the different p2p connections among the nodes of the system. This research resulted in a publication at OSDI-2002 (the main systems conference alternating every second year with SOSP).
Pangaea followed a very aggressive approach for replica creation, assuming that storage capacity is cheap. This is true to a big extend. However, managed enterprise storage is still of non-negligible cost, especially capacity in high-end disk arrays and NAS appliances. This was the motivation to look into more sophisticated replica placement algorithms. This is an NP-complete problem and essentially my work involved a number of heuristics described in my IWQoS-2004 paper. The practical implications of these heuristics are described in the IDCS-2004 paper.
One of the goals for replica placement is to meet certain latency goals (given some budget constraints). The latter work motivated me to look deeper into the problem of enforcing performance goals in large systems (storage systems being one example) that are shared by multiple competing workloads (e.g., of different applications or different users). This is a major problem in today's data centers where computing services of multiple users or even multiple companies are consolidated in a common infrastructure for ease of management and lower cost (due to statistical multiplexing of the workloads). So, the last project I led while at HP Labs focused on the problem of effective and efficient performance differentiation. Trying to develop an approach that is more generally applicable, we assumed that the target infrastructure is a black box and that we have no a priori knowledge of the applications or of their workloads. We used a fair scheduler to enforce proportional sharing of the aggregate throughput of the shared service (e.g., cluster file system, 3-tier e-commerce site) among different users. The share of each user is dynamically adjusted using a controller (designed using adaptive control theory) in a way that maximizes the yield of the service resources given the Service Level Agreements (SLAs) specified for the different users. A paper introducing a controllable scheduler was published in Performance-2005 (a sister conference to Sigmetrics). Our work introduced the first adaptive control-theoretic controller in the literature that was applied to a software system. The systems perspectives of designing controllable computer systems were outlined in my HotOS-2005 paper. A more recent paper discussing all the details of the closed-loop design is currently under review.
In October 2005, I joined VMware as a senior engineer responsible to lead the company's advanced development efforts in the fields of Disaster Recovery and Continuous Data Protection. The emerging virtualization approaches for both processors and storage provide a unique opportunity for introducing services that can boost the dependability of computational services in non-disruptive way to legacy operating systems and applications. There are two challenges in this context, which I plan to investigate in the future: 1) develop techniques that offer increased dependability taking advantage of the decoupling between physical and virtual resources in the data center; 2) develop automated management approaches that can map high-level performance and dependability goals to the appropriate configuration and tuning of system services (e.g., data mirroring and placement, resource provisioning, virtual machine migration).