Monday, October 5, 2009

Looking Up Data in P2P Systems

These days, peer-to-peer systems are a staple for people who are constantly scouring the Internet for files that serve their interests (songs, movies, TV shows, EBooks, etc). They have become a universal tool for sharing and grabbing files, legal or otherwise. Their popularity is driven not only by their convenience, but also of their notorious capability to source some of the hardest to find data, like MP3s that one may not be able download even through the Web itself.

There are more to P2P systems, though, than what they are famous (or infamous) for. Taken for granted is the fact that they are also a product of rigorous research and development. The short article titled “Looking Up Data in P2P systems”[1] presents this other side of P2P systems, which is more academic in nature. Its focus is the problem that is at the heart of P2P systems: the lookup problem – how to search for data across distributed and decentralized systems in a scalable manner.

Previous Approaches to the Lookup Problem
The article presented some previous approaches to the lookup problem.

First off is the approach used by the once popular Napster. It made use of a central database for mapping file names to the location of the servers that store them. The problem here is that, if the central database fails, the entire system will also fail.

One approach that uses hierarchy is the name lookups for the Internet’s Domain Name System (DNS). Searching is done by the traversing the path from the root up to the node that contains the target data. This approach, though, also suffers from the similar problem faced by Napster. If the root or any of the nodes positioned at the higher levels of the hierarchy fails, then a large part of the system will also probably fail. These higher level nodes also carry more load than the leaf nodes.

Because of the problems associated with hierarchal structures, the article noted that some P2P systems went on to use symmetric lookup algorithms. In this kind of structure, each node carries the same amount of burden, which mitigates the effect of losing some nodes due to failure.

An example of this scheme is the forwarding of broadcast messages until a request reaches its destination. This approach, though, is costly and doesn’t scale well. Gnutella uses an approach similar to this, but with provisions to avoid request loops.

FastTrack’s P2P platform included “superpeers” in a hierarchal structure to handle the scaling problem. But then again, adaptability to failures is compromised just like in the case of DNS, aside from the fact that successful retrieval is not guaranteed. KaZaA is one of the applications that popularized this platform.

Freenet also forwards request node per node until the target object is reached. Its searching is based on unstructured routing tables that are created through caching. It avoids mapping of documents to a predictable server to provide anonymity. This approach, however, may cause unpopular documents to be left unassigned to any server. Searching may be costly also, and successful results are not guaranteed.

Distributed Hash Table (DHT)
As an alternative to structured and symmetric schemes, some of the recent P2P algorithms (e.g. CAN, Chord, Kademlia, Pastry, Tapestry, and Viceroy) use a combination of both these schemes, resulting to a system that offers guarantees, while not being vulnerable to some node failures.

These new P2P systems offer a simple approach for the lookup problem through the use of Distributed Hash Tables (DHTs).

DHTs use only a single function: lookup(key). This returns the network address that is mapped to the given key. Mapping is made possible through the use of a base hash function such as SHA-1. Since the result of using a hash function is unique for a given input, then mapping to network locations are consistent.

Here are some issues that have to be addressed when implementing DHTs to lookup algorithms:

Mapping keys to nodes in a load-balanced way. Keys are stored to one or two nodes whose IDs are “close” to the key, based from the given ID space.

Forwarding a lookup for a key to an appropriate node. Nodes must be able to forward queries to nodes that are “closer” to the key identifier being searched.

Distance function. Schemes that make use of DHTs must have a definition for “closeness” so requests can be forwarded until they reach their destination.

Building routing tables adaptively. Nodes must be able to update their routing tables during nodes joins and failures to maintain correct forwarding of messages.

Routing in One Dimension
One of the things that make the various algorithms that use DHTs different from one another is the structure that they implement to allow O(logN) lookups.

For one dimensional structures, examples are Chord, Kademlia, Pastry, Tapestry, and Viceroy. Chord uses a skiplist (or ring) like data structure. Kademlia, Pastry, and Tapestry have nodes that maintain a tree-like data structure. Viceroy uses a butterfly data structure.

Chord: Skiplist-like routing.
(Please refer to the succeeding blog for a more detailed discussion about Chord.)

Tree-like Routing
Nodes in algorithms that use tree-like structures maintain information about nodes under specific prefixes and their location. Pastry, Tapestry (please refer to the succeeding blog for a brief discussion about Tapestry), and Kademlia uses this kind of structure.

Pastry, for instance, provides its nodes with a randomly chosen ID that determines their position on an identifier circle. Each node maintains a set of nodes that are closest to it (called a leaf set). To search for a key, nodes are checked if the requested keys are under their leaf set. If not, messages are routed to the node with the ID that is numerically closest to the key being requested.

Routing in Multiple Dimensions
An example of this approach is CAN (please refer to the succeeding blog for a brief discussion about CAN).

Analysis and Feedback
The article that was discussed here provided a clear overview of the lookup problem in P2P systems. By citing previous approaches to this problem, including their shortcomings, it emphasized the relevance of addressing this issue. Indeed, with P2P systems’ growing user base, there’s no arguing that the feature that makes these systems appealing to mass consumers should be given its due attention.

An indication that the lookup problem is now being given a closer look is the recent wave of P2P lookup algorithms that were presented by the article. An interesting feature of these new approaches is their possession of both structure and symmetry. Even more interesting is the fact that at the core of their implementation is a simple concept called the DHT.

For me, I find DHTs an appealing tool to the said problem because it seems like the ideal abstraction for situations that concern the mapping of unique identities, such as in the case of assigning a data item to a specific network resource. This makes searching determinate: given a query, either you know exactly where to find it or you know that the query will fail.

Of course, the use of DHTs is definitely not the only way to solve the lookup problem. There might be more complex structures out there that can achieve the same thing. But with its simplicity, it seems like a perfect match for a problem that, conceptually speaking, is simple too.

References:
[1] Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Looking Up Data in P2P Systems. Communications of the ACM, Vol. 46, No. 2. February 2003.

No comments:

Post a Comment