What is the PAPIA system?

The analysis of protein molecules and DNA sequences has come to depend heavily on computational power. Computational biology has become an active field of study because it is needed for understanding disease mechanisms, improving agricultural resources, and designing drugs and macromolecular materials. Molecular biology databases are growing rapidly (the GenBank DNA database grows about 10-fold every 4 years), and the complete genomes of several organisms will be sequenced within a few years. Even the human genome sequence will be almost fully determined by the year 2005. Analyzing these huge databases requires a vast number of calculations, but fortunately the typical cases offer large-scale parallelism.

We aim to demonstrate the power of parallel calculation technology by efficiently solving computational biology problems. For this purpose, we have developed the PAPIA (PArallel Protein Information Analysis) system.

The PAPIA system consists of three major elements: the PAPIA cluster, the PAPIA library, and the PAPIA applications.

PAPIA Cluster

The RWCP Tsukuba Research Center has been building workstation clusters and PC clusters since 1995 as environments for the research and development of parallel operating systems and parallel programming languages. The series of clusters was designed and implemented by Tezuka, Hori, and Ishikawa of the Parallel and Distributed System Software TRC Lab., RWCP.

The RWC PC Cluster II, developed in October 1997, is a highly efficient parallel computer with powerful network connections. The PAPIA system was first implemented on the RWC PC Cluster II and was demonstrated at the SC97 conference exhibition. In February 1998, we built a new application-dedicated cluster, the RWC PC Cluster IIa (the 'a' stands for application), also called the "PAPIA cluster".

The PAPIA Cluster (RWC PC Cluster IIa) is a 64-node PC cluster based on industrial PCs that are connected by Myricom Myrinet. Each node consists of a 200MHz Intel Pentium Pro microprocessor, 256MB memory, 4.1GB hard disk, and a Myrinet gigabit network interface.

The NetBSD operating system runs on each node, and the SCore-D global operating system runs on top of NetBSD. All of the user's parallel processes are created and gang-scheduled by SCore-D. The MPC++ parallel programming language and the MPI (MPICH-PM) message passing library are available for efficient parallel programming.

For more details, see "Clustering Technologies" by Parallel and Distributed System Software TRC Lab., RWCP.

PAPIA Cluster Specifications

Number of Nodes	64 cell nodes + 2 internal monitor nodes
Processor	Intel Pentium Pro microprocessor (200MHz, 8KB L1 cache, 512KB L2 cache)
Memory	256MB EDO DRAM (with ECC) / node
Hard Disk	4.1GB EIDE Disk / node
Network Hardware	Myricom Myrinet (2.56Gbit/sec), 100Base-T Ethernet
Network Driver	PM (developed by RWCP)
Local Operating System	NetBSD 1.2
Global Operating System	SCore-D (developed by RWCP)
Programming Language	MPC++ (Parallel C++ developed by RWCP), C, C++
Size	W 80cm x D 80cm x H 160cm x 2 Chassis

PAPIA Library and Applications

In order to rapidly and efficiently develop application software for protein analysis, it is important to develop a library of commonly used program modules. However, a common library for protein research did not exist. One reason might be the highly complex semantics and notoriously ill-defined format of the Protein Data Bank (PDB), the principal database of protein tertiary structures.

Onizuka et al. have developed a C++ class library, the "PAPIA library", for protein information analysis. In the PAPIA library, protein structures are described hierarchically with clear class definitions. Commonly required calculations, such as PDB parsing, geometric rotation, structure matching, sequence alignment, and multivariate analysis, are defined as methods of the corresponding classes. We use the PAPIA library for our original academic research in parallel protein information analysis.
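To illustrate what a hierarchical description of a protein structure can look like, here is a minimal sketch in Python. It is not the PAPIA library's C++ API: the class names (Protein, Chain, Residue, Atom), the method from_pdb_lines, and the helper make_atom_line are all hypothetical names chosen for this example. The sketch reads only the ATOM records of a PDB file, using the fixed column positions of that record type, and ignores alternate locations, insertion codes, and all other record types.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    x: float
    y: float
    z: float


@dataclass
class Residue:
    name: str
    number: int
    atoms: list = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    residues: list = field(default_factory=list)


@dataclass
class Protein:
    chains: dict = field(default_factory=dict)

    @classmethod
    def from_pdb_lines(cls, lines):
        """Build the Protein > Chain > Residue > Atom hierarchy from ATOM records."""
        protein = cls()
        for line in lines:
            if not line.startswith("ATOM"):
                continue  # skip HETATM, REMARK, TER, and other record types
            # ATOM records use fixed columns (numbered from 1 in the format guide).
            atom = Atom(
                name=line[12:16].strip(),   # columns 13-16: atom name
                x=float(line[30:38]),       # columns 31-38: x coordinate
                y=float(line[38:46]),       # columns 39-46: y coordinate
                z=float(line[46:54]),       # columns 47-54: z coordinate
            )
            res_name = line[17:20].strip()  # columns 18-20: residue name
            chain_id = line[21]             # column 22: chain identifier
            res_number = int(line[22:26])   # columns 23-26: residue number
            chain = protein.chains.setdefault(chain_id, Chain(chain_id))
            residues = chain.residues
            # Start a new residue whenever the residue number changes.
            if not residues or residues[-1].number != res_number:
                residues.append(Residue(res_name, res_number))
            residues[-1].atoms.append(atom)
        return protein


def make_atom_line(serial, name, res_name, chain_id, res_number, x, y, z):
    """Hypothetical helper: format one synthetic ATOM record for the demo."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain_id}"
            f"{res_number:4d}    {x:8.3f}{y:8.3f}{z:8.3f}")


if __name__ == "__main__":
    demo = [
        make_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
        make_atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    ]
    print(Protein.from_pdb_lines(demo))
```

Keeping the parser as a method of the top-level class mirrors the idea described above, where each calculation belongs to the class it operates on.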

Using the PAPIA library, we have also been building an assorted collection of efficient parallel programs, "PAPIA applications", which can be used for typical protein analysis. We have ported this collection of practical applications to the PAPIA cluster.

The most significant functions of the current PAPIA system are the following three parallel calculations:

Protein Similar Structure Search (best fit of three-dimensional coordinates)
Protein Homologous Sequence Search (dynamic programming)
Protein Multiple Sequence Alignment (combinatorial optimization)
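
As an illustration of the dynamic programming behind a homologous sequence search, the following Python sketch scores a query against each database sequence with the Smith-Waterman local alignment recurrence. It is a simplified stand-in, not the PAPIA implementation: the function names, the match/mismatch scores, and the linear gap penalty are assumptions made for this example, whereas practical searches use amino acid substitution matrices and usually affine gap penalties.

```python
def local_alignment_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score (linear gap penalty)."""
    previous = [0] * (len(target) + 1)  # DP row for the previous query residue
    best = 0
    for q in query:
        current = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            score = max(
                0,                      # start a new local alignment here
                diagonal,               # align q with t
                previous[j] + gap,      # gap in the target
                current[j - 1] + gap,   # gap in the query
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


def search(query, database):
    """Rank (name, sequence) entries by score, best first."""
    hits = [(local_alignment_score(query, seq), name) for name, seq in database]
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    toy_database = [("p1", "MKVLAA"), ("p2", "GGGG"), ("p3", "MKV")]
    for score, name in search("MKVL", toy_database):
        print(name, score)
```

Each database entry is scored independently of the others, which is one reason this kind of search parallelizes well: the database can be divided among the nodes and the per-node hit lists merged at the end.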

We have also implemented a job submission mechanism that uses a WWW browser as the user interface for entering query sequences and several parameters. Any user can easily submit jobs over the WWW from remote sites on the Internet. The mechanism consists of the following modules: a) HTML-based input forms as the user interface, b) CGI scripts for submitting and monitoring jobs for the PAPIA system, c) FIFO queues for each service, and d) Java and HTML-based graphic output forms as the user interface.
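
The role of the per-service FIFO queues in this mechanism can be sketched as follows. This Python example is only an illustration under stated assumptions: the class JobQueues, the service names, and the status strings are hypothetical and do not describe the actual CGI scripts. It shows jobs being accepted in submission order, one queue per service, with a status lookup of the kind a monitoring script would need.

```python
from collections import deque
from itertools import count

# Hypothetical service names, one per parallel calculation offered.
SERVICES = ("structure_search", "sequence_search", "multiple_alignment")


class JobQueues:
    """One first-in, first-out queue per service, plus job status tracking."""

    def __init__(self, services=SERVICES):
        self._queues = {name: deque() for name in services}
        self._status = {}
        self._ids = count(1)  # job IDs are assigned in submission order

    def submit(self, service, parameters):
        """Queue a job for the given service and return its job ID."""
        job_id = next(self._ids)
        self._queues[service].append((job_id, parameters))
        self._status[job_id] = "queued"
        return job_id

    def run_next(self, service, handler):
        """Run the oldest queued job of a service; return (job_id, result)."""
        queue = self._queues[service]
        if not queue:
            return None  # nothing waiting for this service
        job_id, parameters = queue.popleft()
        self._status[job_id] = "running"
        result = handler(parameters)
        self._status[job_id] = "done"
        return job_id, result

    def status(self, job_id):
        """Report the state of a job, as a monitoring page would."""
        return self._status.get(job_id, "unknown")


if __name__ == "__main__":
    queues = JobQueues()
    first = queues.submit("sequence_search", {"query": "MKVL"})
    queues.submit("sequence_search", {"query": "ACDE"})
    print(queues.run_next("sequence_search", lambda p: len(p["query"])))
    print(queues.status(first))
```

Separate queues keep a long-running job in one service from delaying jobs submitted to another, while jobs within a service are still handled strictly in arrival order.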

PAPIA References

When publishing results obtained using the PAPIA system, please cite:

Yutaka Akiyama, Kentaro Onizuka, Tamotsu Noguchi, Makoto Ando: "Parallel Protein Information Analysis (PAPIA) system running on a 64-node PC Cluster.", Proc. the 9th Genome Informatics Workshop (GIW'98), Universal Academy Press, pp.131-140 (1998).