September 11(Wed) - 13(Fri), 2013
The Annual Workshop of Computational Biology Research Center, AIST
Paul Horton (Principal Research Manager, CBRC, AIST)
"Advances in Predicting Protein Sub-cellular Localization Signals"
The prediction of protein sub-cellular localization signals is a long-standing problem in bioinformatics, but much remains unsolved.
In this talk I will describe advances in two areas: nucleo-cytosolic sorting signals and mitochondrial sorting signals.
For mitochondrial matrix-targeting signals and cleavage site prediction I will describe our novel MitoFates method, which is marginally better for signal prediction and significantly better for cleavage site prediction.
For nucleo-cytosolic sorting signals I will describe NESsential, our predictor for leucine-rich NESs, and ValidNES, a database of those signals. Furthermore, I will describe a collaboration with Naoko Imamoto's group, in which we characterized the differences between nuclear localization signals recognized by different carriers, and our plans for NLS-NESdb, which we hope will become the standard resource for information on these signals.
Yutaka Saito (RNA Informatics Team, CBRC, AIST)
"Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions"
We present Bisulfighter, a new software package for detecting methylated cytosines (mCs) and differentially methylated regions (DMRs) from bisulfite sequencing data. Bisulfighter combines the LAST alignment tool for mC calling with a novel framework for DMR detection based on hidden Markov models (HMMs). Unlike previous attempts that depend on empirical parameters, Bisulfighter can utilize the expectation-maximization algorithm for HMMs to adjust parameters for each dataset. We conduct extensive experiments in which the accuracy of mC calling and DMR detection is evaluated on simulated data with various mC contexts, read qualities, sequencing depths, and DMR lengths, as well as on real data from a wide range of biological processes including pathogenesis and normal development. We demonstrate that Bisulfighter consistently achieves better accuracy than other published tools, providing greater sensitivity for mCs with fewer false positives, more precise estimates of mC levels, more exact locations of DMRs, and better agreement of DMRs with gene expression and DNase I hypersensitivity.
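The EM-based parameter fitting mentioned above can be illustrated with a toy two-state HMM (background vs. DMR) over binarized per-cytosine differences. This is a minimal sketch under simplifying assumptions, not Bisulfighter's actual model; all names are hypothetical.

```python
import math

# Toy two-state HMM (state 0 = background, state 1 = DMR) over binary
# observations (1 = cytosine shows a methylation difference).

def forward_backward(obs, trans, emit, init):
    """Scaled forward-backward: returns per-position state posteriors
    and the log-likelihood of the observation sequence."""
    n, S = len(obs), len(init)
    alpha, scale = [], []
    prev = [init[s] * emit[s][obs[0]] for s in range(S)]
    for t in range(n):
        if t > 0:
            prev = [emit[s][obs[t]] *
                    sum(alpha[-1][r] * trans[r][s] for r in range(S))
                    for s in range(S)]
        z = sum(prev)
        scale.append(z)
        alpha.append([a / z for a in prev])
    beta = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r]
                       for r in range(S)) / scale[t + 1] for s in range(S)]
    gamma = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in range(S)]
        z = sum(g)
        gamma.append([x / z for x in g])
    return gamma, sum(math.log(z) for z in scale)

def em_emissions(obs, trans, emit, init, iters=10):
    """Re-estimate emission probabilities only (a partial EM step);
    the log-likelihood is guaranteed not to decrease."""
    lls = []
    for _ in range(iters):
        gamma, ll = forward_backward(obs, trans, emit, init)
        lls.append(ll)
        emit = []
        for s in range(len(init)):
            w = sum(g[s] for g in gamma)
            p1 = sum(g[s] for g, o in zip(gamma, obs) if o == 1) / w
            emit.append([1.0 - p1, p1])
    return emit, lls
```

The point of the sketch is the contrast drawn in the abstract: instead of fixing emission parameters empirically, each dataset re-estimates them by EM, with the log-likelihood increasing monotonically.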
Kazutaka Katoh (IFReC, Osaka University)
"Multiple sequence alignment of a large number of sequences"
As sequencing technologies advance, multiple sequence alignments (MSAs) of larger numbers of sequences are becoming necessary. I discuss two different approaches for constructing a large MSA. (1) In a standard progressive method, the time-limiting factor is the construction of the guide tree. The parttree option of MAFFT computes an approximate guide tree using a recursive algorithm. It can be applied to larger datasets than conventional methods, but the quality of the resulting alignment is reduced by this approximation. (2) As the number of sequences in an MSA increases, the similarity among the sequences becomes higher, since an MSA usually consists of homologous sequences. Accordingly, we can build a backbone alignment using only a moderate number of representative sequences, and then add the remaining sequences to the backbone. If representatives are selected appropriately, the latter step should require alignment of closely related sequences only.
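The second approach, selecting representatives and then attaching each remaining sequence to a close representative, can be sketched as follows. The greedy farthest-point heuristic and the Hamming distance are illustrative choices, not necessarily those used in MAFFT.

```python
def hamming(a, b):
    """Mismatch count between equal-length sequences (a stand-in for a
    proper evolutionary distance)."""
    return sum(x != y for x, y in zip(a, b))

def pick_representatives(seqs, k):
    """Greedy farthest-point selection of k representative sequences
    for the backbone alignment."""
    reps = [0]
    dist = [hamming(seqs[0], s) for s in seqs]
    while len(reps) < k:
        far = max(range(len(seqs)), key=dist.__getitem__)
        reps.append(far)
        dist = [min(d, hamming(seqs[far], s)) for d, s in zip(dist, seqs)]
    return reps

def assign_to_backbone(seqs, reps):
    """Attach each remaining sequence to its closest representative, so
    the addition step only aligns closely related sequences."""
    return {i: min(reps, key=lambda r: hamming(seqs[r], seqs[i]))
            for i in range(len(seqs)) if i not in reps}
```

MAFFT exposes the addition step directly: its `--add` option inserts unaligned sequences into an existing (backbone) alignment.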
Anish MS Shrestha (Department of Computational Biology, University of Tokyo)
"Mapping paired-end reads to a reference genome"
Many high-throughput sequencing experiments produce paired DNA reads.
Paired-end DNA reads provide extra positional information that is useful in reliable mapping of short reads to a reference genome as well as in downstream analyses of structural variations. We present a new probabilistic framework to predict the alignment of paired-end reads to a reference genome. Using both simulated and real data, we compared the performance of our method against six other read-mapping tools that provide a paired-end option. Our method provides a good combination of accuracy, error rate, and computation time, especially in more challenging and practical cases such as when the reference genome is incomplete or unavailable for the sample, or when there are large variations between the reference genome and the source of the reads. An open-source implementation of our method is available as part of LAST, a multi-purpose alignment program freely available at
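The core idea of combining per-read alignment evidence with the extra positional information of a read pair can be sketched as a posterior over placement pairs. The Gaussian fragment-length prior and all names below are illustrative assumptions, not the exact model implemented in LAST.

```python
import math

def pair_posteriors(cands1, cands2, frag_mean, frag_sd):
    """cands1/cands2: candidate alignments for each read of a pair,
    as (position, log_likelihood) tuples. Combines the two alignment
    likelihoods with a Gaussian log-prior on the implied fragment
    length and returns a normalized posterior over placement pairs.
    (An illustrative simplification of a paired-end mapping model.)"""
    logp = {}
    for p1, ll1 in cands1:
        for p2, ll2 in cands2:
            frag = abs(p2 - p1)
            logp[(p1, p2)] = (ll1 + ll2
                              - 0.5 * ((frag - frag_mean) / frag_sd) ** 2)
    m = max(logp.values())  # log-sum-exp normalization
    z = sum(math.exp(v - m) for v in logp.values())
    return {k: math.exp(v - m) / z for k, v in logp.items()}
```

Even when each read alone maps ambiguously, a placement pair whose implied fragment length matches the library's distribution receives nearly all of the posterior mass.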
Martin Frith (Sequence Analysis Team, CBRC, AIST)
"Explaining the correlated properties of mammalian promoters"
Proximal promoters vary in terms of CpG abundance, TATA boxes, evolutionary conservation, spread of transcription start sites, and breadth of expression across cell types. These properties are correlated, and it has been suggested that there are two promoter classes: one with high CpG content, widely spread start sites, and broad expression, and another with TATA boxes, narrowly spread start sites, and restricted expression. It has been unclear, however, why these properties are correlated.
We re-examined these features using the FANTOM5 CAGE data from hundreds of cell types. Firstly, we describe biases in previous definitions of promoters and expression breadth. Secondly, we show that most promoters are rather non-specifically expressed across many cell types. Thirdly, promoters’ expression breadth is independent of maximum expression level, and therefore correlates with average expression level. Fourthly, the data show a network of direct and indirect correlations among promoter properties. By distinguishing the direct from the indirect correlations, we reveal simple explanations for them.
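A standard way to distinguish a direct correlation from an indirect one is the partial correlation: the correlation remaining between two properties after controlling for a third. The sketch below is a generic illustration of that calculation, not the talk's actual analysis.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def partial_corr(x, y, z):
    """Correlation of x and y with the variable z controlled for.
    Near zero means the x-y correlation is indirect, mediated by z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / (((1 - rxz ** 2) * (1 - ryz ** 2)) ** 0.5)
```

For example, if two promoter properties each track a third one, their raw correlation can be very high while their partial correlation given the third is essentially zero, revealing the correlation as indirect.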
Yoshinobu Baba (Graduate School of Engineering and Innovative Nanobiodevice Research Center, Nagoya University; Health Research Institute, AIST)
References: Gendai Kagaku, March 2013 issue; Nature Biotech., 22, 337 (2004); Nature Biotech., 22, 1360 (2004); ACS Nano, 4, 121 (2010); ACS Nano, 5, 493 (2011); ACS Nano, 5, 7775 (2011); ACS Nano, 5, 9264 (2011); Nano Lett., 12, 6145 (2012); ACS Nano, 7, 3029 (2013); Nano Lett., in press (2013).
Masaki Sasai (Graduate School of Engineering, Nagoya University)
Gene expression in eukaryotes is also thought to be strongly influenced by the three-dimensional arrangement of chromatin in the nucleus. Recent methods based on Chromosome Conformation Capture (3C) technology provide genome-wide estimates of the contact frequency between pairs of loci on interphase chromosomes. From such 3C-based data, an average three-dimensional structure of the genome in the nucleus can be inferred, and dynamical simulations can then be performed. In this talk, I will illustrate such genome dynamics simulations using budding yeast, the most thoroughly studied example. Although chromosomes undergo large stochastic motions within the nucleus, the positions of individual genes do not mix randomly; rather, each gene moves within a region that might be called its territory.
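The first step of such a pipeline, turning 3C contact frequencies into an average spatial structure, can be sketched with a simple stress-minimization embedding. The frequency-to-distance rule d = f**(-1/alpha) is a common heuristic and the whole sketch is illustrative, not the model used in the talk.

```python
import random

def embed(freqs, n, alpha=1.0, steps=2000, lr=0.01):
    """freqs: {(i, j): contact frequency} for n loci. Converts each
    frequency to a target distance d = f**(-1/alpha) (a common
    heuristic), then minimizes the stress sum((|x_i - x_j| - d)^2)
    over 3-D positions by gradient descent."""
    random.seed(0)
    pos = [[random.random() for _ in range(3)] for _ in range(n)]
    target = {k: f ** (-1.0 / alpha) for k, f in freqs.items()}

    def stress():
        s = 0.0
        for (i, j), d in target.items():
            r = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j])) ** 0.5
            s += (r - d) ** 2
        return s

    for _ in range(steps):
        for (i, j), d in target.items():
            diff = [a - b for a, b in zip(pos[i], pos[j])]
            r = max(sum(c * c for c in diff) ** 0.5, 1e-9)
            g = 2.0 * (r - d) / r  # gradient factor of (r - d)^2
            for k in range(3):
                pos[i][k] -= lr * g * diff[k]
                pos[j][k] += lr * g * diff[k]
    return pos, stress()
```

An inferred structure of this kind can then serve as the starting point (or constraint set) for the dynamical calculations described above.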
Koji Tsuda (Leader, Machine Learning Research Group, CBRC, AIST)
"Statistical Significance of Combinatorial Regulations"
More than three transcription factors often work together to enable cells to respond to various signals. The detection of combinatorial regulation by multiple transcription factors, however, is not only computationally nontrivial but also extremely unlikely because of multiple testing correction. The exponential growth in the number of tests forces us to set a strict limit on the maximum arity. Here, we propose an efficient branch-and-bound algorithm called the “limitless arity multiple-testing procedure” (LAMP) to count the exact number of testable combinations and calibrate the Bonferroni factor to the smallest possible value. LAMP lists significant combinations without any limit on arity, while the family-wise error rate is rigorously controlled under the threshold. In the human breast cancer transcriptome, LAMP discovered statistically significant combinations of as many as eight binding motifs. This method may contribute to uncovering pathways regulated in a coordinated fashion and to finding hidden associations in heterogeneous data.
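The calibration idea, counting only "testable" combinations when setting the Bonferroni factor (due to Tarone), can be shown with a brute-force sketch. LAMP's contribution is doing this efficiently with branch-and-bound; the enumeration below is exponential and purely illustrative, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb

def min_pvalue(n, n1, N):
    """Smallest one-sided Fisher exact-test p-value attainable for a
    combination occurring in n of N samples, of which n1 are positive:
    the most extreme 2x2 table consistent with those margins."""
    x = min(n, n1)
    return comb(n1, x) * comb(N - n1, n - x) / comb(N, n)

def tarone_threshold(data, labels, alpha=0.05):
    """data: list of per-sample item-sets; labels: 0/1 class labels.
    Finds the smallest k with m(k) <= k, where m(k) counts the
    combinations testable at level alpha/k, and returns (alpha/k, k)."""
    N, n1 = len(data), sum(labels)
    items = sorted({i for row in data for i in row})
    supports = []
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            n = sum(1 for row in data if set(c) <= row)
            if n > 0:
                supports.append(n)
    k = 1
    while sum(1 for n in supports if min_pvalue(n, n1, N) <= alpha / k) > k:
        k += 1
    return alpha / k, k
```

Because combinations with tiny support can never reach significance, they are excluded from the correction factor, which is what allows high-arity combinations to remain detectable.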
○Kana Shimizu1, Koji Nuida2, Hiromi Arai3, Shigeo Mitsunari4, Michiaki Hamada1,5, Koji Tsuda1, Jun Sakuma6, Takatsugu Hirokawa1,7, Goichiro Hanaoka2, Kiyoshi Asai1, 5
(1 CBRC, AIST; 2 RISEC, AIST; 3 RIKEN; 4 Cybozu Labs; 5 University of Tokyo; 6 University of Tsukuba; 7 Molprof, AIST)
"An efficient privacy-preserving similarity search protocol for chemical compound databases"
Searching for similar compounds in a database is the most important step in in-silico drug screening. Since a query compound is an important starting point for a new drug, a query holder who fears that the query may be monitored by the database server usually downloads all records in the database and uses them in a closed network. A serious dilemma arises, however, when the database holder also wants to disclose no information except the search results, and this dilemma prevents efficient use of many important databases. There is therefore an emerging demand for new technology that overcomes this dilemma. In this study, we propose a novel protocol that enables database search while preserving both the query holder's privacy and the database holder's privacy. In general, such a privacy-preserving protocol entails highly time-consuming cryptographic techniques such as general-purpose multi-party computation, but our protocol is designed without relying on such techniques and is built only from an additive-homomorphic cryptosystem. Hence its performance is highly efficient in both CPU time and communication size, and it scales easily to large databases. In an experiment searching ChEMBL, which contains more than 1,200,000 compounds, the proposed method was 50,000 times faster in CPU time and 12,000 times more efficient in communication size compared to general-purpose multi-party computation. Privacy-related technology has so far been scarcely discussed in the field of bioinformatics; we therefore believe our study serves as an important model for examining practical applications of privacy-preserving data mining.
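The additive homomorphism that such protocols build on can be demonstrated with a toy Paillier cryptosystem: multiplying ciphertexts adds plaintexts, so a server can compute an encrypted fingerprint intersection count without seeing the query. The parameters below are tiny and insecure, and this is NOT the authors' actual protocol, only the primitive it rests on.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic). Demo primes only --
# completely insecure, for illustration.
P, Q = 1117, 1123
N = P * Q
N2 = N * N
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)  # valid because we take g = N + 1

def enc(m):
    """Encrypt m: (N+1)^m * r^N mod N^2 with random r coprime to N."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2

def dec(c):
    """Decrypt via L(c^lambda mod N^2) * mu mod N, L(u) = (u-1)/N."""
    u = pow(c, LAM, N2)
    return ((u - 1) // N * MU) % N

def enc_dot(enc_bits, plain_bits):
    """Server side: inner product of an encrypted query bit vector with
    the server's own plaintext bits, using only ciphertext products
    (= additions of the underlying plaintexts)."""
    acc = enc(0)
    for c, b in zip(enc_bits, plain_bits):
        if b:
            acc = (acc * c) % N2
    return acc
```

The query holder decrypts only the intersection count c; with the fingerprint bit counts |A| and |B|, a Tanimoto-style similarity is c / (|A| + |B| - c). The published protocol goes further, protecting both parties rather than revealing raw counts.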
Chie Motono1, 2, Junichi Nakata2, Ryotaro Koike3, Kana Shimizu2, Matsuyuki Shirota4, Takayuki Amemiya5, Kentaro Tomii2, Nozomi Nagano2, Naofumi Sakaya2, 6, Kiyotaka Misoo6, Miwa Sato1, 5, 7, Akinori Kidera5, 8, Hidekazu Hiroaki9, Tsuyoshi Shirai10, Kengo Kinoshita4, Tamotsu Noguchi2, Motonori Ota3
(1 Molecular Profiling Research Center for Drug Discovery, National Institute of Advanced Industrial Science and Technology (AIST), 2 Computational Biology Research Center (CBRC), AIST, 3 Graduate School of Information Science, Nagoya University, 4 Graduate School of Information Science, Tohoku University, 5 Department of Supramolecular Biology, Yokohama City University, 6 Information and Mathematical Science Laboratory Inc., 7 Mitsui Knowledge Industry Co., Ltd., 8 Department of Computational Science Research Program, RIKEN, 9 Graduate School of Pharmaceutical Sciences, Nagoya University, 10 Department of Bioscience, Nagahama Institute of Bioscience and Technology)
"SAHG, a comprehensive database of predicted structures of all human proteins"
We have constructed a novel database, SAHG, the Structural Atlas of the Human Genome (http://bird.cbrc.jp/sahg/), which presents structure models for proteins encoded in the human genome. All open reading frames in the human genome were subjected to a fully automated, exhaustive protein-structure-prediction pipeline to generate structure models. SAHG contains 42,577 domain-structure models covering ~24,800 unique human protein sequences from the RefSeq database.
Most proteins from higher organisms are multi-domain proteins and contain a substantial number of intrinsically disordered (ID) regions. To analyze such protein sequences, we developed structure prediction methods combining various alignment methods (BLAST, PSI-BLAST, the Smith-Waterman profile-profile alignment, FORTE, and a probabilistic profile-profile alignment), a prediction tool for ID regions (POODLE), and the modeling tool MODELLER. SAHG also annotates models with predicted protein-protein and protein-ligand interactions and links to other databases such as EzCatDB, InterPro, and HPRD.
Natsuko ICHIKAWA, Machi SASAGAWA, Mika YAMAMOTO, Hisayuki KOMAKI, and Nobuyuki FUJITA
(Biological Resource Center, NITE (NBRC))
"Development of manually curated database of secondary metabolite biosynthesis gene clusters"
Secondary metabolites produced by bacteria have pharmacologically important activities and attract attention as lead compounds and candidates for drug development. Biosynthesis of structurally diverse secondary metabolites is mainly catalyzed by a number of enzymes encoded in large gene clusters spanning 10 kb to 100 kb. Recent developments in combinatorial biosynthesis have made it possible to obtain novel compounds by heterologous expression and genetic manipulation. However, it is still difficult to obtain information about a particular enzyme reaction: information about even a single gene cluster is often dispersed across many references and is not described in a comprehensive manner, and no existing database integrates such dispersed information on whole biosynthesis clusters.
We have developed an integrative database focused on known secondary metabolite biosynthetic gene clusters. Our database provides functionally classified and up-to-date knowledge of each gene in known biosynthesis clusters, enabling users to identify promising candidates for application among the gene clusters.
Toshiaki Katayama (DBCLS, Research Organization of Information and Systems)
"International collaborations for semantic data integration and interoperability"
National Bioscience Database Center (NBDC) and Database Center for Life Science (DBCLS) have organized annual BioHackathons since 2008. The BioHackathon is a unique international workshop in which working-level domestic participants and invited foreign developers meet face-to-face to resolve current issues in the life sciences on site. Each year, developers of major bioinformatics databases, services, and tools gather for one week of intensive discussion and software development to improve the data integration and interoperability of biomedical infrastructures. To achieve this goal, we focus on the utilization of Semantic Web technologies, including generation of RDF data, development of ontologies, and integration of distributed SPARQL endpoints. As a consequence, new open-source software and community-driven ontologies have been developed, which in turn are used to provide various biomedical services. Because data in the life science domain are often bulky and heterogeneous, it is important to gain community agreement on the semantics and exchange formats of data to improve interoperability. This kind of standardization requires long-term continuous effort, and the BioHackathons have provided practical opportunities for fostering international collaboration.
(1The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine. 2Department of Physiology and Biophysics, The Weill Cornell Medical College, New York, NY, USA. )
"GobyWeb: simplified management and analysis of gene expression and DNA methylation sequencing data"
The talk will discuss GobyWeb, a web-based system that we have developed to facilitate the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced representation bisulfite sequencing and whole-genome methyl-seq), and the detection of pathogens in sequencing data. In contrast to previous pipelines developed for the analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, and scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. The talk will present the design of the system, the challenges encountered during its development, and recent developments. GobyWeb can be obtained at
Jennifer Russo Wortman
(Microbial Informatics, Broad Institute)
"Keeping up with the next generation: bioinformatics approaches and challenges"
Massively parallel sequencing (MPS) technology has dramatically reduced the cost of genome sequencing in recent years, making large scale genomic, transcriptomic, and metagenomic data affordable for a wide range of biological researchers. However, sequencing accessibility has not been accompanied by equal availability of tools for assembly, annotation and analysis. At the Broad Institute, we have extensive experience generating and analyzing large scale -omics data, and have developed and collaborated on a number of open source tools to help enable the larger scientific community. This talk describes tools for microbial genome assembly and analysis, transcriptome assembly and analysis, and metagenomic data interpretation.