Wilhelmiina Hamalainen


Academy of Finland

 Mining statistically significant association rules

The problem with traditional association rule mining is that it produces a huge number of spurious rules but does not necessarily capture any statistically significant dependencies. This talk concerns genuine statistical association rules (dependency rules), which capture the most significant, non-redundant associations in data. It introduces several interpretations of statistical dependence, how to test the statistical significance of rule-formed patterns, and how to search for genuine rules efficiently. In addition, we will reflect on how techniques from frequent association mining could be utilized in genuine pattern discovery.
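
As an illustration of testing the significance of a single rule X -> A, the following minimal sketch computes a one-sided Fisher's exact p-value from the rule's counts. The choice of Fisher's exact test, the function name and the example counts are assumptions made for illustration; the abstract does not say which tests the talk covers.

```python
from math import comb


def fisher_exact_greater(n_xa: int, n_x: int, n_a: int, n: int) -> float:
    """One-sided Fisher's exact p-value for the rule X -> A.

    n_xa: rows containing both X and A
    n_x:  rows containing X
    n_a:  rows containing A
    n:    total number of rows
    Returns the probability of seeing at least n_xa co-occurrences
    if X and A were independent, with both margins held fixed.
    """
    upper = min(n_x, n_a)  # largest possible co-occurrence count
    total = comb(n, n_x)   # number of ways to place X in n rows
    tail = sum(
        comb(n_a, k) * comb(n - n_a, n_x - k)
        for k in range(n_xa, upper + 1)
    )
    return tail / total


# Hypothetical counts: X occurs in 5 of 20 rows, A in 10 of 20 rows,
# and all 5 occurrences of X fall in rows that contain A.
p_value = fisher_exact_greater(n_xa=5, n_x=5, n_a=10, n=20)
print(f"p = {p_value:.4f}")
```

A rule with high confidence can still receive a large p-value when X is rare, which is one way a frequent but spurious rule differs from a statistically significant one.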




Koji Tsuda


University of Tokyo / AIST

 Limitless Arity Multiple Testing Procedures for Combinatorial Hypotheses

 In biology, a trait is often caused by the interaction of multiple factors, including genetic variants, epigenetic status and environmental conditions. To discover such combinatorial factors from data, pattern mining methods such as itemset mining are potentially useful, but they have barely been used by experimental biologists because of the difficulty of assessing statistical significance. If n elementary factors are available, the number of all combinatorial factors is 2^n - 1, which prevents conventional multiple testing methods from yielding statistically significant discoveries. Our algorithm, termed the limitless arity multiple testing procedure (LAMP), counts the number of testable hypotheses, thereby avoiding the combinatorial explosion problem. It can be used with all kinds of pattern mining algorithms. LAMP discovered a statistically significant combination of as many as eight transcription factors associated with breast cancer, which could not be found by conventional multiple testing methods, including FDR-based ones.
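
To make the combinatorial explosion concrete, the short sketch below computes the number of combinatorial factors, 2^n - 1, and the per-hypothesis threshold that a plain Bonferroni correction would impose. It illustrates the conventional correction only, not LAMP itself; the function names and the significance level of 0.05 are assumptions.

```python
def num_combinations(n_factors: int) -> int:
    """Number of non-empty combinations of n elementary factors."""
    return 2 ** n_factors - 1


def bonferroni_threshold(alpha: float, n_factors: int) -> float:
    """Per-hypothesis threshold when every combination is tested."""
    return alpha / num_combinations(n_factors)


# The threshold shrinks exponentially as elementary factors are added.
for n in (10, 20, 100):
    print(n, num_combinations(n), bonferroni_threshold(0.05, n))
```

With even a hundred elementary factors the threshold becomes far smaller than any p-value a realistic sample size can produce, which is why restricting the count to testable hypotheses matters.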




Yoichiro Kamatani


RIKEN

 Statistical analyses used for gene mapping of human diseases

 Gene mapping studies for human diseases, especially common diseases, are now widely performed using dense SNP arrays or sequencing data. Experimental genetic mapping, such as cross-breeding, cannot ethically be performed in humans, so statistical analysis plays a central role in these studies. In this talk, I briefly outline the statistical analyses applied in this field, then explain what is known and what is not known at present.




Mahito Sugiyama


Osaka University

 Multiple Testing Correction in Graph Mining

 Finding patterns whose occurrence is significantly enriched in a particular class of transactions is a difficult problem because of the need for multiple testing correction. Pruning untestable hypotheses (itemsets) was recently proposed and has proved successful in significant itemset mining. An open question, however, is whether this strategy is effective in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. In this talk, we answer this question in the affirmative: combining the strategy with frequent subgraph mining algorithms dramatically and efficiently reduces the number of candidate subgraphs (the Bonferroni factor), resulting in greater statistical power while strictly controlling the FWER for multiple testing.




Felipe Llinares Lopez


ETH Zurich 

 Exploiting discrete test statistics for significant pattern mining: theory and applications

 Given a database with objects belonging to one of two classes, the problem of finding which patterns exhibit a significant statistical association with the class labels is fundamental in many domains including medical research and computational biology. In particular, developing methods which can account for multiple hypothesis testing in a rigorous way while retaining considerable statistical power is essential.

 In this talk, Tarone’s improved Bonferroni correction for discrete data, the fundamental pillar of most statistically sound pattern mining algorithms proposed to date, will be introduced in detail. Afterwards, two novel algorithms developed recently in the Machine Learning and Computational Biology Lab at ETH Zurich will be presented. The first, Westfall-Young light, is a new way to find associated patterns based on the Westfall-Young permutation testing procedure; it is orders of magnitude more efficient in both runtime and storage than FastWY, the current state of the art. The second, the Fast Automatic Interval Search (FAIS) algorithm, detects genomic regions exhibiting genetic heterogeneity. Finally, we will discuss some of the current limitations of Tarone’s framework, pointing out the main challenges ahead.
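
The idea behind Tarone's correction can be sketched as follows. With a discrete test, a pattern that occurs in only a few objects has a minimum attainable p-value that may already exceed the corrected threshold, so it can never be significant and need not be counted. The sketch assumes a one-sided Fisher's exact test; the function names and the toy pattern supports are hypothetical, and this is not the implementation presented in the talk.

```python
from math import comb


def min_attainable_pvalue(x: int, n1: int, n: int) -> float:
    """Smallest one-sided Fisher p-value for a pattern occurring in x
    of n objects, when n1 objects belong to the positive class.

    The minimum is reached by the most extreme table, in which as many
    occurrences as possible fall in the positive class.
    """
    a_max = min(x, n1)
    return comb(n1, a_max) * comb(n - n1, x - a_max) / comb(n, x)


def tarone_threshold(min_pvalues, alpha=0.05):
    """Find the smallest k such that at most k hypotheses are testable
    at level alpha / k, and return (alpha / k, k)."""
    k = 1
    while sum(p <= alpha / k for p in min_pvalues) > k:
        k += 1
    return alpha / k, k


# Toy setting: 20 objects, 10 in the positive class, four patterns
# with the given supports (number of objects containing the pattern).
supports = [1, 2, 5, 10]
min_ps = [min_attainable_pvalue(x, n1=10, n=20) for x in supports]
threshold, k = tarone_threshold(min_ps)
print("Tarone threshold:", threshold, "with correction factor", k)
print("Bonferroni threshold:", 0.05 / len(supports))
```

In this toy setting the two rarest patterns are untestable, so the correction factor is smaller than the total number of patterns and the threshold is less strict than the plain Bonferroni one.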




James Bailey


University of Melbourne

 Statistically correcting for chance using the adjusted and standardized mutual information measures

  Mutual information is a very popular measure for assessing the strength of dependency between two features. Despite its attractiveness, mutual information has some limitations: i) it does not have a constant baseline value (average value between random partitions of a dataset) and ii) it is susceptible to selection bias. These issues may lead to inappropriate assessment of dependency strength and misleading feature rankings for classification. In this talk, we discuss the desirability of employing a statistical correction for chance for mutual information and review two enhancements that we have proposed to address these limitations: the adjusted mutual information and the standardized mutual information.
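
The lack of a constant baseline can be seen in a small simulation. The sketch below estimates the average mutual information between two unrelated partitions by random shuffling, for different numbers of groups. The Monte Carlo estimate stands in for an analytical expected value, and the function names, sample size and number of shuffles are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(u, v):
    """Mutual information (in nats) between two label sequences."""
    n = len(u)
    joint = Counter(zip(u, v))
    count_u = Counter(u)
    count_v = Counter(v)
    return sum(
        (c / n) * math.log(c * n / (count_u[a] * count_v[b]))
        for (a, b), c in joint.items()
    )


def permutation_baseline(u, v, n_perm=200, seed=0):
    """Average mutual information after randomly shuffling v, which
    removes any real dependency between u and v."""
    rng = random.Random(seed)
    shuffled = list(v)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += mutual_information(u, shuffled)
    return total / n_perm


def balanced_labels(n, k):
    """Partition of n objects into k groups of (nearly) equal size."""
    return [i % k for i in range(n)]


# The chance-level value grows with the number of groups, so raw
# mutual information favours features with many distinct values.
for k in (2, 5, 10):
    labels = balanced_labels(100, k)
    print(k, round(permutation_baseline(labels, labels), 3))
```

A correction for chance compares the observed mutual information against this expected value, so that unrelated partitions score close to zero regardless of the number of groups.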




Geoff Webb


Monash University

 Finding Interesting Patterns

Association discovery is one of the most studied tasks in the field of data mining. However, far more attention has been paid to how to discover associations than to what associations should be discovered. This talk
- highlights shortcomings of the dominant frequent pattern paradigm;
- illustrates benefits of the alternative top-k paradigm; and
- presents the self-sufficient itemsets approach to identifying potentially interesting associations.



