Tokyo Workshop on Statistically Sound Data Mining

Wilhelmiina Hamalainen

Academy of Finland

　Mining statistically significant association rules

The problem with traditional association rule mining is that it produces a huge number of spurious rules but does not necessarily catch any statistically significant dependencies. This talk concerns genuine statistical association rules (dependency rules), which capture the most significant, non-redundant associations in data. It introduces several interpretations of statistical dependence, shows how to test the statistical significance of rule-formed patterns, and explains how to search for genuine rules efficiently. In addition, we will reflect on how techniques from frequent association mining could be utilized in genuine pattern discovery.

Koji Tsuda

University of Tokyo / AIST

　Limitless Arity Multiple Testing Procedures for Combinatorial Hypotheses

　In biology, a trait is often caused by the interaction of multiple factors, including genetic variants, epigenetic status and environmental conditions. To discover such combinatorial factors from data, pattern mining methods including itemset mining are potentially useful, but they have barely been used by experimental biologists because of the difficulty of assessing statistical significance. If n elementary factors are available, the number of all combinatorial factors is 2^n - 1, which prevents conventional multiple testing methods from yielding statistically significant discoveries. Our algorithm, termed the limitless arity multiple testing procedure (LAMP), counts the number of testable hypotheses, thereby avoiding the combinatorial explosion problem. It can be used with all kinds of pattern mining algorithms. LAMP discovered a statistically significant combination of as many as eight transcription factors associated with breast cancer, which could not be found by conventional multiple testing methods, including FDR-based ones.
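The combinatorial explosion mentioned in the abstract can be made concrete with a small sketch. It only illustrates why a plain Bonferroni correction over all 2^n - 1 combinations becomes unusable; it is not the LAMP algorithm. The function name `bonferroni_threshold` and the significance level 0.05 are assumptions made for this illustration.

```python
def bonferroni_threshold(alpha: float, n_factors: int) -> float:
    """Per-hypothesis significance level when every non-empty
    combination of n_factors elementary factors is tested."""
    n_hypotheses = 2 ** n_factors - 1  # all non-empty subsets
    return alpha / n_hypotheses


if __name__ == "__main__":
    # The per-hypothesis threshold shrinks exponentially with n.
    for n in (10, 20, 30):
        print(n, 2 ** n - 1, bonferroni_threshold(0.05, n))
```

With 30 elementary factors there are already more than a billion hypotheses, so a single combination would need a p-value far below what a modest sample can produce.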

Yoichiro Kamatani

RIKEN

　Statistical analyses used for gene mapping of human diseases

　Gene mapping studies for human diseases, especially common diseases, are now widely performed using dense SNP arrays or sequencing data. Experimental genetic mapping, such as cross-breeding, cannot ethically be performed in humans, so statistical analysis plays a central role in these studies. In this talk, I briefly outline the statistical analyses applied in this field, then explain what is and is not known at present.

Mahito Sugiyama

Osaka University

　Multiple Testing Correction in Graph Mining

　Finding patterns whose occurrence is significantly enriched in a particular class of transactions is a difficult problem because of the need for multiple testing correction. Pruning untestable hypotheses (itemsets) was recently proposed and has proved successful in significant itemset mining. An open question, however, is whether this strategy is effective in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. In this talk, we answer this question in the affirmative: combining the strategy with frequent subgraph mining algorithms efficiently and dramatically reduces the number of candidate subgraphs (the Bonferroni factor), resulting in greater statistical power while strictly controlling the FWER for multiple testing.

Felipe Llinares Lopez

ETH Zurich

　Exploiting discrete test statistics for significant pattern mining: theory and applications

　Given a database with objects belonging to one of two classes, the problem of finding which patterns exhibit a significant statistical association with the class labels is fundamental in many domains including medical research and computational biology. In particular, developing methods which can account for multiple hypothesis testing in a rigorous way while retaining considerable statistical power is essential.

　In this talk, Tarone’s improved Bonferroni correction for discrete data, the fundamental pillar of most statistically sound pattern mining algorithms proposed to date, will be introduced in detail. Afterwards, two novel algorithms developed recently in the Machine Learning and Computational Biology Lab at ETH Zurich will be presented. The first, Westfall-Young light, is a new way to find associated patterns based on the Westfall-Young permutation testing procedure; it is orders of magnitude more efficient in both runtime and storage than FastWY, the current state of the art. The second is the Fast Automatic Interval Search (FAIS) algorithm, which detects genomic regions exhibiting genetic heterogeneity. Finally, we will discuss some of the current limitations of Tarone’s framework, pointing out the main challenges ahead.
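As a rough illustration of the idea behind Tarone’s correction, the sketch below assumes a one-sided Fisher’s exact test for enrichment in the positive class. A pattern occurring in x of n_total objects has a smallest attainable p-value that depends only on x; patterns whose smallest attainable p-value exceeds the corrected threshold can never be significant and need not be counted. The function names, the toy support values and the choice of test are assumptions for this example, not the speakers’ implementation.

```python
from math import comb


def min_p_value(x: int, n_pos: int, n_total: int) -> float:
    """Smallest attainable one-sided Fisher p-value for a pattern that
    occurs in x of n_total objects, n_pos of which are positive."""
    a_max = min(x, n_pos)  # most extreme table: as many positives as possible
    return (comb(n_pos, a_max) * comb(n_total - n_pos, x - a_max)
            / comb(n_total, x))


def tarone_factor(supports, n_pos: int, n_total: int, alpha: float = 0.05) -> int:
    """Smallest k such that at most k patterns are testable at level alpha / k."""
    p_mins = [min_p_value(x, n_pos, n_total) for x in supports]
    k = 1
    while sum(p <= alpha / k for p in p_mins) > k:
        k += 1
    return k


if __name__ == "__main__":
    supports = [1, 1, 2, 2, 3, 5, 8, 10]  # hypothetical pattern supports
    k = tarone_factor(supports, n_pos=10, n_total=20)
    print("Bonferroni factor:", len(supports), "Tarone factor:", k)
```

In this toy setting the correction factor is smaller than the total number of patterns, which is the source of the gain in statistical power.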

James Bailey

University of Melbourne

　Statistically correcting for chance using the adjusted and standardized mutual information measures

　Mutual information is a very popular measure for assessing the strength of dependency between two features. Despite its attractiveness, mutual information has some limitations: i) it does not have a constant baseline value (the average value between random partitions of a dataset) and ii) it is susceptible to selection bias. These issues may lead to inappropriate assessment of dependency strength and misleading feature rankings for classification. In this talk, we discuss the desirability of employing a statistical correction for chance for mutual information and review two enhancements that we have proposed to address these limitations: the adjusted mutual information and the standardized mutual information.
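The first limitation, the lack of a constant baseline, can be observed with a small simulation. The sketch below draws pairs of independent random labelings and averages their mutual information; the average is not zero and grows with the number of labels. It does not implement the adjusted or standardized measures themselves. The helper names, sample sizes and seed are assumptions chosen for the illustration.

```python
import random
from collections import Counter
from math import log


def mutual_information(u, v) -> float:
    """Empirical mutual information (in nats) between two labelings."""
    n = len(u)
    joint = Counter(zip(u, v))
    count_u, count_v = Counter(u), Counter(v)
    return sum((c / n) * log(c * n / (count_u[a] * count_v[b]))
               for (a, b), c in joint.items())


def mean_mi_of_random_labelings(n_items, n_labels, trials, rng) -> float:
    """Average MI between pairs of independent, uniformly random labelings."""
    total = 0.0
    for _ in range(trials):
        u = [rng.randrange(n_labels) for _ in range(n_items)]
        v = [rng.randrange(n_labels) for _ in range(n_items)]
        total += mutual_information(u, v)
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    # Independent labelings share no real information, yet the
    # average MI is positive and increases with the number of labels.
    for k in (2, 5, 10):
        print(k, mean_mi_of_random_labelings(50, k, 200, rng))
```

Because the baseline depends on the number of labels, raw mutual information tends to favor features with many distinct values, which motivates a correction for chance.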

Geoff Webb

Monash University

　Finding Interesting Patterns

Association discovery is one of the most studied tasks in the field of data mining. However, far more attention has been paid to how to discover associations than to what associations should be discovered. This talk
- highlights shortcomings of the dominant frequent pattern paradigm;
- illustrates benefits of the alternative top-k paradigm; and
- presents the self-sufficient itemsets approach to identifying potentially interesting associations.
