February 16, 2015
The University of Tokyo /
Advanced Industrial Science and Technology (AIST),
Computational Biology Research Center (CBRC)
Wilhelmiina Hamalainen, Academy of Finland
The problem with traditional association rules is that they produce a huge number of spurious rules yet do not necessarily capture any statistically significant dependencies. This talk concerns genuine statistical association rules (dependency rules), which capture the most significant, non-redundant associations in data. It introduces many interpretations of statistical dependence and explains how to test the statistical significance of rule-formed patterns and how to search for genuine rules efficiently. In addition, we will reflect on how techniques from frequent association mining could be utilized in genuine pattern discovery.
Koji Tsuda, University of Tokyo / AIST
In biology, a trait is often caused by the interaction of multiple factors, including genetic variants, epigenetic status and environmental conditions. To discover such combinatorial factors from data, pattern mining methods such as itemset mining are potentially useful, but they have barely been used by experimental biologists because of the difficulty of assessing statistical significance. If n elementary factors are available, the number of all combinatorial factors is 2^n - 1, which prevents conventional multiple testing methods from yielding statistically significant discoveries. Our algorithm, termed the limitless arity multiple testing procedure (LAMP), counts the number of testable hypotheses, thereby avoiding the combinatorial explosion problem. It can be used with all kinds of pattern mining algorithms. LAMP discovered a statistically significant combination of as many as eight transcription factors associated with breast cancer, which could not be found by conventional multiple testing methods, including FDR-based ones.
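The arithmetic behind the combinatorial explosion can be checked with a short sketch. It is not taken from the talk: the function names, the choice of a one-sided Fisher exact test, and the sample sizes (20 samples, 10 of them positive) are assumptions made purely for illustration.

```python
from math import comb


def bonferroni_threshold(n_factors: int, alpha: float = 0.05) -> float:
    """Per-test threshold when all 2^n - 1 non-empty combinations are tested."""
    n_hypotheses = 2 ** n_factors - 1
    return alpha / n_hypotheses


def smallest_fisher_p(n_samples: int, n_positive: int) -> float:
    """Smallest one-sided Fisher exact p-value any pattern could reach.

    It is attained by a pattern that occurs in exactly the positive samples
    and in no negative sample.
    """
    return 1 / comb(n_samples, n_positive)


# Hypothetical study: 20 samples, 10 positive.
best_possible_p = smallest_fisher_p(20, 10)

for n in (10, 30, 100):
    threshold = bonferroni_threshold(n)
    can_discover = best_possible_p <= threshold
    print(f"n={n:3d}  threshold={threshold:.3e}  discovery possible: {can_discover}")
```

Under these assumed sizes, the Bonferroni threshold for 30 or more factors falls below the best p-value the data could ever produce, so no combination can be declared significant regardless of how strong the association is.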
Yoichiro Kamatani, RIKEN
Gene mapping studies for human diseases, especially common diseases, are now widely performed using dense SNP arrays or sequencing data. Experimental genetic mapping such as cross-breeding cannot ethically be performed in humans, so statistical analysis plays a central role in these studies. In this talk, I briefly outline the statistical analyses applied in this field, then explain what is known and what is not known at present.
Mahito Sugiyama, Osaka University
Finding patterns whose occurrence is significantly enriched in a particular class of transactions is a difficult problem because of the need for multiple testing correction. Pruning untestable hypotheses (itemsets) was recently proposed and proved successful in significant itemset mining. An open question, however, is whether this strategy is effective in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. In this talk, we answer this question in the affirmative: combining the strategy with frequent subgraph mining algorithms dramatically and efficiently reduces the number of candidate subgraphs (the Bonferroni factor), resulting in greater statistical power while strictly controlling the FWER for multiple testing.
Felipe Llinares Lopez, ETH Zurich
Given a database with objects belonging to one of two classes, the problem of finding which patterns exhibit a significant statistical association with the class labels is fundamental in many domains including medical research and computational biology. In particular, developing methods which can account for multiple hypothesis testing in a rigorous way while retaining considerable statistical power is essential.
In this talk, Tarone’s improved Bonferroni correction for discrete data, the fundamental pillar of most statistically sound pattern mining algorithms proposed to date, will be introduced in detail. Afterwards, two novel algorithms developed recently in the Machine Learning and Computational Biology Lab at ETH Zurich will be presented. The first, Westfall-Young light, is a new way to find associated patterns based on the Westfall-Young permutation testing procedure; it is orders of magnitude more efficient in both runtime and storage than FastWY, the current state of the art. The second, the Fast Automatic Interval Search (FAIS) algorithm, detects genomic regions exhibiting genetic heterogeneity. Finally, we will discuss some of the current limitations of Tarone’s framework, pointing out the main challenges ahead.
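As a rough illustration of the idea behind Tarone’s correction, the sketch below computes the correction factor from pattern supports alone. It is a toy version written for this page, not the speakers' code: it assumes a one-sided Fisher exact test, the commonly stated rule "take the smallest k such that at most k hypotheses have a minimum attainable p-value of at most alpha / k", and hypothetical names (`min_attainable_p`, `tarone_factor`) and data.

```python
from math import comb


def min_attainable_p(support: int, n_samples: int, n_positive: int) -> float:
    """Smallest one-sided Fisher exact p-value for a pattern with this support.

    The extreme table places as many occurrences as possible in the
    positive class.
    """
    a = min(support, n_positive)
    return (comb(n_positive, a)
            * comb(n_samples - n_positive, support - a)
            / comb(n_samples, support))


def tarone_factor(supports, n_samples, n_positive, alpha=0.05):
    """Smallest k such that at most k patterns are testable at level alpha / k."""
    min_ps = [min_attainable_p(s, n_samples, n_positive) for s in supports]
    k = 1
    while True:
        testable = sum(p <= alpha / k for p in min_ps)
        if testable <= k:
            return k
        k += 1


# Hypothetical database: 20 samples, 10 positive, 1057 candidate patterns,
# most of which are too rare to ever reach significance.
supports = [1] * 1000 + [2] * 50 + [5] * 5 + [10] * 2
k = tarone_factor(supports, n_samples=20, n_positive=10)
print(f"Bonferroni factor: {len(supports)}, Tarone factor: {k}")
```

In this toy input the correction factor drops from 1057 to 4, because patterns with very low support can never attain a small enough p-value and therefore do not need to be counted.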
James Bailey, University of Melbourne
Mutual information is a very popular measure for assessing the strength of dependency between two features. Despite its attractiveness, mutual information has some limitations: i) it does not have a constant baseline value (its average value between random partitions of a dataset) and ii) it is susceptible to selection bias. These issues may lead to inappropriate assessment of dependency strength and misleading feature rankings for classification. In this talk, we discuss the desirability of employing a statistical correction for chance for mutual information and review two enhancements that we have proposed to address these limitations: the adjusted mutual information and the standardized mutual information.
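The non-constant baseline can be seen in a small simulation. The sketch below is an illustration only and does not implement the adjusted or standardized measures from the talk; it assumes independent, uniformly random labelings as a stand-in for random partitions, and the function names and sizes are hypothetical.

```python
import random
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def average_mi_of_random_labelings(n_items, n_labels, n_trials=200, seed=0):
    """Average MI between pairs of independent random labelings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.randrange(n_labels) for _ in range(n_items)]
        ys = [rng.randrange(n_labels) for _ in range(n_items)]
        total += mutual_information(xs, ys)
    return total / n_trials


# The labelings are independent, yet the average MI is not zero and
# grows with the number of labels.
for n_labels in (2, 5, 10):
    baseline = average_mi_of_random_labelings(n_items=100, n_labels=n_labels)
    print(f"{n_labels:2d} labels: average MI of random labelings = {baseline:.4f}")
```

Because the chance-level value depends on the number of categories, raw mutual information tends to favor features with many distinct values, which is the kind of bias a correction for chance is meant to remove.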
Geoff Webb, Monash University
Association discovery is one of the most studied tasks in the field of data mining. However, far more attention has been paid to how to discover associations than to what associations should be discovered. This talk
- highlights shortcomings of the dominant frequent pattern paradigm;
- illustrates benefits of the alternative top-k paradigm; and
- presents the self-sufficient itemsets approach to identifying potentially interesting associations.