Expected similarity estimation for large-scale anomaly detection
Also available in print in the library: W-H 14.984
Faculty: Fakultät für Ingenieurwissenschaften, Informatik und Psychologie
Institution: Institut für Neuroinformatik
Resource / media type: Dissertation, Text
Date of first publication: 2017-02-03
Anomalies are patterns in data or events which are unlikely to appear under normal conditions. It is of central interest to detect such anomalous instances, either to prevent damage or to extract valuable information from data. While statistics and machine learning have developed several excellent key techniques for anomaly detection, most of them scale poorly to large datasets: their computational complexity and memory requirements become the limiting factor. This dissertation makes several contributions to the problem of large-scale anomaly detection, centered on a novel method we introduce named EXPoSE, which estimates the similarity between a new, unseen observation and the distribution of data under normal conditions. In this way, EXPoSE measures the likelihood that a new observation is anomalous. Its core is the kernel embedding of distributions, which maps a probability measure into a reproducing kernel Hilbert space where it can be manipulated efficiently. The kernel embedding representation requires no parametric assumptions or explicit description of the probability measure. This constitutes an important advantage, since the distributions of normal and anomalous instances are in general unknown.

The main contributions of this work are efficient algorithms to train and evaluate the EXPoSE anomaly detector. This can be achieved with computational complexity and memory requirements independent of the dataset size, which is the key to solving large-scale machine learning problems. The dependence on the reproducing kernel function as a similarity measure enables application to many domains and introduces the possibility of incorporating domain and expert knowledge into the modeling process. The key techniques are further developed for online and streaming anomaly detection, where instances arrive in a possibly infinite sequence of observations.
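The core idea described above can be illustrated in a few lines. The sketch below is a minimal, hedged approximation of the EXPoSE score, not the dissertation's exact algorithm: it assumes a random Fourier feature map for an RBF kernel, so the kernel mean embedding of the normal data reduces to an average feature vector, and scoring a new observation is a single inner product. All names (`expose_score`, `phi`) and constants are illustrative.

```python
import numpy as np

# Illustrative sketch: approximate an RBF kernel with random Fourier
# features so the kernel mean embedding lives in R^D and can be stored
# and evaluated in O(D) time and memory, independent of the dataset size.
rng = np.random.default_rng(0)
d, D = 2, 256                       # input dimension, feature dimension

W = rng.normal(0.0, np.sqrt(2.0), size=(D, d))   # spectral samples (bandwidth assumed)
b = rng.uniform(0.0, 2 * np.pi, size=D)

def phi(x):
    """Approximate RBF feature map phi(x) in R^D."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

# "Training": the kernel mean embedding of the normal data is simply the
# average feature vector -- one pass over the data, O(D) memory.
X_normal = rng.normal(0.0, 1.0, size=(1000, d))
mu = np.mean([phi(x) for x in X_normal], axis=0)

def expose_score(z):
    """Similarity of z to the normal distribution; a low score flags an anomaly."""
    return float(phi(z) @ mu)

print(expose_score(np.zeros(d)))      # near the normal data: comparatively high
print(expose_score(np.full(d, 8.0)))  # far from the normal data: comparatively low
```

Because `mu` is a fixed-size vector, both the one-pass training step and each evaluation are independent of the number of training observations, which is the scalability property the abstract emphasizes.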
A crucial requirement in these applications is the ability to make predictions as data arrive, based on the information obtained from previous observations. One of the major challenges is the non-stationary nature of streams, in which our understanding of what is normal and anomalous changes over time. This introduces the necessity to adapt to such changes, e.g., by forgetting outdated information while incorporating new knowledge. The simplicity of the proposed methodologies facilitates a theoretical analysis that provides guarantees in terms of convergence rates and probabilistic bounds.
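The streaming setting with forgetting can be sketched in the same style. The snippet below is an assumed, simplified variant: it maintains a decayed mean embedding with an exponential forgetting factor `lam`, so outdated information fades while each new observation is folded in with an O(D) update. The feature map, drift scenario, and all constants are illustrative, not taken from the dissertation.

```python
import numpy as np

# Hedged sketch of streaming anomaly scoring with exponential forgetting.
rng = np.random.default_rng(1)
d, D = 2, 128
W = rng.normal(0.0, 1.0, size=(D, d))
b = rng.uniform(0.0, 2 * np.pi, size=D)

def phi(x):
    """Approximate RBF feature map (illustrative bandwidth)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

lam = 0.99            # forgetting factor: closer to 1 means a longer memory
w = np.zeros(D)       # running (decayed) mean embedding of the stream

def observe(x):
    """Score x against the current model, then fold it into the embedding."""
    global w
    score = float(phi(x) @ w)            # prediction before the update
    w = lam * w + (1.0 - lam) * phi(x)   # O(D) time and memory per item
    return score

# A non-stationary stream: the notion of "normal" drifts at t = 1000,
# and the decayed embedding tracks the recent distribution.
for t in range(2000):
    center = 0.0 if t < 1000 else 5.0
    observe(center + rng.normal(0.0, 0.3, size=d))
```

After the drift, observations near the new center score higher than observations near the old one, because the exponential decay has largely forgotten the pre-drift data.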