
Saturday, 3 February 2018

Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection (2015)




ABSTRACT:
Outlier detection in high-dimensional data presents various challenges resulting from the “curse of dimensionality.” A prevailing view is that distance concentration, i.e., the tendency of distances in high-dimensional data to become indiscernible, hinders the detection of outliers by making distance-based methods label all points as almost equally good outliers. In this paper, we provide evidence supporting the opinion that such a view is too simple, by demonstrating that distance-based methods can produce more contrasting outlier scores in high-dimensional settings. Furthermore, we show that high dimensionality can have a different impact, by reexamining the notion of reverse nearest neighbors in the unsupervised outlier-detection context. Namely, it was recently observed that the distribution of points’ reverse-neighbor counts becomes skewed in high dimensions, resulting in the phenomenon known as hubness. We provide insight into how some points (antihubs) appear very infrequently in k-NN lists of other points, and explain the connection between antihubs, outliers, and existing unsupervised outlier-detection methods. By evaluating the classic k-NN method, the angle-based technique designed for high-dimensional data, the density-based local outlier factor and influenced outlierness methods, and antihub-based methods on various synthetic and real-world data sets, we offer novel insight into the usefulness of reverse neighbor counts in unsupervised outlier detection.
EXISTING SYSTEM:
 The task of detecting outliers can be categorized as supervised, semi-supervised, and unsupervised, depending on the existence of labels for outliers and/or regular instances. Among these categories, unsupervised methods are more widely applied because the other categories require accurate and representative labels that are often prohibitively expensive to obtain.
 Unsupervised methods include distance-based methods that mainly rely on a measure of distance or similarity in order to detect outliers. A commonly accepted opinion is that, due to the “curse of dimensionality,” distance becomes meaningless, since distance measures concentrate, i.e., pairwise distances become indiscernible as dimensionality increases.
 The effect of distance concentration on unsupervised outlier detection was implied to be that every point in high-dimensional space becomes an almost equally good outlier.
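As a quick illustration of this concentration effect, the following self-contained Java sketch (illustrative only, not part of the project code; the class name and parameters are our own) draws uniformly random points and prints the ratio of the standard deviation to the mean of the distances from the origin. The ratio shrinks as dimensionality grows:

```java
import java.util.Random;

// Illustrates distance concentration: for i.i.d. uniform data, the ratio
// of the spread to the magnitude of distances from a fixed reference
// point (here, the origin) tends to 0 as dimensionality grows.
public class ConcentrationDemo {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        int n = 1000; // points per dimensionality setting
        for (int d : new int[] {2, 10, 100, 1000}) {
            double[] dist = new double[n];
            for (int i = 0; i < n; i++) {
                double sq = 0.0;
                for (int j = 0; j < d; j++) {
                    double x = rnd.nextDouble(); // coordinate in [0,1)
                    sq += x * x;
                }
                dist[i] = Math.sqrt(sq); // Euclidean distance to the origin
            }
            double mean = 0.0;
            for (double v : dist) mean += v;
            mean /= n;
            double var = 0.0;
            for (double v : dist) var += (v - mean) * (v - mean);
            double sd = Math.sqrt(var / n);
            System.out.printf("d=%4d  mean=%8.3f  sd=%6.3f  sd/mean=%.4f%n",
                    d, mean, sd, sd / mean);
        }
    }
}
```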
DISADVANTAGES OF EXISTING SYSTEM:
  •  The literature distinguishes three problems brought about by the “curse of dimensionality” in the general context of search, indexing, and data mining applications: poor discrimination of distances caused by concentration, the presence of irrelevant attributes, and the presence of redundant attributes, all of which hinder the usability of traditional distance and similarity measures.
  •  The authors conclude that despite such limitations, common distance/similarity measures still form a good foundation for secondary measures, such as shared-neighbor distances, which are less sensitive to the negative effects of the curse.
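As one concrete example of such a secondary measure, the following minimal Java sketch (our own illustration; the class and method names are assumptions) computes a shared-neighbor similarity, i.e., the overlap between the k-NN lists of two points, which tends to stay discriminative even when raw distances concentrate:

```java
import java.util.*;

// Minimal shared-neighbor (SNN) similarity: the overlap of the k-NN
// lists of two points.
public class SharedNeighbors {

    // Indices of the k nearest neighbors of point q (excluding q itself).
    static Set<Integer> knn(double[][] data, int q, int k) {
        Integer[] idx = new Integer[data.length];
        for (int i = 0; i < data.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(data[q], data[i])));
        Set<Integer> nn = new HashSet<>();
        for (int i = 0; nn.size() < k && i < idx.length; i++)
            if (idx[i] != q) nn.add(idx[i]);
        return nn;
    }

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    // SNN similarity: number of shared k-nearest neighbors.
    static int snn(double[][] data, int x, int y, int k) {
        Set<Integer> shared = knn(data, x, k);
        shared.retainAll(knn(data, y, k));
        return shared.size();
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {5, 5} };
        System.out.println("SNN(0,3) = " + snn(data, 0, 3, 2)); // close points
        System.out.println("SNN(0,4) = " + snn(data, 0, 4, 2)); // distant points
    }
}
```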
PROPOSED SYSTEM:
  •  It is crucial to understand how the increase of dimensionality impacts outlier detection. As will be explained, the actual challenges posed by the “curse of dimensionality” differ from the commonly accepted view that every point becomes an almost equally good outlier in high-dimensional space. We will present further evidence which challenges this view, motivating the (re)examination of existing methods.
  •  Reverse nearest-neighbor counts have been proposed in the past as a method for expressing the outlierness of data points, but no insight apart from basic intuition was offered as to why these counts should represent meaningful outlier scores. Recent observations that reverse-neighbor counts are affected by increased dimensionality of data warrant their re-examination for the outlier-detection task. In this light, we will revisit the ODIN method, sketched below.
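To make the idea concrete, here is a minimal Java sketch of ODIN-style scoring (our own illustration; it uses the common smoothed variant score = 1/(N_k + 1) rather than the original thresholding): compute every point's reverse-neighbor count N_k and rank points with low counts as outliers.

```java
import java.util.*;

// ODIN-style scoring sketch: N_k(x) is the number of other points that
// include x in their k-NN lists; points with the lowest counts
// (antihubs) are reported as outliers.
public class OdinSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    // Reverse-neighbor counts N_k for every point.
    static int[] reverseNeighborCounts(double[][] data, int k) {
        int n = data.length;
        int[] nk = new int[n];
        for (int q = 0; q < n; q++) {
            double[] dq = new double[n];
            for (int i = 0; i < n; i++) dq[i] = dist(data[q], data[i]);
            Integer[] idx = new Integer[n];
            for (int i = 0; i < n; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(i -> dq[i]));
            int taken = 0;
            for (int i = 0; i < n && taken < k; i++) {
                if (idx[i] == q) continue;  // skip the query point itself
                nk[idx[i]]++;               // idx[i] is in q's k-NN list
                taken++;
            }
        }
        return nk;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0.1, 0}, {0, 0.1}, {0.1, 0.1}, {3, 3} };
        int k = 2;
        int[] nk = reverseNeighborCounts(data, k);
        for (int i = 0; i < nk.length; i++)
            System.out.printf("point %d: N_%d = %d, ODIN score = %.3f%n",
                    i, k, nk[i], 1.0 / (nk[i] + 1)); // low N_k -> high score
    }
}
```

Running this on the toy data, the isolated point (3, 3) appears in nobody's 2-NN list (N_2 = 0) and therefore receives the highest score.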
ADVANTAGES OF PROPOSED SYSTEM:
  •  Demonstration of one plausible scenario where the methods based on antihubs are expected to perform well, which is a setting involving clusters of different densities. For this reason, we use synthetic data in order to control data density and dimensionality.
MODULES:
1. Pre-processing Module
2. Antihubs Calculations
3. Outlier Detection Based On Antihubs
4. Evaluation Result
MODULES DESCRIPTION:
Pre-processing Module
In this module, the data set is pre-processed. The Adult data set is used, split into training and test sets. It consists of 15 attributes, including the class attribute; the prediction task associated with the Adult data set is to determine, from census and demographic information, whether a person's income exceeds $50K per year. The data set contains both categorical and numerical attributes (the Age attribute, for example, is numerical), and incomplete records are filtered out. For the synthetic experiments, we chose a setting involving uniformly distributed random points because of the intuitive expectation that it should not contain any really prominent outliers; analogous observations can be made with other data distributions, numbers of drawn points, and distance measures. The demonstrated behavior is an inherent consequence of increasing data dimensionality, with the detected prominent outliers tending to come from the set of antihubs, i.e., points that appear in very few, if any, nearest-neighbor lists of other points in the data.
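A minimal preprocessing sketch, assuming the raw Adult data is available locally as the comma-separated file adult.data (the file name and the missing-value marker "?" follow the UCI distribution; this is our own illustration, not the project's actual code):

```java
import java.io.*;
import java.util.*;

// Minimal preprocessing sketch for the Adult data set: read the raw
// comma-separated file and drop incomplete records, i.e., rows that
// contain the missing-value marker "?".
public class AdultPreprocess {
    public static void main(String[] args) throws IOException {
        List<String[]> records = new ArrayList<>();
        int dropped = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("adult.data"))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) continue;       // skip blank lines
                String[] fields = line.split(",");
                for (int i = 0; i < fields.length; i++) fields[i] = fields[i].trim();
                if (Arrays.asList(fields).contains("?")) { // incomplete record
                    dropped++;
                    continue;
                }
                records.add(fields); // 15 fields: 14 attributes + class label
            }
        }
        System.out.println("kept " + records.size() + ", dropped " + dropped);
    }
}
```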
Antihubs Calculations
In this module, we observe antihubs as a special category of points in high-dimensional spaces. We explain the reasons behind the emergence of antihubs and examine their relation to outliers detected by unsupervised methods in the context of varying neighborhood size k. Finally, we explore the interplay of hubness and data sparsity. The existence of antihubs is a direct consequence of high dimensionality when the neighborhood size k is small compared to the size of the data. Distance concentration refers to the tendency of distances in high-dimensional data to become almost indiscernible as dimensionality increases, and is usually expressed through the ratio of a notion of spread (e.g., the standard deviation) to a notion of magnitude (e.g., the expected value) of the distribution of distances from all points in a data set to some reference point. If this ratio tends to 0 as dimensionality goes to infinity, the distances are said to concentrate. For random data with i.i.d. coordinates and Euclidean distance, concentration is reflected in the fact that, as dimensionality increases, the standard deviation of the distance distribution remains constant while its mean continues to grow. More intuitively, as dimensionality increases, all points tend to lie approximately on a hypersphere centered at the reference point, whose radius is the mean distance.
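The following self-contained Java sketch (our own illustration) shows the emergence of antihubs empirically: for i.i.d. uniform data with fixed n and k, the number of points that never appear in any other point's k-NN list (N_k = 0) grows with the dimensionality d.

```java
import java.util.*;

// Emergence of antihubs: for i.i.d. uniform random data with fixed n and k,
// count how many points never appear in any other point's k-NN list
// (N_k = 0) as dimensionality d grows.
public class AntihubEmergence {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        int n = 500, k = 5;
        for (int d : new int[] {3, 20, 100, 500}) {
            double[][] data = new double[n][d];
            for (double[] p : data)
                for (int j = 0; j < d; j++) p[j] = rnd.nextDouble();

            int[] nk = new int[n]; // reverse-neighbor counts
            for (int q = 0; q < n; q++) {
                double[] dq = new double[n];
                for (int i = 0; i < n; i++) dq[i] = dist(data[q], data[i]);
                Integer[] idx = new Integer[n];
                for (int i = 0; i < n; i++) idx[i] = i;
                Arrays.sort(idx, Comparator.comparingDouble(i -> dq[i]));
                int taken = 0;
                for (int i = 0; i < n && taken < k; i++) {
                    if (idx[i] == q) continue; // skip the query point itself
                    nk[idx[i]]++;
                    taken++;
                }
            }
            int antihubs = 0;
            for (int c : nk) if (c == 0) antihubs++;
            System.out.printf("d = %3d: %d of %d points have N_%d = 0%n",
                    d, antihubs, n, k);
        }
    }
}
```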
Outlier Detection Based On Antihubs
This module performs the antihub-based outlier detection itself. The technique proposed for identifying outliers is applied initially at distributed clients, and the outliers detected there are integrated on a server machine in the final stage of outlier computation. The outlier detection strategies used are the k-NN algorithm together with the ABOD and INFLO methods.
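Of the strategies named above, the simplest is the classic k-NN outlier score; the following Java sketch (our own illustration; ABOD and INFLO are omitted for brevity) scores each point by its distance to its k-th nearest neighbor, so that points in sparse regions receive large scores.

```java
import java.util.*;

// Classic k-NN outlier score: a point's score is its distance to its
// k-th nearest neighbor; large values indicate points in sparse regions.
public class KnnOutlierScore {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }

    static double[] scores(double[][] data, int k) {
        int n = data.length;
        double[] score = new double[n];
        for (int q = 0; q < n; q++) {
            double[] dq = new double[n - 1];
            int m = 0;
            for (int i = 0; i < n; i++)
                if (i != q) dq[m++] = dist(data[q], data[i]);
            Arrays.sort(dq);
            score[q] = dq[k - 1]; // distance to the k-th nearest neighbor
        }
        return score;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {0.1, 0}, {0, 0.1}, {0.1, 0.1}, {3, 3} };
        double[] s = scores(data, 2);
        for (int i = 0; i < s.length; i++)
            System.out.printf("point %d: kNN score = %.3f%n", i, s[i]);
    }
}
```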
Evaluation Result
In this module, the overall results are evaluated and displayed in a table. The methods mentioned above are evaluated and presented with the parameter alpha held at a fixed value. The outliers detected by the above approach are also assessed against a set of evaluation parameters, which provides details about the implemented system's performance metrics and constraints. With proper visualization of the results, the system's execution is made more understandable and explorable for its evaluators.
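As a sketch of one standard evaluation parameter for this task, assuming ground-truth outlier labels are available (the data in main is made up for illustration), the following Java snippet computes the ROC AUC of a set of outlier scores, i.e., the probability that a randomly chosen true outlier is ranked above a randomly chosen inlier.

```java
// ROC AUC for outlier scores: the probability that a randomly chosen
// true outlier is ranked above a randomly chosen inlier, with ties
// counted as one half.
public class AucEval {

    static double auc(double[] score, boolean[] isOutlier) {
        double pairs = 0, wins = 0;
        for (int i = 0; i < score.length; i++) {
            if (!isOutlier[i]) continue;           // i ranges over outliers
            for (int j = 0; j < score.length; j++) {
                if (isOutlier[j]) continue;        // j ranges over inliers
                pairs++;
                if (score[i] > score[j]) wins++;
                else if (score[i] == score[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        double[] score      = { 0.9,  0.2,   0.4,   0.3,  0.1 };
        boolean[] isOutlier = { true, false, false, true, false };
        System.out.printf("AUC = %.3f%n", auc(score, isOutlier)); // 0.833
    }
}
```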
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
 System                :         Pentium IV, 2.4 GHz
 Hard Disk             :         40 GB
 Floppy Drive          :         1.44 MB
 Monitor               :         15" VGA colour
 Mouse                 :         Logitech
 RAM                   :         512 MB
SOFTWARE REQUIREMENTS:
 Operating System      :         Windows XP/7
 Coding Language       :         Java/J2EE
 IDE                   :         NetBeans 7.4
 Database              :         MySQL
REFERENCE:
Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović, “Reverse Nearest Neighbors in Unsupervised Distance-Based Outlier Detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 5, 2015.
