
Tuesday, 6 February 2018

Discovering Emerging Topics in Social Streams via Link-Anomaly Detection (2014)

ABSTRACT:
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged in social-network posts includes not only text but also images, URLs, and videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus on mentions of users, that is, links between users that are generated dynamically (intentionally or unintentionally) through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of a social network user, and propose to detect the emergence of a new topic from the anomalies measured through the model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social-network posts. We demonstrate our technique on several real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier when the topic is poorly identified by the textual contents of posts.

EXISTING SYSTEM:
Ø A new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches to topic detection have mainly been concerned with the frequencies of (textual) words.

DISADVANTAGES OF EXISTING SYSTEM:
A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly nontextual information. On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
PROPOSED SYSTEM:
Ø In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream.
Ø The basic idea of our approach is to focus on the social aspect of the posts reflected in the mentioning behavior of users instead of the textual contents.
Ø We have proposed a probability model that captures both the number of mentions per post and the frequency with which each mentionee is mentioned.

ADVANTAGES OF PROPOSED SYSTEM:
Ø Because the proposed method does not rely on the textual contents of social network posts, it is robust to rephrasing and can be applied when topics are concerned with information other than text, such as images, video, and audio.
Ø The proposed link-anomaly-based methods performed even better than the keyword-based methods on the “NASA” and “BBC” data sets.

MODULES:
1.     Twitter Trends
2.     Training
3.     Aggregate
4.     Change Point Analysis
5.     Burst Detection

MODULES DESCRIPTION:

Twitter Trends:
*       First we design the system using keyword-based detection and link-based detection. We collected real-time data sets from Twitter.
*       Each data set is associated with a list of posts in a service.
*       This is a collaborative service where people can tag Twitter posts that are related to each other and organize a list of posts that belong to a certain topic.
*       Our goal is to evaluate whether the proposed approach can detect the emergence of the topics recognized and collected by people. For each list, we extracted a list of Twitter users that appeared in the list, and collected Twitter posts from those users.

Training:
*       In this section, we describe the probability model that we used to capture the normal mentioning behavior of a user and how to train the model.
*       We characterize a post in a social network stream by the number of mentions k it contains, and the set V of names (IDs) of the mentionees (users who are mentioned in the post).
*       There are two types of infinity we have to take into account here. The first is the number k of users mentioned in a post. Although in practice a user cannot mention hundreds of other users in a post, we would like to avoid putting an artificial limit on the number of users mentioned in a post. Instead, we assume a geometric distribution and integrate out the parameter to avoid even an implicit limitation through the parameter.
*       The second type of infinity is the number of users one can possibly mention.
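
The count part of the model described above can be sketched as follows. This is a minimal illustration, assuming a geometric distribution over the number of mentions k whose parameter has a Beta(alpha, beta) prior that is integrated out. The Beta prior, the class name MentionCountModel, and the use of the negative log-probability as the anomaly score are assumptions made for this sketch and are not confirmed by the summary above. Under these assumptions, with n training posts containing m mentions in total, the predictive probability is P(k) = (n+alpha) * prod_{j=0}^{k-1}(m+beta+j) / prod_{i=0}^{k}(n+m+alpha+beta+i).

```java
// Sketch only: Beta-geometric predictive distribution for the number of mentions k.
// The Beta(alpha, beta) prior and all names here are illustrative assumptions.
final class MentionCountModel {

    /**
     * Predictive probability of k mentions after seeing n training posts
     * that contain m mentions in total, with the geometric parameter integrated out.
     */
    static double predictive(int k, int n, int m, double alpha, double beta) {
        if (k < 0 || n < 0 || m < 0 || alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double a = n + alpha; // posterior pseudo-count of posts
        double b = m + beta;  // posterior pseudo-count of mentions
        // Work in log space so that long products neither overflow nor underflow.
        double logP = Math.log(a) - Math.log(a + b + k);
        for (int j = 0; j < k; j++) {
            logP += Math.log(b + j) - Math.log(a + b + j);
        }
        return Math.exp(logP);
    }

    /** Anomaly score of a post with k mentions: negative log predictive probability. */
    static double anomalyScore(int k, int n, int m, double alpha, double beta) {
        return -Math.log(predictive(k, n, m, alpha, beta));
    }

    public static void main(String[] args) {
        // A user with 10 past posts and 5 past mentions: the score grows with k.
        for (int k = 0; k <= 8; k += 2) {
            System.out.printf("k=%d score=%.3f%n", k, anomalyScore(k, 10, 5, 0.5, 0.5));
        }
    }
}
```

Because no upper bound on k appears anywhere in the formula, any number of mentions receives a positive probability, which is the point of avoiding an artificial limit.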

Aggregate:
*       In this module, we describe how to combine the anomaly scores from different users. The anomaly score is computed for each user depending on the current post of user u and his/her past behavior T_u^(t).
*       To measure the general trend of user behavior, we propose to aggregate the anomaly scores obtained for posts x_1, ..., x_n using a discretization of window size λ > 0.
*       Also, we assign an anomaly score to each post based on the learned probability distribution.
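
The windowed aggregation above can be sketched as follows. This is a minimal illustration, assuming that "aggregation" means summing the scores of all posts whose timestamps fall into the same window and dividing by the window size. The class name ScoreAggregator, the choice of the earliest timestamp as the origin of the first window, and the normalization by the window size are assumptions of this sketch.

```java
// Sketch only: discretize a stream of per-post anomaly scores into fixed windows.
// Names and the normalization by window size are illustrative assumptions.
final class ScoreAggregator {

    /**
     * @param times  post timestamps (any unit, any order)
     * @param scores anomaly score of each post, aligned with times
     * @param window window size (lambda > 0), in the same unit as times
     * @return one aggregated score per window, starting at the earliest timestamp
     */
    static double[] aggregate(double[] times, double[] scores, double window) {
        if (times.length != scores.length) {
            throw new IllegalArgumentException("times and scores must have equal length");
        }
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        if (times.length == 0) {
            return new double[0];
        }
        double start = times[0];
        double end = times[0];
        for (double t : times) {
            start = Math.min(start, t);
            end = Math.max(end, t);
        }
        int bins = (int) Math.floor((end - start) / window) + 1;
        double[] out = new double[bins];
        for (int i = 0; i < times.length; i++) {
            int j = (int) Math.floor((times[i] - start) / window);
            out[j] += scores[i] / window; // windows without posts stay at zero
        }
        return out;
    }

    public static void main(String[] args) {
        double[] times = {0.0, 1.0, 2.5, 5.0};
        double[] scores = {2.0, 4.0, 6.0, 8.0};
        System.out.println(java.util.Arrays.toString(aggregate(times, scores, 2.0)));
    }
}
```

In a multi-user setting, the posts of all monitored users would be merged into one stream before calling aggregate, so that each window reflects the combined behavior.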

Change Point Analysis:
*       This technique is an extension of the previously proposed ChangeFinder; it detects a change in the statistical dependence structure of a time series by monitoring the compressibility of a new piece of data.
*       Specifically, a change point is detected through two layers of scoring processes. The first layer detects outliers and the second layer detects change points. In each layer, predictive loss based on the sequentially discounting normalized maximum likelihood (SDNML) coding distribution for an autoregressive (AR) model is used as the scoring criterion. Although the normalized maximum likelihood (NML) code length is known to be optimal, it is often hard to compute.
*       Sequentially normalized maximum likelihood (SNML) is an approximation to the NML code length that can be computed in a sequential manner. SDNML further employs discounting in the learning of the AR models.
*       As a final step in our method, we need to convert the change-point scores into binary alarms by thresholding. Since the distribution of change-point scores may change over time, we need to dynamically adjust the threshold to analyze a sequence over a long period of time. The threshold is optimized dynamically using the previously proposed method of dynamic threshold optimization (DTO).
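
The thresholding step can be illustrated with a simplified sketch. It is not the DTO algorithm of the paper; it only shows the general idea of keeping a discounted histogram of past scores and raising an alarm when a new score falls in the upper tail. The class name DynamicThreshold, the fixed score range [lo, hi], the tail mass rho, and the discounting rate are all assumptions of this sketch.

```java
// Sketch only: a simplified histogram-based dynamic threshold.
// This is not the paper's DTO; every name and parameter is an illustrative assumption.
final class DynamicThreshold {
    private final double lo;       // lower end of the score range
    private final double hi;       // upper end of the score range
    private final double rho;      // tail mass that should trigger alarms
    private final double discount; // how quickly old scores are forgotten
    private final double[] q;      // discounted histogram of past scores, sums to 1

    DynamicThreshold(double lo, double hi, int bins, double rho, double discount) {
        if (hi <= lo || bins < 1 || rho <= 0 || rho >= 1 || discount <= 0 || discount >= 1) {
            throw new IllegalArgumentException("invalid parameters");
        }
        this.lo = lo;
        this.hi = hi;
        this.rho = rho;
        this.discount = discount;
        this.q = new double[bins];
        java.util.Arrays.fill(q, 1.0 / bins); // start from a uniform histogram
    }

    /** Upper edge of the first bin at which the cumulative mass reaches 1 - rho. */
    double threshold() {
        double cumulative = 0.0;
        for (int h = 0; h < q.length; h++) {
            cumulative += q[h];
            if (cumulative >= 1.0 - rho) {
                return lo + (h + 1) * (hi - lo) / q.length;
            }
        }
        return hi;
    }

    /** Tests the score against the current threshold, then updates the histogram. */
    boolean observe(double score) {
        boolean alarm = score >= threshold();
        int h = (int) Math.floor((score - lo) / (hi - lo) * q.length);
        h = Math.max(0, Math.min(q.length - 1, h)); // clamp out-of-range scores
        for (int i = 0; i < q.length; i++) {
            q[i] *= (1.0 - discount);
        }
        q[h] += discount; // total mass stays 1
        return alarm;
    }

    public static void main(String[] args) {
        DynamicThreshold dto = new DynamicThreshold(0.0, 10.0, 10, 0.05, 0.1);
        for (int i = 0; i < 50; i++) {
            dto.observe(1.5);
        }
        System.out.println("threshold=" + dto.threshold() + " alarm=" + dto.observe(8.0));
    }
}
```

Because old mass decays geometrically, the threshold follows the recent score distribution rather than the whole history, which is the motivation stated above for adjusting it over long sequences.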

Burst Detection:
*       In addition to the change-point detection based on SDNML followed by DTO described in previous sections, we also test the combination of our method with Kleinberg’s burst-detection method.
*       More specifically, we implemented a two-state version of Kleinberg’s burst-detection model. We chose the two-state version because in this experiment we expect no hierarchical structure.
*       The burst-detection method is based on a probabilistic automaton model with two states, a burst state and a non-burst state. Some events (e.g., arrival of posts) are assumed to happen according to a time-varying Poisson process whose rate parameter depends on the current state.
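
A two-state automaton of this kind can be sketched as follows. This is a minimal illustration, assuming the usual formulation of Kleinberg's model: gaps between events are exponentially distributed with rate n/T in the non-burst state and s times that rate in the burst state, entering the burst state costs gamma * ln(n), leaving it is free, and the cheapest state sequence is found by dynamic programming. The class name TwoStateBurst and the parameter values in the example are assumptions of this sketch.

```java
// Sketch only: two-state burst detection over inter-arrival gaps (Viterbi decoding).
// The cost formulation and all names are illustrative assumptions.
final class TwoStateBurst {

    /**
     * @param gaps  positive time gaps between consecutive events
     * @param s     rate multiplier of the burst state (s > 1)
     * @param gamma cost factor for entering the burst state (gamma > 0)
     * @return state per gap: 0 = non-burst, 1 = burst
     */
    static int[] detect(double[] gaps, double s, double gamma) {
        int n = gaps.length;
        if (n == 0 || s <= 1 || gamma <= 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double total = 0.0;
        for (double g : gaps) {
            total += g;
        }
        double[] rate = {n / total, s * n / total};
        double upCost = gamma * Math.log(n);
        int[][] back = new int[n][2];
        double[] prev = {0.0, Double.POSITIVE_INFINITY}; // start in the non-burst state
        for (int t = 0; t < n; t++) {
            double[] cur = new double[2];
            for (int j = 0; j < 2; j++) {
                // Negative log-density of an exponential gap under state j.
                double emit = -Math.log(rate[j]) + rate[j] * gaps[t];
                double from0 = prev[0] + (j == 1 ? upCost : 0.0);
                double from1 = prev[1]; // staying in or leaving the burst state is free
                if (from0 <= from1) {
                    cur[j] = from0 + emit;
                    back[t][j] = 0;
                } else {
                    cur[j] = from1 + emit;
                    back[t][j] = 1;
                }
            }
            prev = cur;
        }
        // Trace the cheapest path backwards from the final step.
        int[] states = new int[n];
        int state = prev[0] <= prev[1] ? 0 : 1;
        for (int t = n - 1; t >= 0; t--) {
            states[t] = state;
            state = back[t][state];
        }
        return states;
    }

    public static void main(String[] args) {
        double[] gaps = new double[30];
        for (int i = 0; i < 30; i++) {
            gaps[i] = (i >= 10 && i < 20) ? 0.1 : 10.0; // dense arrivals in the middle
        }
        System.out.println(java.util.Arrays.toString(detect(gaps, 2.0, 1.0)));
    }
}
```

The entry cost keeps the automaton from switching to the burst state on a single short gap; only a sustained run of short gaps saves enough emission cost to pay for the transition.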

SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
Ø System                          :         Pentium IV 2.4 GHz.
Ø Hard Disk                      :         40 GB.
Ø Floppy Drive                 :         1.44 MB.
Ø Monitor                         :         15-inch VGA Colour.
Ø Mouse                            :         Logitech.
Ø RAM                               :         512 MB.
SOFTWARE REQUIREMENTS:
Ø Operating system           :         Windows XP/7.
Ø Coding Language :         Java
Ø IDE                      :         Eclipse Kepler
REFERENCE:
Toshimitsu Takahashi, Ryota Tomioka, and Kenji Yamanishi, Member, IEEE, “Discovering Emerging Topics in Social Streams via Link-Anomaly Detection,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, January 2014.
