In many information processing tasks, labels are usually expensive and the unlabeled data points are abundant. To reduce the cost on collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e., improve the classifier the most if they were labeled. Many active learning techniques have been proposed for text categorization, […]
Learning Bregman Distance Functions for Semi-Supervised Clustering
Learning distance functions with side information plays a key role in many data mining applications. Conventional distance metric learning approaches often assume that the target distance function is represented in some form of Mahalanobis distance. These approaches usually work well when data are in low dimensionality, but often become computationally expensive or even infeasible when […]
Learning a Propagable Graph for Semisupervised Learning: Classification and Regression
we present a novel framework, called learning by propagability, for two essential data mining tasks, i.e., classification and regression. The whole learning process is driven by the philosophy that the data labels and the optimal feature representation jointly constitute a harmonic system, where the data labels are invariant with respect to the propagation on the […]
Labeling Dynamic XML Documents: An Order-Centric Approach
Dynamic XML labeling schemes have important applications in XML Database Management Systems. In this paper, we explore dynamic XML labeling schemes from a novel order-centric perspective. We compare the various labeling schemes proposed in the literature with a special focus on their orders of labels. We show that the order of labels fundamentally impacts the […]
Incremental Information Extraction Using Relational Databases
Information extraction systems are traditionally implemented as a pipeline of special-purpose processing modules targeting the extraction of a particular kind of information. A major drawback of such an approach is that whenever a new extraction goal emerges or a module is improved, extraction has to be reapplied from scratch to the entire text corpus even […]
Identifying Evolving Groups in Dynamic Multimode Networks
A multimode network consists of heterogeneous types of actors with various interactions occurring between them. Identifying communities in a multimode network can help understand the structural properties of the network, address the data shortage and unbalanced problems, and assist tasks like targeted marketing and finding influential actors within or between groups. In general, a network […]
Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis
Preparing a data set for analysis is generally the most time consuming task in a data mining project, requiring many complex SQL queries, joining tables, and aggregating columns. Existing SQL aggregations have limitations to prepare data sets because they return one column per aggregated group. In general, a significant manual effort is required to build […]
Holistic Top-k Simple Shortest Path Join in Graphs
Motivated by the needs such as group relationship analysis, this paper introduces a new operation on graphs, named top-k path join, which discovers the top-k simple shortest paths between two given node sets. Rather than discovering the top-k simple paths between each node pair, this paper proposes a holistic join method which answers the top-k […]
Fuzzy Orders-of-Magnitude-Based Link Analysis for Qualitative Alias Detection
Alias detection has been the significant subject being extensively studied for several domain applications, especially intelligence data analysis. Many preliminary methods rely on text-based measures, which are ineffective with false descriptions of terrorists’ name, date-of-birth, and address. This barrier may be overcome through link information presented in relationships among objects of interests. Several numerical link-based […]
Fractal-Based Intrinsic Dimension Estimation and Its Application in Dimensionality Reduction
Dimensionality reduction is an important step in knowledge discovery in databases. Intrinsic dimension indicates the number of variables necessary to describe a data set. Two methods, box-counting dimension and correlation dimension, are commonly used for intrinsic dimension estimation. However, the robustness of these two methods has not been rigorously studied. This paper demonstrates that correlation […]
- « Previous Page
- 1
- …
- 80
- 81
- 82
- 83
- 84
- …
- 108
- Next Page »