March 31, 2009

Latent Dirichlet allocation

"Latent Dirichlet allocation," D. Blei, A. Ng, and M. Jordan. . Journal of Machine Learning Research, 3:993–1022, January 2003

Latent Dirichlet allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic generates words through a fixed conditional distribution, and words and topics are infinitely exchangeable within a document. Therefore, compared with other latent topic models, LDA overcomes the limiting one-topic-per-document assumption of the mixture of unigrams and the overfitting problem of pLSI by treating the topic mixture weights as a k-parameter hidden random variable, which yields a smooth distribution on the topic simplex. Besides, LDA finds the optimal variational parameters by minimizing the KL divergence between the variational distribution and the true posterior, and applies a variational EM algorithm to obtain approximate empirical Bayes estimates of the parameters alpha and beta. In conclusion, LDA is a simple model for dimensionality reduction, and its modularity and extensibility invite further applications.
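As a concrete illustration, here is a minimal sketch of fitting LDA with scikit-learn's variational-Bayes implementation; the toy corpus, the choice of two topics, and the parameter values are my own assumptions, not from the paper.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (hypothetical); LDA works on word counts per document.
docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock prices fell sharply", "the market rallied today"]
counts = CountVectorizer().fit_transform(docs)

# learning_method="batch" runs variational EM, as in the paper.
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0)
theta = lda.fit_transform(counts)  # per-document topic mixtures
beta = lda.components_             # per-topic word weights (unnormalized)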

Probabilistic Latent Semantic Indexing

“Probabilistic Latent Semantic Indexing”, Thomas Hofmann, SIGIR, 1999

Probabilistic Latent Semantic Analysis (PLSA) is a novel method for automated indexing based on the likelihood principle. PLSA defines a statistical latent class model, called the aspect model, fitted with a tempered EM algorithm. In contrast to LSA, which determines the optimal decomposition by the L2-norm, PLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model, which has a clear probabilistic meaning in terms of mixture component distributions. PLSA succeeds in dealing with the potential imprecision of user queries by handling synonyms and polysemous words.
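To make the aspect model concrete, here is a toy numpy sketch of plain EM for P(d, w) = sum_z P(z) P(d|z) P(w|z); note that the paper's tempered EM additionally raises the E-step posterior to a power to control overfitting, which this sketch omits, and all names and sizes here are my own.

import numpy as np

def plsa(N, k, iters=50, rng=np.random.default_rng(0)):
    # N: (num_docs, num_words) document-term count matrix.
    D, W = N.shape
    pz = np.full(k, 1.0 / k)
    pd_z = rng.random((k, D)); pd_z /= pd_z.sum(axis=1, keepdims=True)
    pw_z = rng.random((k, W)); pw_z /= pw_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(z|d,w), shape (k, D, W).
        joint = pz[:, None, None] * pd_z[:, :, None] * pw_z[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: reweight the posterior by the observed counts N(d, w).
        nz = post * N[None, :, :]
        pz = nz.sum(axis=(1, 2)); pz /= pz.sum()
        pd_z = nz.sum(axis=2); pd_z /= pd_z.sum(axis=1, keepdims=True)
        pw_z = nz.sum(axis=1); pw_z /= pw_z.sum(axis=1, keepdims=True)
    return pz, pd_z, pw_z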

March 27, 2009

Contour and Texture Analysis for Image Segmentation

Title: Contour and Texture Analysis for Image Segmentation
Author: Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi
Publisher: IJCV
Date of Publication: Feb 23, 2001

Because texture provides a good local descriptor of image patches, this paper proposes a general algorithm for image segmentation based on texture. The authors apply a bank of filters to the input image and collect the filter responses at each pixel into a response vector; clustering these vectors with K-means yields prototype response vectors called textons, and each pixel is then mapped to exactly one texton. A descriptor that also accounts for orientation and scale is built by computing a windowed texton histogram over a local region. Finally, the cues of contour and texture differences are exploited together to achieve image segmentation.
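Here is a rough sketch of the texton pipeline, with Gaussian derivatives standing in for the paper's oriented filter bank; the scales, the filter choices, and the number of textons are illustrative assumptions of mine.

import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.cluster import KMeans

def texton_map(img, n_textons=16):
    # img: 2-D grayscale float array.
    # Build a small bank of responses at several scales.
    responses = []
    for sigma in (1, 2, 4):
        responses.append(gaussian_filter(img, sigma))                # blur
        responses.append(gaussian_filter(img, sigma, order=(0, 1)))  # d/dx
        responses.append(gaussian_filter(img, sigma, order=(1, 0)))  # d/dy
    # One response vector per pixel, then K-means to find the textons.
    vecs = np.stack(responses, axis=-1).reshape(-1, len(responses))
    labels = KMeans(n_clusters=n_textons, n_init=10,
                    random_state=0).fit_predict(vecs)
    return labels.reshape(img.shape)  # one texton id per pixel

Windowed histograms of these texton ids over local regions then serve as the texture descriptor used for segmentation.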

March 22, 2009

Shape Matching and Object Recognition Using Shape Contexts

Title: Shape Matching and Object Recognition Using Shape Contexts
Author: Serge Belongie, Jitendra Malik, Jan Puzincha
Publisher: IEEE
Month of Publication: April 2002

The paper proposes a stable and simple algorithm for finding correspondences between shapes. The authors introduce a shape descriptor, the shape context, and maximize similarity via bipartite graph matching. Demonstrations on 2D objects, e.g., handwritten digits, silhouettes, and trademarks, and on 3D objects from the Columbia COIL dataset show improved performance.

First, a rich local descriptor, the shape context, is proposed in order to make matching easier. It considers the set of vectors originating from a point to all other sample points on a shape. For a point on the shape, a coarse histogram of the relative coordinates of the remaining points is computed over bins that are uniform in log-polar space, so that nearby points receive more weight. The chi-square test statistic serves as the cost C for judging similarity. Second, when minimizing the cost of the bipartite graph matching, they obtain scale invariance by normalizing all radial distances and rotation invariance by aligning the relative frames with the tangent angle. Moreover, one can add “dummy” nodes for robust handling of outliers. Third, for modeling the transformation, they use the thin plate spline (TPS) model, which includes the affine model as a special case, so transformations can be estimated in a few iterations. Finally, shape distances are estimated as the weighted sum of three terms: the shape context distance, the image appearance distance, and the bending energy. Then a prototype-based approach is applied: a variant of K-means, K-medoids, selects an ideal exemplar for each category, and the query shape is classified based on the minimal cost.
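A condensed sketch of the descriptor and the matching cost follows, with scipy's Hungarian solver standing in for the paper's bipartite matching; the bin counts and bin edges are illustrative assumptions.

import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_contexts(pts, n_r=5, n_theta=12):
    # pts: (n, 2) sample points on one shape.
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    d = d / d[d > 0].mean()  # normalize by mean distance: scale invariance
    theta = np.arctan2(pts[:, None, 1] - pts[None, :, 1],
                       pts[:, None, 0] - pts[None, :, 0])
    # Log-polar bins: nearer points get finer radial resolution.
    r_bin = np.digitize(np.log(d + 1e-9), np.linspace(-2, 1, n_r - 1))
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    H = np.zeros((len(pts), n_r * n_theta))
    for i in range(len(pts)):
        for j in range(len(pts)):
            if i != j:
                H[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return H / H.sum(axis=1, keepdims=True)

def match(P, Q):
    h, g = shape_contexts(P), shape_contexts(Q)
    # Chi-square cost between every pair of histograms.
    C = 0.5 * ((h[:, None] - g[None, :]) ** 2 /
               (h[:, None] + g[None, :] + 1e-9)).sum(-1)
    return linear_sum_assignment(C)  # minimum-cost correspondence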

In conclusion, the method is able to retrieve objects whose shapes are similar to the query, and the performance is clearly improved.

In my opinion, it is thoughtful that they consider several key points and construct the distance weighting function from three terms. However, it is possible that more parameters may bias the query results. Perhaps they should report individual statistics for each term to convince us that the weighting function is really trustworthy. Besides, I think the modified K-means may be helpful in some cases. The outlier removal and the warped transformation seem useful for removing noise, judging from Figure 4.

March 13, 2009

Nonlinear Dimensionality Reduction by Locally Linear Embedding

Title: Nonlinear Dimensionality Reduction by Locally Linear Embedding
Author: S. T. Roweis and L. K. Saul
Year of Publication: 2000

The paper introduces an unsupervised learning algorithm, locally linear embedding (LLE). The steps of locally linear embedding are as follows (a short sketch in code follows the list).

1. Find the neighbors of each point that lie close to a locally linear patch of the manifold (kNN in this paper).
2. Reconstruct each data point from its neighbors by minimizing a constrained least-squares error.
3. Compute the low-dimensional embedding vectors by minimizing the embedding cost function.
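As a minimal sketch, scikit-learn's off-the-shelf implementation runs all three steps; the swiss-roll data, n_neighbors=12, and the 2-D target dimension are illustrative assumptions, not values from the paper.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12,  # step 1: kNN graph
                             n_components=2,  # target dimensionality
                             random_state=0)
Y = lle.fit_transform(X)  # steps 2-3: reconstruction weights + embedding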

In summary, unlike PCA or MDS, LLE succeeds in keeping nearby data points nearby in the low-dimensional coordinate system. LLE avoids solving large dynamic programming problems and tends to accumulate very sparse matrices whose structure can be exploited; consequently, LLE saves a lot of time and space in the computation.

In my opinion, LLE is easy to grasp because of its three simple steps. However, the first time I read the paper, I had no idea what LLE was. That is, the paper seems to weave many research results together while hiding the details, so it may expect readers to have the relevant background knowledge or to survey the references while reading.

March 9, 2009

Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection

Title: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection
Author: Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman
Year of Publication: 1997
Publisher: IEEE

In this paper, the authors propose Fisherfaces, a method for face recognition that is insensitive to large variations in illumination and facial expression. Compared with Eigenfaces, which uses PCA for dimensionality reduction and maximizes the total scatter across all classes, Fisherfaces maximizes the ratio of the between-class scatter to the within-class scatter. They compare four methods for recognition under variation in lighting and facial expression: correlation, a variant of the linear subspace method, the Eigenface method, and the Fisherface method.
(1) Correlation is the simplest method, but it needs training data under varying lighting and requires high time complexity and storage.
(2) Eigenfaces apply PCA and greatly reduce the time complexity. However, in maximizing the total scatter, PCA also retains the within-class scatter, which is unwanted information for face recognition. It has been suggested that discarding the three most significant principal components may reduce the effects of lighting variation, but doing so may also have unwanted consequences.
(3) The linear subspace algorithm takes the surface normal and the albedo of the surface into consideration. That is, the algorithm can recognize Lambertian surfaces easily and is insensitive to a wide range of lighting conditions. Nevertheless, it has to learn where the good regions for recognition are, and its computation and storage costs are higher than those of the Eigenface method.
(4) Fisherfaces use Fisher's Linear Discriminant, which is class specific. The approach maximizes the ratio of the between-class scatter to the within-class scatter, that is, it achieves greater between-class scatter while decreasing the within-class scatter (a sketch of the two-stage projection follows the list).
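A hypothetical two-stage sketch of the Fisherface recipe with scikit-learn: PCA first reduces the data to N − c dimensions so that the within-class scatter matrix is nonsingular, then Fisher's LDA projects to c − 1 dimensions. The function name and interface are my own.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fisherfaces(X, y, n_classes):
    # X: (n_images, n_pixels) flattened face images; y: identity labels.
    pca = PCA(n_components=len(X) - n_classes)  # avoid singular S_W
    lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
    Z = lda.fit_transform(pca.fit_transform(X), y)
    return pca, lda, Z  # projections plus the reduced training data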

In conclusion, the experiments on variation in lighting, facial expression, and glasses recognition show that Fisherfaces perform better than the other three methods. The method is based on a more reasonable dimensionality-reduction criterion, and it requires less computation because the original equation is modified with PCA. Also, it does not need as much storage as the Linear Subspace method. What could be improved is how to deal with extreme lighting conditions and, perhaps, profile-view face recognition.

March 8, 2009

Eigenfaces for Recognition

Title: Eigenfaces for Recognition
Author: Matthew Turk and Alex Pentland
Year of Publication: 1991

Eigenfaces is an approach that decomposes face images into a small set of characteristic feature images and is based on information theory. In other words, the approach extracts the information contained in a collection of face images and uses the variation among these images to encode and compare individual face images for face recognition.

In mathematical terms, treat an image as a vector in a very high-dimensional space and regard the eigenvectors as a set of features that characterize the variation between face images. By using principal component analysis (PCA), it is possible to construct a lower-dimensional subspace of face images, called the “face space”. Each individual face can be represented as a linear combination of these vectors, which are referred to as “eigenfaces” because of their face-like appearance. Simply, the operations are as follows (a numpy sketch follows the list).

(1) Collect several face images for each person.
(2) Calculate the eigenfaces with the highest associated eigenvalues.
(3) For each known individual, project their face images onto the “face space”. Choose a threshold that defines the maximum allowable distance from any face class.
(4) Calculate the pattern vector for each new face image and its distances to each known class.
(5) If the input image is near face space, it is a face. If it is also near a known face class, it is recognized as that individual and may be added to the original set of similar face images to recalculate the eigenfaces; otherwise, it is unknown and may be used to start a new face class.
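A bare-bones numpy sketch of the training and matching steps above; the thresholding is simplified, the “near face space” test is omitted, and all names are my own.

import numpy as np

def train(faces, k):
    # faces: (n, n_pixels) flattened training images.
    mean = faces.mean(axis=0)
    A = faces - mean
    # Eigenvectors of the small n x n matrix A A^T, as in the paper.
    vals, V = np.linalg.eigh(A @ A.T)
    U = A.T @ V[:, -k:]              # map the top-k back to pixel space
    U /= np.linalg.norm(U, axis=0)
    return mean, U                   # k eigenfaces

def project(img, mean, U):
    return U.T @ (img - mean)        # pattern vector (weights)

def classify(img, mean, U, class_vecs, threshold):
    # class_vecs: (n_classes, k) pattern vectors of known individuals.
    w = project(img, mean, U)
    dists = np.linalg.norm(class_vecs - w, axis=1)
    j = dists.argmin()
    return j if dists[j] < threshold else None  # None = unknown face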

Moreover, it is possible to detect motion after filtering and rescaling the input image appropriately. Calculating the orientation of the motion of the head, or using simple symmetry operators, can benefit the recognition of rotated faces.

In my opinion, the approach of applying PCA to reduce the dimensionality is not very difficult, and it is really a good way to construct the eigenfaces. That is, face recognition is nothing but extracting the features of face images and deciding efficiently whether an image is a known face, and this approach reaches that goal. However, varying backgrounds, the scale of the input images, and illumination still affect the recognition result a lot. The training set also has a significant effect on the precision of the recognition. How to decide the tradeoff between the number of retained dimensions and the precision rate is another problem.

March 3, 2009

Scale & Affine Invariant Interest Point Detectors

Title:Scale & Affine Invariant Interest Point Detectors
Author: KRYSTIAN MIKOLAJCZYK AND CORDELIA SCHMID
Date of Publication: January 22, 2004

The paper describes two approaches to scale- and affine-invariant interest point detection. The scale-invariant detector, the Harris-Laplace detector, combines the Harris detector with automatic scale selection; the algorithm involves multi-scale point detection and an iterative selection of the scale and the location. The affine-invariant detector is initialized by the multi-scale Harris detector; an integration scale and a differentiation scale are computed to obtain a shape matrix for each interest point, and the iterative procedure converges to a local structure.
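A schematic sketch of the scale-selection idea: keep a Harris corner only at the scale where the scale-normalized Laplacian is maximal over the tried scales. This approximates the paper's iterative procedure with a single pass; the scale list and the skimage-based corner detection are my own choices.

import numpy as np
from scipy.ndimage import gaussian_laplace
from skimage.feature import corner_harris, corner_peaks

def harris_laplace(img, sigmas=(1.2, 1.7, 2.4, 3.4, 4.8)):
    # img: 2-D grayscale float array.
    # Scale-normalized Laplacian-of-Gaussian at each scale.
    logs = [s**2 * gaussian_laplace(img, s) for s in sigmas]
    keypoints = []
    for i, s in enumerate(sigmas):
        pts = corner_peaks(corner_harris(img, sigma=s), min_distance=5)
        for y, x in pts:
            vals = [abs(L[y, x]) for L in logs]
            # Keep the point if the Laplacian peaks at this scale
            # (a global max over the scale list, not a local extremum).
            if vals[i] == max(vals):
                keypoints.append((y, x, s))
    return keypoints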

It is a pity that there is a trade-off between scale detection and affine detection because of the different behaviors of the two kinds of detectors.

Distinctive Image Features from Scale-Invariant Keypoints

Title: Distinctive Image Features from Scale-Invariant Keypoints
Author: David G. Lowe
Date of Publication: January 5, 2004

This paper describes an approach, the Scale Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features. There are several major stages of computation (a short usage sketch in code follows the four stages).

(1) The scale-space extrema detection stage searches over all scales and image locations and uses a cascade filtering approach to identify potential interest points that are invariant to scale and orientation. That is, compute the scale space of an image, a function produced by convolving a variable-scale Gaussian with the input image, to describe features across all possible scales; then use it to compute the difference-of-Gaussian (DoG) function and find the maxima and minima of the DoG, which yield the most stable image features.

(2) The keypoint localization stage fits a detailed model to determine the location and scale at each candidate location and selects keypoints based on measures of their stability.

(3) The orientation assignment stage assigns one or more orientations to each keypoint location based on local image gradient directions; this achieves invariance to image rotation.

(4) The keypoint descriptor stage measures the local image gradients at the selected scale in the region around each keypoint, in a representation that allows for significant levels of local shape distortion and change in illumination. In other words, a keypoint descriptor is created by computing the gradient magnitude and orientation at each image sample point in a region. After being weighted by a Gaussian window, the samples are accumulated into orientation histograms summarizing the contents of subregions. To reduce the effects of shift and illumination change, trilinear interpolation, vector normalization, and thresholding are applied.
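The four stages are packaged in OpenCV, so a usage sketch is short (assuming opencv-python >= 4.4, where SIFT lives in the main module, and a hypothetical image file):

import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each keypoint carries a location, scale, and orientation; each descriptor
# is the 128-dimensional orientation-histogram vector from stage (4).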

In the object recognition experiments, an approximation algorithm, Best-Bin-First (BBF), is used for efficient nearest-neighbor indexing to find the minimum Euclidean distance during matching. A Hough transform is applied to cluster features. After solving for the affine parameters by least squares, a model is accepted if its final probability is higher than a threshold.
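Continuing the sketch above, descriptors from two images can be matched with the nearest-neighbor distance-ratio test from the paper; OpenCV's brute-force matcher stands in here for BBF, which only approximates the exact nearest neighbor, and the file names are hypothetical.

import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Keep a match only if the best neighbor is clearly closer than the second.
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]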

Image Retrieval: Ideas, Influences, and Trends of the New Age

Title: Image Retrieval: Ideas, Influences, and Trends of the New Age
Authors: Ritendra Datta, Dhiraj Joshi, Jia Li, and James Z. Wang
Year of Publication: 2008
Publisher: ACM

Content-based image retrieval (CBIR) is a technology that helps organize digital picture archives by their visual content. There are two gaps, the sensory gap and the semantic gap, which define and motivate most of the related problems. The sensory gap is the gap between a real object and its descriptive information, and the semantic gap is the lack of coincidence between the information extractable from the visual data and the interpretation a user assigns to it. Therefore, bridging these gaps and satisfying users is the goal of CBIR.

The first important thing is to clarify the user-system interaction. The user perspective involves what the user wants and the form of the query, and the system perspective is about how to interact with the user. Simply, a human-centered system is required to serve different kinds of user intent with several query types, including keywords, free text, image, graphics, and composite queries. For example, a composite query method might provide a system involving gestures and speech for querying, or help users refine their queries with hints. If a system can collect manual tags for pictures, it not only facilitates text-based querying but also builds reliable training datasets. Moreover, how to design a retrieval system for portable devices, which have many constraints such as limited display size and color depth, is one of the issues in visualization.

The two core problems of CBIR are (1) how to define a mathematical description, or signature, of an image, and (2) how to decide the similarity between a pair of images.

For region-based visual signatures, the first step is image segmentation. With k-means clustering or the normalized-cut criterion, segmentation aids image understanding and allows several types of features to be extracted. A feature captures a specific property of an image, either globally for the entire image, which is faster to compute, or locally for a small group of pixels, which identifies important visual characteristics more specifically. Color features are usually summarized in a histogram. Texture features capture granularity and repetitive patterns. Shape is a key to specifying regions. Spatial modeling and matching concern local image entities. Interest points that can handle significant affine transformations and illumination changes are based on local invariants. When constructing signatures from features, histograms are easy but tend to be sparse in multidimensional spaces. A region-based signature allows the representative vectors to adapt to the image, and a region of coherent color and texture is likely to correspond to an object in the image.

There are three types of signatures: a single feature vector, a summary of local feature vectors, and a region-based signature. Each has its own appropriate similarity measures. For a single vector, using geodesic distances may be better. Summaries of local feature vectors, such as codebooks and probability density functions, are generated by vector quantization and compared with the KL divergence, respectively. A region-based signature can form a histogram, and the similarity is calculated from the pairwise distances between individual vectors. Further matching methods improve on the basic idea through region weights, speed, or segmentation.
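As the simplest possible instance of such a signature/similarity pairing, here is a toy global color-histogram signature with a chi-square distance; the bin count and the choice of chi-square are illustrative assumptions, not prescriptions from the survey.

import numpy as np

def color_signature(img, bins=8):
    # img: (H, W, 3) uint8 RGB array; one joint color histogram per image.
    h, _ = np.histogramdd(img.reshape(-1, 3),
                          bins=(bins,) * 3, range=[(0, 256)] * 3)
    h = h.ravel()
    return h / h.sum()

def chi2_distance(p, q, eps=1e-9):
    # Smaller distance = more similar signatures.
    return 0.5 * np.sum((p - q) ** 2 / (p + q + eps))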

For faster retrieval, clustering and classification are practical and useful. Classification is treated as a preprocessing step and improves accuracy, but it requires prior training data. Clustering helps visualization and retrieval efficiency, but the clusters may not be representative or accurate enough for visualization. Besides, in order to capture a user's precise needs, relevance feedback systems that perform iterative feedback and refinement have been designed. That is, relevance feedback lets users give feedback after querying, and the system learns case-specific query semantics dynamically from that feedback (a classic update rule is sketched below).
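One classic concrete instance is a Rocchio-style update, which moves the query signature toward images the user marked relevant and away from those marked non-relevant; the survey covers many richer learning schemes, so this is only an illustrative sketch with conventional weights of my choosing.

import numpy as np

def rocchio(query, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    # query: signature vector; relevant/nonrelevant: (n, d) arrays of
    # signatures the user labeled in the previous round.
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q  # refined query for the next retrieval round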

There are some offshoots of CBIR. First, automated annotation attempts automated concept discovery; going further, it can decide on an appropriate picture set for a given story. Second, ranking or similarity of images is usually based on size, color depth, or shape; however, aesthetics, which involves people's feelings and emotions, may be another, higher-level basis. Moreover, CBIR may be concerned with possible security attacks or image copy protection.

Finally, an evaluation benchmark for CBIR must satisfy some key requirements: coverage, unbiasedness, and user focus. Ideally, it should be subjective, context-specific, and community-based.