# Mining and Summarizing Customer Reviews

Mining and Summarizing Customer Reviews – Hu and Liu 2004

This is the third of the three ‘test-of-time’ award winners from KDD’15. From the awards page:

The paper introduces the problem of summarizing customer reviews and decomposes the problem into the three steps of (1) mining product features (aspects), (2) identifying opinion sentences and their corresponding feature in each review and (3) summarizing the results. The paper has inspired the new research direction of Aspect-Based Sentiment Analysis/Aspect-Based Opinion Mining, and the proposed framework has been widely adopted in research and applications, as seen from the very large number of citations.

The goal is to mine an existing corpus of product reviews and produce summaries of the form:

    Digital Camera XYZ:
      Feature: Picture Quality
        Positive: 253
          "Overall this is a good camera with a really good picture clarity"
          ... other review sentences
        Negative: 6
          "The pictures come out hazy if your hands shake even for a moment
          during the entire process of taking a picture"
          ... other review sentences
      Feature: Size
        Positive: 134
          ... review sentences
        Negative: 10
          ... review sentences
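
To make the target output concrete, here is a minimal sketch of the final summarization step. It assumes the earlier stages have already produced (feature, orientation, sentence) triples; the function names `summarize` and `render` and the third sample sentence are made up for illustration and do not come from the paper.

```python
from collections import defaultdict


def summarize(opinions):
    """Group (feature, orientation, sentence) triples by feature and orientation."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, orientation, sentence in opinions:
        summary[feature][orientation].append(sentence)
    return dict(summary)


def render(product, summary):
    """Format the grouped opinions in the style of the summary shown above."""
    lines = [f"{product}:"]
    for feature, groups in summary.items():
        lines.append(f"  Feature: {feature}")
        for label in ("positive", "negative"):
            sentences = groups[label]
            lines.append(f"    {label.capitalize()}: {len(sentences)}")
            lines.extend(f'      "{s}"' for s in sentences)
    return "\n".join(lines)


# Hypothetical input: in the real pipeline these triples come from the
# feature-extraction and opinion-orientation stages.
opinions = [
    ("picture quality", "positive",
     "Overall this is a good camera with a really good picture clarity"),
    ("picture quality", "negative",
     "The pictures come out hazy if your hands shake"),
    ("size", "positive", "It is small enough to carry everywhere"),
]
summary = summarize(opinions)
print(render("Digital Camera XYZ", summary))
```

The counts in the summary are simply the lengths of the sentence lists, so the supporting sentences for each count stay attached to it.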


After crawling reviews, the first step in the process is Part-of-Speech (POS) tagging, which labels each word as a noun, verb, adverb, adjective, and so on. This matters because product features are usually nouns or noun phrases. Some pre-processing of words (removal of stop words, stemming, and fuzzy matching to deal with word variants and misspellings) is also performed.
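
As a rough sketch of this step, the snippet below turns one already-tagged sentence into a set of candidate nouns. It assumes Penn Treebank-style tags (`NN`, `NNS`, ...) from some upstream tagger, a toy stop-word list, and a crude plural-stripping rule standing in for real stemming and fuzzy matching; none of these specifics come from the paper.

```python
# Toy stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "this", "it"}


def normalize(word):
    """Lower-case and strip a trailing plural 's' (a crude stand-in for stemming)."""
    word = word.lower()
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def noun_transaction(tagged_sentence):
    """Return the set of normalized nouns found in one POS-tagged sentence."""
    return {
        normalize(word)
        for word, tag in tagged_sentence
        if tag.startswith("NN") and word.lower() not in STOP_WORDS
    }


# Assumed tagger output for "The pictures are very clear".
tagged = [("The", "DT"), ("pictures", "NNS"), ("are", "VBP"),
          ("very", "RB"), ("clear", "JJ")]
print(noun_transaction(tagged))  # {'picture'}
```

Each sentence becomes one "transaction" of nouns, which is the shape of input a frequent-itemset miner expects.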

Next comes feature extraction from sentences.

Since our system aims to find what people like and dislike about a given product, how to find the product features that people talk about is the crucial step. However, due to the difficulty of natural language understanding, some types of sentences are hard to deal with.

“The pictures are very clear” is an example of an ‘easy’ sentence (clearly about pictures). The sentence “while light, it will not easily fit in pockets”, by contrast, is about the product features of size and weight, yet neither ‘size’ nor ‘weight’ appears anywhere in it. For the time being, the authors choose to ignore this latter kind of sentence and focus on those where features appear explicitly as nouns and noun phrases.

Association mining is then used to find frequent features – those talked about by many customers.

The main reason for using association mining is because of the following observation. It is common that a customer review contains many things that are not directly related to product features. Different customers usually have different stories. However, when they comment on product features, the words that they use converge. Thus using association mining to find frequent itemsets is appropriate because those frequent itemsets are likely to be product features. Those noun/noun phrases that are infrequent are likely to be non-product features.
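
The idea can be sketched with plain support counting. The function below is a brute-force illustration rather than a full Apriori implementation: it treats the nouns of each sentence as a transaction, counts every itemset up to `max_size` words, and keeps those that appear in at least a `min_support` fraction of sentences. The threshold values and sample data are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Count itemsets of up to max_size nouns and keep the frequent ones."""
    counts = Counter()
    for nouns in transactions:
        for size in range(1, max_size + 1):
            # Sorting gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(nouns), size):
                counts[itemset] += 1
    threshold = min_support * len(transactions)
    return {itemset: n for itemset, n in counts.items() if n >= threshold}


# Hypothetical noun transactions, one per review sentence.
transactions = [
    {"battery", "life"},
    {"battery", "life", "camera"},
    {"picture"},
    {"picture", "vacation"},
]
frequent = frequent_itemsets(transactions)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

Here 'vacation' and 'camera' fall below the threshold, which mirrors the observation that one-off story words are unlikely to be product features.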

The candidate frequent itemsets are then pruned. Association mining treats an itemset as an unordered group of words that occur together in a sentence, but in a real feature phrase word order matters, so compactness pruning removes candidate features whose words do not appear together in a specific order. Redundancy pruning removes redundant features that contain single words:

To describe the meaning of redundant features, we use the concept of p-support (pure support). p-support of feature ftr is the number of sentences that ftr appears in as a noun or noun phrase, and these sentences must contain no feature phrase that is a superset of ftr. We use a minimum p-support value to prune those redundant features. If a feature has a p-support lower than the minimum p-support (in our system, we set it to 3) and the feature is a subset of another feature phrase (which suggests that the feature alone may not be interesting), it is pruned. For instance, ‘life’ by itself is not a useful feature while ‘battery life’ is a meaningful feature phrase.
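
A minimal sketch of the p-support rule follows. It assumes each sentence has been reduced to the set of nouns it contains and that a feature "appears" in a sentence when all of its words are present; both are simplifications of the noun-phrase matching described in the quote. The function names and sample data are hypothetical.

```python
def p_support(feature, features, sentences):
    """Count sentences containing `feature` but none of its superset features."""
    supersets = [f for f in features if set(feature) < set(f)]
    count = 0
    for words in sentences:
        contains_feature = set(feature) <= words
        contains_superset = any(set(s) <= words for s in supersets)
        if contains_feature and not contains_superset:
            count += 1
    return count


def redundancy_prune(features, sentences, min_p_support=3):
    """Drop features that are subsets of another feature and have low p-support."""
    kept = []
    for feature in features:
        has_superset = any(set(feature) < set(f) for f in features)
        if has_superset and p_support(feature, features, sentences) < min_p_support:
            continue
        kept.append(feature)
    return kept


# Hypothetical candidate features and per-sentence noun sets.
features = [("life",), ("battery",), ("battery", "life")]
sentences = [
    {"battery", "life"},
    {"battery", "life"},
    {"battery", "life", "camera"},
    {"battery"},
    {"battery", "charger"},
    {"battery", "price"},
    {"life"},
]
print(redundancy_prune(features, sentences))
```

In this toy data 'life' occurs on its own only once, so it is pruned, while 'battery' occurs alone three times and survives alongside 'battery life'.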

The next stage is to identify opinion words and their sentiments. From the POS tagging, we know that adjectives are likely to be opinion words. Sentences with one or more product features and one or more opinion words are opinion sentences. For each feature in the sentence, the nearest opinion word is recorded as the effective opinion of the feature in the sentence.
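
The nearest-opinion-word rule can be sketched as follows, assuming a POS-tagged sentence, single-word features, and distance measured in token positions. The `JJ` tag prefix for adjectives follows Penn Treebank conventions; the function name and example sentence are invented for illustration.

```python
def effective_opinions(tagged_sentence, features):
    """Map each feature word in the sentence to its nearest adjective."""
    adjectives = [
        (i, word.lower())
        for i, (word, tag) in enumerate(tagged_sentence)
        if tag.startswith("JJ")
    ]
    result = {}
    for i, (word, _tag) in enumerate(tagged_sentence):
        word = word.lower()
        if word in features and adjectives:
            # Pick the adjective with the smallest token distance to the feature.
            _, nearest = min(adjectives, key=lambda pair: abs(pair[0] - i))
            result[word] = nearest
    return result


# Assumed tagger output for a made-up review sentence.
tagged = [("The", "DT"), ("lens", "NN"), ("is", "VBZ"), ("sharp", "JJ"),
          ("but", "CC"), ("the", "DT"), ("heavy", "JJ"), ("battery", "NN"),
          ("drains", "VBZ"), ("fast", "RB")]
print(effective_opinions(tagged, {"lens", "battery"}))
```

A sentence with no adjectives yields an empty mapping, which corresponds to it not being an opinion sentence under the definition above.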

To predict the semantic orientation of adjectives (whether they are positive or negative), the authors use the adjective synonym and antonym sets in WordNet.
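
One way to picture this is as label propagation over a synonym/antonym graph: synonyms share an orientation and antonyms take the opposite one. The sketch below assumes a small seed set of adjectives with known orientation and uses hand-written toy dictionaries in place of WordNet lookups; the seed words, dictionaries, and function name are all illustrative assumptions.

```python
from collections import deque


def infer_orientations(seeds, synonyms, antonyms):
    """Propagate +1/-1 orientations from seed adjectives to linked adjectives."""
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        sign = orientation[word]
        # Synonyms inherit the same orientation.
        for other in synonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = sign
                queue.append(other)
        # Antonyms get the opposite orientation.
        for other in antonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = -sign
                queue.append(other)
    return orientation


# Toy stand-ins for WordNet synonym and antonym sets.
seeds = {"good": 1, "bad": -1}
synonyms = {"good": ["great"], "great": ["superb"], "bad": ["poor"]}
antonyms = {"superb": ["awful"]}
orientations = infer_orientations(seeds, synonyms, antonyms)
print(orientations)
```

Adjectives that cannot be reached from any seed stay unlabeled in this sketch, so the coverage depends on how well connected the seed words are.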