Tuesday, May 9, 2017

Text classification using natural language processing through python NLTK and Redis - TECHEPI

What is natural language processing ?

Natural language processing (NLP) is an approach to making computer programs process human language, including speech, the way people do. It is a branch of artificial intelligence (AI) concerned with analyzing, understanding, and generating text and speech. In other words, NLP enables machines to understand human language and extract meaning from it.
NLP systems can automatically learn the rules they need to analyze a body of text or speech.
“One of the most compelling ways NLP offers valuable intelligence is by tracking sentiment — the tone of a written message (tweet, Facebook update, etc.) — and tag that text as positive, negative or neutral,” Rehling said
Besides Facebook, Google, Twitter, and IBM, many startups provide business solutions built on NLP:
  • Recorded Future (cyber security)
  • Quid (strategy)
  • Narrative Science (journalism)
  • (intent classification, acquired by FB)
  • (scheduling)
  • Kensho (finance)
  • Predata (open intelligence)
  • Lattice (sales and marketing)
  • AlchemyAPI (NLP APIs, acquired by IBM)
  • Basis (NLP APIs)
NLP business applications today include the following:
  • Machine translation
  • Text classification
  • Text summarization
  • Chat Bot
  • Sentence segmentation
  • Customer service
  • Reputation monitoring
  • Ad placement
  • Market intelligence
  • Regulatory compliance
Stanford NLP (Java) and NLTK (Python) are two major open-source libraries for natural language processing. This post explains NLTK.

Install the NLTK library and its dependencies using pip

pip install -U nltk 
pip install -U numpy

Install the Redis client library for Python

pip install redis

Identifying whether a tweet is positive or negative using text classification

Here are some positive tweets for training:
I love this car.
This view is amazing.
I feel great this morning.
I am so excited about the concert.
He is my best friend.
Here are some negative tweets for training:
I do not like this car.
This view is horrible.
I feel tired this morning.
I am not looking forward to the concert.
He is my enemy.
I want to test whether the tweets below are positive or negative:
I like this amazing car. (expected: positive)
My house is not great. (expected: negative)

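How does naive Bayes classification work internally?

Naive Bayes classification uses two formulas.

\[ P(x_k \mid V_j) = \frac{n_k + a}{n_j + a \cdot |\text{Vocabulary}|} \]

where a = 1, P(x_k | V_j) is the probability of word k given the class V_j (+ or -), n_j is the total number of words in the + or - tweets, n_k is the number of times word k occurs in the + or - tweets, and |Vocabulary| is the number of unique words across all training tweets.

\[ V_{NB} = \arg\max_{V_j \in \{+,-\}} P(V_j) \prod_{k} P(x_k \mid V_j) \]

V_NB is the class chosen by naive Bayes.
P(V_j) is the prior probability of a positive or negative tweet.
P(V_j) = Total number of positive or negative tweets / Total tweets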
Let's see how it works.
Build the list of unique words from the first two positive tweets and the first negative tweet above:
<"i love this car view is amazing do not like">
Convert every tweet into a feature set. The vocabulary has 10 unique words, the two positive tweets contain 8 words in total, and the negative tweet contains 6 words.
Calculate P(+) = 2/3 = .666666667
Calculate P(-) = 1/3 = .333333333
P(i|+) = (1+1)/(8+10) = .111111111
P(love|+) = (1+1)/(8+10) = .111111111
P(this|+) = (2+1)/(8+10) = .166666667 ("this" occurs in both positive tweets)
P(car|+) = (1+1)/(8+10) = .111111111
P(view|+) = (1+1)/(8+10) = .111111111
P(is|+) = (1+1)/(8+10) = .111111111
P(amazing|+) = (1+1)/(8+10) = .111111111
P(do|+) = (0+1)/(8+10) = .055555556
P(not|+) = (0+1)/(8+10) = .055555556
P(like|+) = (0+1)/(8+10) = .055555556
P(i|-) = (1+1)/(6+10) = .125
P(love|-) = (0+1)/(6+10) = .0625
P(this|-) = (1+1)/(6+10) = .125
P(car|-) = (1+1)/(6+10) = .125
P(view|-) = (0+1)/(6+10) = .0625
P(is|-) = (0+1)/(6+10) = .0625
P(amazing|-) = (0+1)/(6+10) = .0625
P(do|-) = (1+1)/(6+10) = .125
P(not|-) = (1+1)/(6+10) = .125
P(like|-) = (1+1)/(6+10) = .125
Now test whether "I like this amazing car" is positive or negative.
Vj for +ive = P(+) * P(i|+) * P(like|+) * P(this|+) * P(amazing|+) * P(car|+)
= .666666667 * .111111111 * .055555556 * .166666667 * .111111111 * .111111111
= 0.000008468
Vj for -ive = P(-) * P(i|-) * P(like|-) * P(this|-) * P(amazing|-) * P(car|-)
= .333333333 * .125 * .125 * .125 * .0625 * .125
= 0.000005086
The value is greater for the positive class, so the tweet is classified as positive.
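
The hand calculation can be reproduced with a few lines of plain Python. This is only a sketch under stated assumptions: the names train, word_prob, and score are made up for this illustration, words are split on whitespace, and every occurrence of a word is counted (so "this" has n_k = 2 in the positive class). It is not how NLTK computes its probabilities internally.

```python
from collections import Counter

# Training data: the first two positive tweets and the first negative tweet.
training = [
    ("i love this car", "+"),
    ("this view is amazing", "+"),
    ("i do not like this car", "-"),
]

ALPHA = 1  # smoothing constant a from the formula


def train(samples):
    """Count words per class and collect the shared vocabulary."""
    word_counts = {}        # label -> Counter of word occurrences (n_k)
    doc_counts = Counter()  # label -> number of tweets with that label
    for text, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary


def word_prob(word, label, word_counts, vocabulary):
    """P(word | label) = (n_k + a) / (n_j + a * |Vocabulary|)."""
    n_k = word_counts[label][word]
    n_j = sum(word_counts[label].values())
    return (n_k + ALPHA) / (n_j + ALPHA * len(vocabulary))


def score(text, label, word_counts, doc_counts, vocabulary):
    """P(label) multiplied by P(word | label) for every word in text."""
    result = doc_counts[label] / sum(doc_counts.values())
    for word in text.split():
        result *= word_prob(word, label, word_counts, vocabulary)
    return result


word_counts, doc_counts, vocabulary = train(training)
tweet = "i like this amazing car"
scores = {label: score(tweet, label, word_counts, doc_counts, vocabulary)
          for label in doc_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The class with the larger score wins, which is the argmax step of the second formula.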

Steps to classify text using NLTK and Redis

Step 1 - Read tweets from files and convert them into lists

The read_file function reads the positive and negative tweet files. Each file is then split into a list with one tweet per line, and every tweet is paired with its label.
def read_file(file_list):
    # Return the full text of each file in file_list.
    a_list = []
    for a_file in file_list:
        with open(a_file, 'r') as f:
            a_list.append(f.read())
    return a_list

# One tweet per line in each file.
for x in read_file(['positive_tweets']):
    positive = [content for content in x.splitlines()]
for x in read_file(['negative_tweets']):
    negative = [content for content in x.splitlines()]

all_contents = [(content, 'positive') for content in positive]
all_contents += [(content, 'negative') for content in negative]

Step 2 – Feature extractor

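The feature extractor splits each sentence into words and builds a feature set from them. With the 'bow' setting, each word maps to the number of times it occurs; otherwise each word maps to True, indicating only that the document contains it.
from collections import Counter
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Needs NLTK tokenizer and WordNet data, e.g. nltk.download('punkt') and
# nltk.download('wordnet'); resource names can vary by NLTK version.
def word_extractor(sentence):
    # Tokenize, lowercase, and lemmatize every word.
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

def feature_extractor(text, setting):
    if setting == 'bow':
        return {word: count for word, count in Counter(word_extractor(text)).items()}
    return {word: True for word in word_extractor(text)}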
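
To see what the two settings produce without installing any NLTK data, here is a small stand-alone sketch. It swaps the NLTK tokenizer and lemmatizer for a simple regular expression, so simple_words and simple_features are hypothetical stand-ins, not the functions above, and their output can differ from NLTK's on real text.

```python
import re
from collections import Counter


def simple_words(sentence):
    """Stand-in tokenizer (assumption): lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def simple_features(text, setting):
    """Mirror the two modes of feature_extractor on top of simple_words."""
    words = simple_words(text)
    if setting == 'bow':
        return dict(Counter(words))        # word -> number of occurrences
    return {word: True for word in words}  # word -> presence flag


sample = "This car is my car."
bow = simple_features(sample, 'bow')
presence = simple_features(sample, '')
print(bow)
print(presence)
```

The 'bow' dictionary records that "car" appears twice, while the presence dictionary only records that it appears. Whichever setting is chosen should be used for both training and classification so the feature values line up.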

Step 3 – Store all extracted features in Redis

Why store training data in Redis?
With a small training set, feature extraction takes very little time, but the time grows with the size of the training data. If you want to classify a tweet in real time against 100,000 positive and negative training tweets, that is not feasible, because processing all 100,000 tweets takes roughly 10 to 30 minutes.
The solution is to store the processed training data in a file or in Redis.
Every tweet is passed through the feature extractor, and the results are stored in Redis in chunks of 10,000, each chunk being a list of (features, label) tuples.
import redis

content_count = 0
all_features = []
key_count = 1

r = redis.StrictRedis(host='localhost', port=6379, db=0)

for (content, label) in all_contents:
    content_count += 1
    all_features.append((feature_extractor(content, ''), label))
    if content_count == 10000:
        # Redis values must be strings or bytes, so store the list's repr.
        r.set('train_tweets_' + str(key_count), repr(all_features))
        print(str(key_count) + " created successfully!")
        all_features = []
        content_count = 0
        key_count += 1

# Store whatever is left over in the final, partial chunk.
if all_features:
    r.set('train_tweets_' + str(key_count), repr(all_features))
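
The chunking logic can be tried without a running Redis server. The sketch below is an illustration under assumptions: a plain dictionary (fake_redis) stands in for Redis, JSON is used as the serialization format instead of the Python repr shown in this post, and chunked, store_chunks, and load_chunks are hypothetical helper names.

```python
import json


def chunked(items, size):
    """Yield consecutive slices of items, each holding at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def store_chunks(store, features, size, prefix='train_tweets_'):
    """Write each chunk under prefix + running number and return the keys used."""
    keys = []
    for number, chunk in enumerate(chunked(features, size), start=1):
        key = prefix + str(number)
        store[key] = json.dumps(chunk)  # serialize to a string before storing
        keys.append(key)
    return keys


def load_chunks(store, keys):
    """Read the chunks back and rebuild the (features, label) tuples."""
    features = []
    for key in keys:
        features += [(feats, label) for feats, label in json.loads(store[key])]
    return features


fake_redis = {}  # stand-in for a Redis connection (assumption)
data = [({'word%d' % i: True}, 'positive' if i % 2 else 'negative')
        for i in range(25)]
keys = store_chunks(fake_redis, data, size=10)
restored = load_chunks(fake_redis, keys)
print(keys, len(restored))
```

Returning the list of keys from the write step means the read step does not have to guess how many chunks exist.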

Step 4 – Read data from Redis and train

Sample of the data stored in Redis:
"[({u'i': True, u'feel': True, u'morning': True, u'this': True, u'tired': True}, 'negative'), ({u'do': True, u'like': True, u'i': True, u'car': True, u'this': True, u'not': True}, 'negative'), ({u'this': True, u'is': True, u'horrible': True, u'view': True}, 'negative'), ({u'this': True, u'is': True, u'amazing': True, u'view': True}, 'positive'), ({u'enemy': True, u'is': True, u'my': True, u'he': True}, 'negative'), ({u'concert': True, u'i': True, u'am': True, u'forward': True, u'looking': True, u'to': True, u'not': True, u'the': True}, 'negative'), ({u'is': True, u'my': True, u'friend': True, u'best': True, u'he': True}, 'positive'), ({u'i': True, u'this': True, u'love': True, u'car': True}, 'positive'), ({u'i': True, u'feel': True, u'great': True, u'this': True, u'morning': True}, 'positive'), ({u'about': True, u'concert': True, u'i': True, u'am': True, u'so': True, u'the': True, u'excited': True}, 'positive')]"
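Read every chunk of training data back from Redis and train a Naive Bayes classifier on the combined feature sets.
import ast
from nltk import NaiveBayesClassifier

def train(train_set):
    classifier = NaiveBayesClassifier.train(train_set)
    return classifier

all_features = []
get_keys = [1, 2, 3, 4, 5]  # chunk numbers written in Step 3; adjust to your data
for key in get_keys:
    raw = r.get('train_tweets_' + str(key))
    if raw is None:
        continue  # this chunk was never written
    # literal_eval parses the stored list without running arbitrary code, unlike eval.
    all_features += ast.literal_eval(raw.decode('utf-8'))

classifier = train(all_features)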
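
Turning the stored string back into Python objects is the delicate part of this step. As a hedged sketch, the standard library's ast.literal_eval can parse a value like the sample output above while refusing anything that is not a plain literal. The stored value and the parse_features helper here are made up for illustration.

```python
import ast

# A value shaped like the sample output above, as bytes the way a Redis client returns it.
stored = b"[({'i': True, 'love': True, 'this': True, 'car': True}, 'positive')]"


def parse_features(raw):
    """Decode bytes if needed, then parse Python literals only."""
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    return ast.literal_eval(text)


features = parse_features(stored)
print(features)

# A string containing a function call is rejected instead of being executed.
try:
    parse_features("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because the data comes from an external store, parsing only literals avoids executing whatever happens to be saved under the key.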

Step 5 – Classify tweet

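Now test whether a tweet is positive or negative using the evaluate function. The test tweet is labeled as positive, and evaluate returns the classifier's accuracy on that one-item test set: 1.0 if the classifier also predicts positive, 0.0 if it predicts negative. This value is an accuracy rather than a probability, but an accuracy greater than .5 still means the tweet was classified as positive; otherwise it was classified as negative.
from nltk import classify

def evaluate(train_set, test_set, classifier):
    # train_set is unused; it is kept for the original call signature.
    return classify.accuracy(classifier, test_set)

test_tweet = 'I feel happy this morning'
test_contents = [(test_tweet, 'positive')]
# Use the same feature setting ('') that was used for the training data in Step 3.
test_set = [(feature_extractor(content, ''), label) for (content, label) in test_contents]
accuracy = evaluate(all_features, test_set, classifier)
If the accuracy for the tweet is greater than .5, the test tweet is marked as positive.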
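
If a per-class probability is wanted rather than a yes/no answer, NLTK's NaiveBayesClassifier also offers prob_classify. The following is a minimal sketch, assuming NLTK is installed; presence_features is a hypothetical whitespace-splitting stand-in for the feature extractor above, so no NLTK data download is needed. NLTK applies its own smoothing, so the numbers need not match a hand calculation.

```python
from nltk.classify import NaiveBayesClassifier


def presence_features(text):
    """Stand-in extractor (assumption): lowercase and split on whitespace."""
    return {word: True for word in text.lower().split()}


train_tweets = [
    ("I love this car", "positive"),
    ("This view is amazing", "positive"),
    ("I feel great this morning", "positive"),
    ("I am so excited about the concert", "positive"),
    ("He is my best friend", "positive"),
    ("I do not like this car", "negative"),
    ("This view is horrible", "negative"),
    ("I feel tired this morning", "negative"),
    ("I am not looking forward to the concert", "negative"),
    ("He is my enemy", "negative"),
]
train_set = [(presence_features(text), label) for text, label in train_tweets]
demo_classifier = NaiveBayesClassifier.train(train_set)

# prob_classify returns a probability distribution over the labels.
dist = demo_classifier.prob_classify(presence_features("I like this amazing car"))
predicted = dist.max()
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
print("predicted:", predicted)
```

With only ten training tweets the probabilities can sit close to each other, so the predicted label should be read with caution.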

Wrapping Up

Natural language processing is straightforward to implement with the NLTK library, which provides many tools for NLP. NLTK also includes a wrapper for scikit-learn, so you can try other machine learning algorithms for better accuracy.
