TEXT MINING TECHNIQUES RORY DUTHIE


OUTLINE An example text to extract information from. Techniques which can be used to extract that information. Libraries we can use. How to measure accuracy.

EXAMPLE TEXT Mr. Jack Ashley (Stoke-on-Trent, South): The hon. Member for Luton, South (Mr. Bright) made a fine speech. I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. He was trying to have it both ways and, of course, he failed. The hon. Member for Rugby and Kenilworth offered the House a completely bogus argument in saying that Britain has always had corporal punishment and that it is a tradition. Imagine the House debating slavery and advocates of that practice saying, "We have always had slavery in this country and it is a tradition." Imagine our predecessors in this place saying, "We have always denied women the vote, so why should we now allow women to vote? It is a tradition that they have not had the vote." I know that even now some hon. Members do not like the concept of women having the vote. Some Members prefer slavery and others prefer corporal punishment. Let it be understood that the tradition argument is bogus and nonsensical. The hon. Member for Rugby and Kenilworth was offering us a message of despair, in stark contrast to the message offered by Mr. Martin Rosenbaum of the Society of Teachers Opposed to Physical Punishment. He has presented a marvellous message of enlightenment, in which he sets out how Britain can rid itself of corporal punishment to the advantage of teachers and pupils. I pay tribute to Mr. Rosenbaum for the fine work that he has done on behalf of STOPP. Those who oppose STOPP surely do not know anything about the research that it has conducted. Hansard Corporal Punishment (22/07/1986): http://hansard.millbanksystems.com/commons/1986/jul/22/abolition-of-corporalpunishment

WHAT WE WANT TO EXTRACT Extract relations between people automatically: every mention of people as individuals, or of organisations which we consider to have the same properties as a person.

WHY ONLY INDIVIDUALS? "Some Members prefer slavery and others prefer corporal punishment." We know Members are being attacked by this statement, through the comparison of slavery with the topic of the debate, corporal punishment. But could we say definitively which Members are meant? Could we do that automatically?

WHAT WE WANT TO EXTRACT [The same example text as before, with each mention of a person highlighted by colour.] Green: Positive. Red: Negative.

EXTRACTION TECHNIQUES The hon. Member for Luton, South (Mr. Bright) made a fine speech. I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. Each of these sentences can be extracted in the same way, which requires research into the data as a whole.

DOMAIN SPECIFIC RULES The hon. Member for Luton, South (Mr. Bright) made a fine speech. I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. All three are extracted using rules which are specific to the topic domain. These can be applied to any topic as long as the sentences share common properties. Created by you.
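As a concrete illustration, here is a minimal sketch of one such rule in Python. The regular expression, the helper name, and the crude abbreviation handling are illustrative assumptions, not the exact rules used in the original work:

```python
import re

# Domain-specific rule for Hansard: extract any sentence naming an
# "hon. Member for <constituency> (<Mr./Mrs. Name>)".
MEMBER_PATTERN = re.compile(
    r"hon\. Member for (?P<constituency>[A-Z][\w,\- ]+?) \((?P<name>Mrs?\. [A-Z]\w+)\)"
)

def extract_member_sentences(text):
    """Return (sentence, constituency, name) for each sentence matching the rule."""
    # Protect abbreviations so a naive split on '. ' does not break them.
    protected = text.replace("hon.", "hon@").replace("Mr.", "Mr@").replace("Mrs.", "Mrs@")
    results = []
    for raw in protected.split(". "):
        sentence = raw.replace("@", ".")
        for match in MEMBER_PATTERN.finditer(sentence):
            results.append((sentence, match.group("constituency"), match.group("name")))
    return results
```

Run over the example text, this picks out Mr. Bright (Luton, South) and Mr. Pawsey (Rugby and Kenilworth).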

EXTRACTION TECHNIQUES He was trying to have it both ways and, of course, he failed.

PART-OF-SPEECH (POS) TAGGER He was trying to have it both ways and, of course, he failed. We break a sentence down into nouns, verbs, adjectives, etc., then search for the POS we need so we can determine whether to extract the sentence. We can do this using a library. Stanford Part-Of-Speech Tagger: http://nlp.stanford.edu/software/tagger.shtml

STANFORD PART-OF-SPEECH TAGGER Sentences are broken down into individual tokens (words), and each token is assigned a POS tag. Uses the Penn Treebank tag set: NN (noun), NNS (noun, plural), NNP (proper noun, singular), NNPS (proper noun, plural), PRP (personal pronoun), PRP$ (possessive pronoun). We can look for tokens in a sentence which have been tagged as pronouns (he, she, him, her), then extract the sentence if it contains a pronoun.
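A minimal sketch of the pronoun rule, using NLTK's tagger, which also emits Penn Treebank tags; the Stanford tagger linked above returns the same kind of token/tag pairs. Assumes NLTK and its tokenizer and tagger models are installed:

```python
import nltk  # assumes the 'punkt' tokenizer and POS tagger models are downloaded

PRONOUN_TAGS = {"PRP", "PRP$"}  # personal and possessive pronouns

def contains_pronoun(sentence):
    """True if any token in the sentence is tagged as a pronoun."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # tagged looks like [('He', 'PRP'), ('was', 'VBD'), ('trying', 'VBG'), ...]
    return any(tag in PRONOUN_TAGS for _, tag in tagged)

print(contains_pronoun("He was trying to have it both ways and, of course, he failed."))
# -> True, so we extract this sentence
```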

UPDATE The hon. Member for Luton, South (Mr. Bright) made a fine speech. I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. He was trying to have it both ways and, of course, he failed. The hon. Member for Rugby and Kenilworth offered the House a completely bogus argument in saying that Britain has always had corporal punishment and that it is a tradition. He has presented a marvellous message of enlightenment, in which he sets out how Britain can rid itself of corporal punishment to the advantage of teachers and pupils.

EXTRACTION TECHNIQUES The hon. Member for Rugby and Kenilworth was offering us a message of despair, in stark contrast to the message offered by Mr. Martin Rosenbaum of the Society of Teachers Opposed to Physical Punishment. I pay tribute to Mr. Rosenbaum for the fine work that he has done on behalf of STOPP. Both sentences can be extracted by domain-specific rules and POS tagging. But both can also be extracted by something else.

NAMED ENTITY RECOGNITION (NER) The hon. Member for Rugby and Kenilworth was offering us a message of despair, in stark contrast to the message offered by Mr. Martin Rosenbaum of the Society of Teachers Opposed to Physical Punishment. I pay tribute to Mr. Rosenbaum for the fine work that he has done on behalf of STOPP. Each sentence can be extracted using NER: we look for people in sentences and then decide if the sentence should be extracted. Use Stanford NER: http://nlp.stanford.edu/software/crf-ner.shtml

STANFORD NER We have the use of three different models: 3-class, 4-class and 7-class. 3-class: Location, Person, Organisation. 4-class: Location, Person, Organisation, Misc. 7-class: Location, Person, Organisation, Money, Percent, Date, Time. Each is trained on slightly different data, so the accuracy of each model in deciding what is and isn't a person will vary.

STANFORD NER The 7-class model looks like the best option just because it does more. Accuracy is a problem, however: precisely because it does more, it may not have the greatest fine-grained accuracy. Use the 4-class model.
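A minimal sketch of running the 4-class model through NLTK's wrapper. The two file paths are assumptions: point them at wherever you unpacked the Stanford NER download linked above:

```python
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    "english.conll.4class.distsim.crf.ser.gz",  # 4-class model (path is an assumption)
    "stanford-ner.jar",                         # Stanford NER jar (path is an assumption)
)

sentence = "I pay tribute to Mr. Rosenbaum for the fine work that he has done on behalf of STOPP."
tags = tagger.tag(sentence.split())
# e.g. [('I', 'O'), ..., ('Rosenbaum', 'PERSON'), ..., ('STOPP', 'ORGANIZATION')]

# Extract the sentence if it mentions a person or an organisation.
if any(label in ("PERSON", "ORGANIZATION") for _, label in tags):
    print(sentence)
```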

UPDATE We've managed to extract all the sentences we want from the text. Our main goal, though, was to extract relations between speakers. So we still have to decide: which sentence was said by whom; who the target of each sentence was; and whether a sentence is positive or negative.

WHO SAID WHAT AND WHO WAS THE TARGET This is solved using anaphora resolution, which basically means resolving references back to something said earlier. Finding who said what is easy in Hansard, because every statement is marked with its speaker; in this case it is Mr. Jack Ashley. Deciding the target of each sentence is much harder.

TARGET FOR SENTENCES The hon. Member for Luton, South (Mr. Bright) made a fine speech. I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. We can use the techniques we devised previously to extract sentences, and instead use them to find the target. We search for "hon. Member", and if a location or a name in brackets follows, we have the target for our sentence. How do we decide if there is a location? NER extracts locations, or we can look for the word "for", which is a bit easier.

TARGET FOR SENTENCES The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. He was trying to have it both ways and, of course, he failed. We can look backwards: if the speaker uses "He" or "hon. Gentleman", we know the target was referred to at an earlier point. We can look back one sentence at a time and apply the same set of target rules, and then we know who the target of these sentences is. "I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey)." This precedes both sentences, so the target is Mr. Pawsey.

TARGET FOR SENTENCES We can also extend NER and use it for anaphora resolution: if a sentence contains a person's name, then the target is that person.
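A minimal sketch of these target rules combined: take an explicit target if the sentence names one, otherwise walk back one sentence at a time when an anaphor is present. The pattern, anaphor list and function name are illustrative assumptions (NER person tags could replace the regex):

```python
import re

TARGET_PATTERN = re.compile(r"hon\. Member for [\w,\- ]+? \((Mrs?\. \w+)\)")
ANAPHORS = ("He ", "he ", "hon. Gentleman")

def resolve_target(sentences, index):
    """Resolve the target of sentences[index] using the look-back rule."""
    match = TARGET_PATTERN.search(sentences[index])
    if match:
        return match.group(1)              # explicit target in this sentence
    if any(a in sentences[index] for a in ANAPHORS):
        for i in range(index - 1, -1, -1): # look back one sentence at a time
            match = TARGET_PATTERN.search(sentences[i])
            if match:
                return match.group(1)
    return None

sentences = [
    "The hon. Member for Luton, South (Mr. Bright) made a fine speech.",
    "I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey).",
    "The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit.",
    "He was trying to have it both ways and, of course, he failed.",
]
print(resolve_target(sentences, 3))  # -> 'Mr. Pawsey'
```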

UPDATE The hon. Member for Luton, South (Mr. Bright) made a fine speech. Target: Mr. Bright Source: Mr. Ashley I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). Target: Mr. Pawsey Source: Mr. Ashley The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. Target: Mr. Pawsey Source: Mr. Ashley

DECIDING POLARITY OF SENTENCES To capture ethos and decide whether it is positive or negative, we need to perform sentiment analysis. Sentiment analysis uses features from the sentence to decide whether the sentence is positive or negative. Two approaches: discourse-based or machine learning.

DISCOURSE SENTIMENT ANALYSIS Keywords are used to decide if a sentence is positive or negative. This is done using manually built word dictionaries, with words classified as either positive or negative: "bad" is negative, "good" is positive. Each sentence is tallied for its number of negative and positive words, and whichever count is higher produces the classification of positive or negative.

DISCOURSE SENTIMENT ANALYSIS Bing Liu created a dictionary of around 6,000 words: https://www.cs.uic.edu/~liub/fbs/sentiment-analysis.html A list of positive and negative words. The words are base forms, not plurals.
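A minimal sketch of the tallying approach. The two word sets below are tiny stand-ins for a full lexicon such as Bing Liu's:

```python
# Tiny stand-in lexicons; in practice load Bing Liu's full word lists.
POSITIVE = {"fine", "good", "marvellous", "enlightenment", "tribute"}
NEGATIVE = {"bad", "bogus", "nonsensical", "despair", "failed"}

def classify(sentence):
    """Tally positive and negative words; the larger count wins."""
    words = [w.strip('.,"!?()').lower() for w in sentence.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos == neg:
        return "neutral"  # tie: the tally gives no majority
    return "positive" if pos > neg else "negative"

print(classify("The hon. Member for Luton, South (Mr. Bright) made a fine speech."))
# -> 'positive' ('fine' is in the positive list)
```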

HOW TO AID DISCOURSE SENTIMENT ANALYSIS Removal of stop words, e.g. "the", "a", etc. This has been argued for and against, with varying results; it depends on the data set. Stemming and lemmatization involve producing the base form of a word from an inflected form: cats → cat, cars → car. Use Stanford's material on stemming and lemmatization: http://nlp.stanford.edu/ir-book/html/htmledition/stemming-and-lemmatization-1.html This could make the sentiment analysis more accurate.
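A minimal sketch using NLTK's stemmer and lemmatizer (the linked chapter describes the techniques themselves; the WordNet lemmatizer assumes the 'wordnet' corpus is downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("cats"))                      # -> 'cat'
print(lemmatizer.lemmatize("cars"))              # -> 'car'
print(stemmer.stem("dodging"))                   # -> 'dodg' (stems need not be real words)
print(lemmatizer.lemmatize("dodging", pos="v"))  # -> 'dodge' (lemmas are real words)
```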

PROBLEMS WITH DISCOURSE SENTIMENT ANALYSIS Words preceded by "not", e.g. "not good", "not bad": negation completely changes the polarity from positive to negative and vice versa. Words which have not been considered, e.g. "bogus": an uncommon word which may not be contained in a sentiment dictionary. Time consuming: to get accurate sentiment analysis you would need to create an exhaustive list of all good and bad words, and if your text is over 40,000 words that can take a long time.

MACHINE LEARNING SENTIMENT ANALYSIS Machine learning can be used for this purpose. This involves training a classifier on some manually annotated data and then running it over text to produce classifications. We can take a percentage of our transcript, say 70%, and use that as training data, then use the remaining 30% as test data.

MACHINE LEARNING SENTIMENT ANALYSIS Naïve Bayes and Support Vector Machines (SVMs). There are tonnes of online tutorials and tools you can use. SVMs tend to produce higher accuracy than Naïve Bayes when performing sentiment analysis, but it is dependent on the dataset. http://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
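A minimal sketch of training and testing an SVM sentiment classifier with scikit-learn, one of many toolkits that will do this; the four labelled sentences are toy stand-ins for a real annotated set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

sentences = [
    "made a fine speech",
    "a marvellous message of enlightenment",
    "such a bad speech",
    "a completely bogus argument",
]
labels = ["positive", "positive", "negative", "negative"]

X = CountVectorizer().fit_transform(sentences)

# 70/30 train/test split, as on the previous slide.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

classifier = LinearSVC().fit(X_train, y_train)
print(classifier.predict(X_test))
```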

ML BASIC APPROACH To use ML we need to extract features from the text that highlight important information. The most basic approach is Bag of Words: we take all the words of a sentence and create a dictionary from them. "The cat sat on the mat" is transformed into: 1:2 2:1 3:1 4:1 5:1 (each distinct word gets an index and a count; "the" appears twice, so feature 1 has count 2). When performing sentiment analysis we can then decide if this sentence is positive or negative and let our SVM classifier do the work. This is just words, but anything can be added as a feature too: length, POS, etc.
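A minimal sketch of that encoding, to make the index:count format explicit (the function name is an assumption for illustration):

```python
def bag_of_words(sentence):
    """Map each distinct word to an index and emit index:count feature pairs."""
    vocabulary, counts = {}, {}
    for word in sentence.lower().split():
        index = vocabulary.setdefault(word, len(vocabulary) + 1)
        counts[index] = counts.get(index, 0) + 1
    return " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))

print(bag_of_words("The cat sat on the mat"))  # -> '1:2 2:1 3:1 4:1 5:1'
```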

ML BASIC APPROACH We can then extract unigrams, bigrams and trigrams and use these as features in our dictionary too. So we get: the, cat, sat, on, mat, the cat, cat sat, sat on, on the, the mat, the cat sat, etc.
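A minimal sketch of generating unigram, bigram and trigram features, here with scikit-learn's CountVectorizer (an assumption about tooling; any n-gram extractor works):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 3))  # unigrams, bigrams and trigrams
features = vectorizer.fit_transform(["The cat sat on the mat"])

print(vectorizer.get_feature_names_out())
# ['cat' 'cat sat' 'cat sat on' 'mat' 'on' 'on the' 'on the mat' 'sat'
#  'sat on' 'sat on the' 'the' 'the cat' 'the cat sat' 'the mat']
```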

PROBLEMS USING ML We need a very large data set. If we have only 30 sentences with which to classify a full transcript containing 100 sentences, the accuracy of the sentiment classifier will be poor. Again, ML training is time consuming, as it involves manually annotating a large amount of data.

PROBLEM OF USING ML ON OUR DATA The hon. Member for Luton, South (Mr. Bright) made a fine speech. Target: Mr. Bright Source: Mr. Ashley I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). Target: Mr. Pawsey Source: Mr. Ashley The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. Target: Mr. Pawsey Source: Mr. Ashley

PROBLEM OF USING ML ON OUR DATA We need to remove names, locations and domain-rule cues so that the SVM isn't influenced by these when classifying. Again we need to perform stemming and lemmatization so that words can be matched in the dictionary. Again we need to decide if the removal of stop words will help in classifying the data. All in all, ML is mainly about trial and error on your data.
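A minimal sketch of masking those cues before training; both patterns are illustrative assumptions about what counts as a domain cue, and only the bracketed member form is handled:

```python
import re

def mask(sentence):
    """Replace names and member references with placeholder tokens."""
    sentence = re.sub(r"hon\. Member for [\w,\- ]+?(?= \()", "MEMBER", sentence)
    sentence = re.sub(r"hon\. Gentleman", "MEMBER", sentence)
    sentence = re.sub(r"Mrs?\. [A-Z]\w+", "PERSON", sentence)
    return sentence

print(mask("The hon. Member for Luton, South (Mr. Bright) made a fine speech."))
# -> 'The MEMBER (PERSON) made a fine speech.'
```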

UPDATE The hon. Member for Luton, South (Mr. Bright) made a fine speech. Target: Mr. Bright Source: Mr. Ashley Sentiment: Positive I have never heard such a bad speech from the hon. Member for Rugby and Kenilworth (Mr. Pawsey). Target: Mr. Pawsey Source: Mr. Ashley Sentiment: Negative The hon. Gentleman tried to observe the letter of the EEC regulations, while dodging their spirit. Target: Mr. Pawsey Source: Mr. Ashley Sentiment: Negative

CHALLENGES What if we want to classify a sentence with a double meaning? How do we classify the sentiment of this, and who is the target? The hon. Member for Rugby and Kenilworth was offering us a message of despair, in stark contrast to the message offered by Mr. Martin Rosenbaum of the Society of Teachers Opposed to Physical Punishment.

CHALLENGES We can segment the sentence: "The hon. Member for Rugby and Kenilworth was offering us a message of despair" / "in stark contrast to the message offered by Mr. Martin Rosenbaum of the Society of Teachers Opposed to Physical Punishment." But do we then lose the context of the previous segment? We could use the previous sentence as a feature for classifying the next.

EVALUATING THE CLASSIFIER There are many different metrics to evaluate the effectiveness of a classifier: F1-score, accuracy, Kappa. They all have advantages, again depending on your dataset.

EXAMPLE We have 100 sentences, 20 of which contain a relationship between two people. We classify 5 correctly as containing a relationship, and 70 correctly as not containing a relationship. So 15 are classified as negative when they are actually positive, and 10 are classified as positive when they are actually negative. Use a confusion matrix to help.

CONFUSION MATRIX
                     Actual positive        Actual negative
Predicted positive   True Positives:  5     False Positives: 10
Predicted negative   False Negatives: 15    True Negatives:  70

EVALUATION We can calculate accuracy: (TP + TN) / (TP + TN + FP + FN) = 75 / 100 = 0.75. But we only classified 5 relations correctly, yet we have 75% accuracy. The F1-score is better for evaluating this. We need precision (P): TP / (TP + FP) = 5 / 15 = 0.33. We need recall (R): TP / (TP + FN) = 5 / 20 = 0.25. F1-score = 2 × ((P × R) / (P + R)) = 0.28.
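The same arithmetic as a runnable check (the slide's 0.28 comes from rounding P and R before combining):

```python
tp, fp, fn, tn = 5, 10, 15, 70

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(accuracy)             # 0.75
print(round(precision, 2))  # 0.33
print(round(recall, 2))     # 0.25
print(round(f1, 2))         # 0.29 (0.28 on the slide, which rounds P and R first)
```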

SUMMARY An example text to extract information from. Techniques which can be used to extract that information. Stanford libraries for extraction. ML. Some of the challenges we face. Evaluation of the classifier.