PlotDevice Libraries: Linguistics

Description

With the NodeBox English Linguistics library you can do grammar inflection and semantic operations on English content. You can use the library to conjugate verbs, pluralize nouns, write out numbers, find dictionary descriptions and synonyms for words, summarise texts and parse grammatical structure from sentences.

The library bundles WordNet (using Oliver Steele’s PyWordNet), NLTK, Damian Conway’s pluralisation rules, Bermi Ferrer’s singularization rules, Jason Wiener’s Brill tagger, several algorithms adopted from Michael Granger’s Ruby Linguistics module, John Wiseman’s implementation of the Regressive Imagery Dictionary, Charles K. Ogden’s list of basic English words, and Peter Norvig’s spelling corrector.

Download

linguistics.zip (15MB)
Last updated for NodeBox 1.9.4.2
Licensed under GPL
Author: Tom De Smedt

Documentation

How to get the library up and running
Categorise words as nouns, verbs, numbers and more
Categorise words as emotional, persuasive or connective
Converting between numbers and words
Quantification of numbers and lists (e.g. 367 x chicken = hundreds of chickens)
Indefinite article: a or an
Pluralization/singularization of nouns
Emotional value of a word
WordNet glossary, synonyms, antonyms, components
Verb conjugation
Spelling corrections
Shallow parsing, the grammatical structure of a sentence
Summarisation of text to keywords
Regressive Imagery Dictionary, content analysis
Ogden’s basic English words

How to get the library up and running

Put the en library folder in the same folder as your script so PlotDevice can find the library. You can also put it in ~/Library/Application Support/PlotDevice/. It takes some time to load all the data the first time.

import en

Categorise words as nouns, verbs, numbers and more

The is_number() command returns True when the given value is a number:

print en.is_number(12)
print en.is_number("twelve")
>>> True
>>> True

The is_noun() command returns True when the given string is a noun. You can also check for is_verb(), is_adjective() and is_adverb():

print en.is_noun("banana")
>>> True

The is_tag() command returns True when the given string is a tag, for example HTML or XML.

The is_html_tag() command returns True when the string is a HTML tag.

Categorise words as emotional, persuasive or connective

The is_basic_emotion() command returns True if the given word expresses a basic emotion (anger, disgust, fear, joy, sadness, surprise):

print en.is_basic_emotion("cheerful")
>>> True

The is_persuasive() command returns True if the given word is a ‘magic’ word (you, money, save, new, results, health, easy, ...):

print en.is_persuasive("money")
>>> True

The is_connective() command returns True if the word is a connective (nevertheless, whatever, secondly, ... and words like I, the, own, him which have little semantical value):

print en.is_connective("but")
>>> True

Converting between numbers and words

The number.ordinal() command returns the ordinal of the given number. For example, 100 yields 100th, 3 yields 3rd and twenty-one yields twenty-first:

print en.number.ordinal(100)
print en.number.ordinal("twenty-one")
>>> 100th
>>> twenty-first
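
The suffix rule behind results like 100th and 3rd can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's own code: the name ordinal_suffix is hypothetical, and the sketch handles integers only, not spelled-out numbers such as twenty-one.

```python
def ordinal_suffix(n):
    """Return the integer n with its English ordinal suffix, e.g. 3 -> '3rd'."""
    # Hypothetical helper, not part of the en library.
    # Numbers ending in 11, 12 or 13 (and the rest of the teens) take 'th'.
    if 10 <= abs(n) % 100 <= 20:
        suffix = "th"
    else:
        # Otherwise the last digit decides: 1 -> st, 2 -> nd, 3 -> rd.
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(abs(n) % 10, "th")
    return str(n) + suffix
```

With this sketch, ordinal_suffix(21) gives 21st while ordinal_suffix(11) gives 11th.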

The number.spoken() command writes out the given number:

print en.number.spoken(25)
>>> twenty-five

Quantification of numbers and lists

The number.quantify() command quantifies the given word:

print en.number.quantify(10, "chicken")
print en.number.quantify(800, "chicken")
>>> a number of chickens
>>> hundreds of chickens

The list.conjunction() command quantifies a list of words. Notice how goose is correctly pluralized and duck has the right article.

farm = ["goose", "goose", "duck", "chicken", "chicken", "chicken"]
print en.list.conjunction(farm)
>>> several chickens, a pair of geese and a duck
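
Before a quantifier can be chosen, the occurrences of each word have to be counted. The sketch below shows only that counting step in plain Python, for a farm list with two geese, one duck and three chickens. It is an assumption about the general approach, not the library's code.

```python
from collections import Counter

farm = ["goose", "goose", "duck", "chicken", "chicken", "chicken"]

# Count how often each word occurs, most frequent first.
# (Illustrative only; the en library may count and order differently.)
counts = Counter(farm).most_common()
# counts == [('chicken', 3), ('goose', 2), ('duck', 1)]
```

The order matches the output above: the most frequent word comes first, and each count would then be mapped to a phrase such as several, a pair of, or an indefinite article.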

You can also quantify the types of things in the given list, class or module:

print en.list.conjunction((1,2,3,4,5), generalize=True)
print en.list.conjunction(en, generalize=True)
>>> several integers
>>> a number of modules, a number of functions, a number of strings,
>>> a pair of lists, a pair of dictionaries, an en verb, an en sentence,
>>> an en number, an en noun, an en list, an en content, an en adverb,
>>> an en adjective, a None type and a plotdevice graphics cocoa Context class

Indefinite article: a or an

The noun.article() command returns the noun with its indefinite article:

print en.noun.article("university")
print en.noun.article("owl")
print en.noun.article("hour")
>>> a university
>>> an owl
>>> an hour
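
Choosing between a and an depends on how a word sounds, not on how it is spelled, which is why university and hour are the interesting cases. Here is a rough sketch of such a heuristic in plain Python. The function name and the two short exception lists are hypothetical and far from complete; the library's own rules are not shown here.

```python
def indefinite_article(word):
    """Prefix word with 'a' or 'an' using a rough sound-based heuristic."""
    # Hypothetical exception lists; a real implementation needs many more.
    silent_h = ("hour", "honest", "honor", "heir")
    consonant_sound = ("univers", "unit", "unic", "use", "eu")
    lower = word.lower()
    if lower.startswith(silent_h):
        article = "an"      # silent h: an hour
    elif lower.startswith(consonant_sound):
        article = "a"       # vowel letter, consonant sound: a university
    elif lower[:1] in ("a", "e", "i", "o", "u"):
        article = "an"      # plain vowel: an owl
    else:
        article = "a"
    return article + " " + word
```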

Pluralization and singularization of nouns

The noun.plural() command pluralizes the given noun:

print en.noun.plural("child")
print en.noun.plural("kitchen knife")
print en.noun.plural("wolf")
print en.noun.plural("part-of-speech")
>>> children
>>> kitchen knives
>>> wolves
>>> parts-of-speech
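
To give an idea of how rule-based pluralization works, here is a much reduced sketch in plain Python: a small table of irregular nouns plus a handful of suffix rules, applied to the last word of the noun. The names IRREGULAR and plural are hypothetical, and the rule set is only a tiny assumed subset; it does not handle hyphenated compounds like part-of-speech or classical forms.

```python
# Hypothetical irregular forms; the real rule set is far larger.
IRREGULAR = {"child": "children", "person": "people", "goose": "geese"}

def plural(noun):
    """Pluralize the last word of a noun with a few suffix rules."""
    head, _, last = noun.rpartition(" ")
    prefix = head + " " if head else ""
    if last in IRREGULAR:
        result = IRREGULAR[last]
    elif last.endswith("fe"):
        result = last[:-2] + "ves"      # knife -> knives
    elif last.endswith("lf"):
        result = last[:-1] + "ves"      # wolf -> wolves
    elif last.endswith("y") and last[-2:-1] not in ("a", "e", "i", "o", "u"):
        result = last[:-1] + "ies"      # city -> cities
    elif last.endswith(("s", "x", "z", "ch", "sh")):
        result = last + "es"            # box -> boxes
    else:
        result = last + "s"             # chicken -> chickens
    return prefix + result
```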

You can also do adjective.plural().

An optional classical parameter is True by default and determines if either classical or modern inflection is used (e.g. classical pluralization of octopus yields octopodes instead of octopuses).

The noun.singular() command singularizes the given plural:

print en.noun.singular("people")
>>> person

Emotional value of a word

The noun.is_emotion() command guesses whether the given noun expresses an emotion by checking if there are synonyms of the word that are basic emotions. Returns True or False by default.

print en.noun.is_emotion("anger")
>>> True

With the boolean=False parameter, the command returns a string with some more information instead of True or False:

print en.adjective.is_emotion("anxious", boolean=False)
>>> fear

An additional optional parameter shallow=True speeds up the lookup process but doesn’t check as many synonyms. You can also use verb.is_emotion(), adjective.is_emotion() and adverb.is_emotion().

WordNet glossary, synonyms, antonyms, components

WordNet describes semantic relations between synonym sets.

The noun.gloss() command returns the dictionary description of a word:

print en.noun.gloss("book")
>>> a written work or composition that has been published (printed on pages
>>> bound together); "I am reading a good book on economics"

A word can have multiple senses, for example ‘tree’ can mean a tree in a forest but also a tree diagram, or a person named Sir Herbert Beerbohm Tree:

print en.noun.senses("tree")
>>> [['tree'], ['tree', 'tree diagram'], ['Tree', 'Sir Herbert Beerbohm Tree']]

print en.noun.gloss("tree", sense=1)
>>> a figure that branches from a single root; "genealogical tree"

The noun.lexname() command returns a categorization for the given word:

print en.noun.lexname("book")
>>> communication

The noun.hyponym() command returns examples of the given word:

print en.noun.hyponym("vehicle")
>>> [['bumper car', 'Dodgem'], ['craft'], ['military vehicle'], ['rocket',
>>>  'projectile'], ['skibob'], ['sled', 'sledge', 'sleigh'], ['steamroller',
>>>  'road roller'], ['wheeled vehicle']]

print en.noun.hyponym("tree", sense=1)
>>> [['cladogram'], ['stemma']]

The noun.hypernym() command returns abstractions of the given word:

print en.noun.hypernym("earth")
print en.noun.hypernym("earth", sense=1)
>>> [['terrestrial planet']]
>>> [['material', 'stuff']]

You can also execute a deep query on hypernyms and hyponyms. Notice how returned values become more and more abstract:

print en.noun.hypernyms("vehicle", sense=0)
>>> [['vehicle'], ['conveyance', 'transport'],
>>>  ['instrumentality', 'instrumentation'],
>>>  ['artifact', 'artefact'], ['whole', 'unit'],
>>>  ['object', 'physical object'],
>>>  ['physical entity'], ['entity']]

The noun.holonym() command returns components of the given word:

print en.noun.holonym("computer")
>>> [['busbar', 'bus'], ['cathode-ray tube', 'CRT'],
>>>  ['central processing unit', 'CPU', 'C.P.U.', 'central processor',
>>>   'processor', 'mainframe'] ...

The noun.meronym() command returns the collection in which the given word can be found:

print en.noun.meronym("tree")
>>> [['forest', 'wood', 'woods']]

The noun.antonym() command returns the semantic opposite of the word:

print en.noun.antonym("black")
>>> [['white', 'whiteness']]

Find out what two words have in common:

print en.noun.meet("cat", "dog", sense1=0, sense2=0)
>>> [['carnivore']]

The noun.absurd_gloss() command returns an absurd description for the word:

print en.noun.absurd_gloss("typography")
>>> a business deal on a trivial scale

The return value of a WordNet command is usually a list containing other lists of related words. You can use the en.list.flatten() command to flatten the list:

print en.list.flatten(en.noun.senses("tree"))
>>> ['tree', 'tree', 'tree diagram', 'Tree', 'Sir Herbert Beerbohm Tree']
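
Flattening a list of lists is a short recursive routine. The sketch below is plain Python written for illustration (the library's own flatten() may differ in detail) and uses the senses of tree as sample data.

```python
def flatten(nested):
    """Return a flat list of the items in arbitrarily nested lists or tuples."""
    # Illustrative stand-in, not the library's implementation.
    flat = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            # Recurse into sublists and splice their items in place.
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

senses = [["tree"], ["tree", "tree diagram"],
          ["Tree", "Sir Herbert Beerbohm Tree"]]
flat_senses = flatten(senses)
```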

If you want a list of all nouns/verbs/adjectives/adverbs there’s the wordnet.all_nouns(), wordnet.all_verbs() ... commands:

print len(en.wordnet.all_nouns())
>>> 117096

All of the commands shown here for nouns are also available for verbs, adjectives and adverbs: en.verb.hypernyms("run"), en.adjective.gloss("beautiful") and so on are valid commands.

Verb conjugation

PlotDevice English Linguistics knows the verb tenses for about 10000 English verbs.

The verb.infinitive() command returns the infinitive form of a verb:

print en.verb.infinitive("swimming")
>>> swim

The verb.present() command returns the present tense for the given person. Known values for person are 1, 2, 3, ‘1st’, ‘2nd’, ‘3rd’, ‘plural’, ‘*’. Just use the one you like most.

print en.verb.present("gave")
print en.verb.present("gave", person=3, negate=False)
>>> give
>>> gives

The verb.present_participle() command returns the present participle tense:

print en.verb.present_participle("be")
>>> being

Return the past tense:

print en.verb.past("give")
print en.verb.past("be", person=1, negate=True)
>>> gave
>>> wasn't

Return the past participle tense:

print en.verb.past_participle("be")
>>> been

A list of all possible tenses:

print en.verb.tenses()
>>> ['past', '3rd singular present', 'past participle', 'infinitive',
>>>  'present participle', '1st singular present', '1st singular past',
>>>  'past plural', '2nd singular present', '2nd singular past',
>>>  '3rd singular past', 'present plural']

The verb.tense() command returns the tense of the given verb:

print en.verb.tense("was")
>>> 1st singular past

Return True if the given verb is in the given tense:

print en.verb.is_tense("wasn't", "1st singular past", negated=True)
print en.verb.is_present("does", person=1)
print en.verb.is_present_participle("doing")
print en.verb.is_past_participle("done")
>>> True
>>> False
>>> True
>>> True

The verb.is_tense() command also accepts shorthand aliases for tenses: inf, 1sgpres, 2sgpres, 3sgpres, pl, prog, 1sgpast, 2sgpast, 3sgpast, pastpl and ppart.
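
The aliases appear to line up with the tense names listed by verb.tenses(). The table below is an assumed mapping, inferred purely from the alias names and not taken from the library source, so treat it as a reading aid and verify it against the library before relying on it. TENSE_ALIASES and expand_tense are hypothetical names.

```python
# Assumed mapping, inferred from the alias names; check it against the library.
TENSE_ALIASES = {
    "inf": "infinitive",
    "1sgpres": "1st singular present",
    "2sgpres": "2nd singular present",
    "3sgpres": "3rd singular present",
    "pl": "present plural",
    "prog": "present participle",
    "1sgpast": "1st singular past",
    "2sgpast": "2nd singular past",
    "3sgpast": "3rd singular past",
    "pastpl": "past plural",
    "ppart": "past participle",
}

def expand_tense(name):
    """Return the full tense name for an alias, or name itself if unknown."""
    return TENSE_ALIASES.get(name, name)
```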

Spelling corrections

PlotDevice English Linguistics is able to perform spelling corrections based on Peter Norvig’s algorithm. The spelling corrector has an accuracy of about 70%.

The spelling.suggest() command returns a list of possible corrections for a given word. The spelling.correct() command returns the corrected version (best guess) of the word.

print en.spelling.suggest("comptuer")
>>> ['computer']
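
The core idea of Norvig's approach is to generate every string that is one simple edit away from the misspelled word and keep those that are known words, preferring the most frequent. The sketch below shows that idea in plain Python. It is a simplified illustration, not the bundled corrector: WORD_COUNTS is a hypothetical toy table (a real corrector learns counts from a large corpus), and the sketch stops at one edit.

```python
import string

# Hypothetical toy word counts, for illustration only.
WORD_COUNTS = {"computer": 50, "compute": 20, "commuter": 5}

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(word):
    """Known words equal to word or one edit away, most frequent first."""
    if word in WORD_COUNTS:
        return [word]
    known = [w for w in edits1(word) if w in WORD_COUNTS]
    return sorted(known, key=WORD_COUNTS.get, reverse=True)

def correct(word):
    """Best guess for word, or word itself when nothing is known."""
    suggestions = suggest(word)
    return suggestions[0] if suggestions else word
```

Here comptuer is one transposition away from computer, so suggest("comptuer") finds it in the toy table.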

Shallow parsing, the grammatical structure of a sentence

PlotDevice English Linguistics is able to do sentence structure analysis using a combination of Jason Wiener’s tagger and NLTK’s chunker. The tagger assigns a part-of-speech tag to each word in the sentence using a (Brill’s) lexicon. A postag is something like NN or VBP marking words as nouns, verbs, determiners, pronouns, etc. The chunker is then able to group syntactic units in the sentence. A syntactic unit is, for example, a determiner followed by adjectives followed by a noun: the tasty little chicken is a syntactic unit.

The sentence.tag() command tags the given sentence. The return value is a list of (word, tag) tuples. However, when you print it out it will look like a string.

print en.sentence.tag("this is so cool")
>>> this/DT is/VBZ so/RB cool/JJ

There are lots of part-of-speech tags and it takes some time getting to know them. The full list is here. The sentence.tag_description() command returns a (description, examples) tuple for a given tag:

print en.sentence.tag_description("NN")
>>> ('noun, singular or mass', 'tiger, chair, laughter')

The sentence.chunk() command returns the chunked sentence:

from pprint import pprint
pprint( en.sentence.chunk("we are going to school") )
>>> [['SP',
>>>   ['NP', ('we', 'PRP')],
>>>   ['AP',
>>>    ['VP', ('are', 'VBP'), ('going', 'VBG'), ('to', 'TO')],
>>>    ['NP', ('school', 'NN')]]]]

Now what does all this mean?

NP: noun phrases, syntactic units describing a noun, for example: a big fish.
VP: verb phrases, units of verbs and auxiliaries, for example: are going to.
AP: a verb/argument structure, a verb phrase and a noun phrase being influenced.
SP: a subject structure: a noun phrase which is the executor of a verb phrase or verb/argument structure.

A handy sentence.traverse(sentence, cmd) command lets you feed a chunked sentence to your own command chunk by chunk:

s = "we are going to school"
def callback(chunk, token, tag):
    if chunk is not None:
        # A chunk label: print its description in capitals.
        print en.sentence.tag_description(chunk)[0].upper()
    else:
        # A token: print it with the description of its tag.
        print token, "("+en.sentence.tag_description(tag)[0]+")"
en.sentence.traverse(s, callback)
>>> SUBJECT PHRASE
>>> NOUN PHRASE
>>> we (pronoun, personal)
>>> VERB PHRASE AND ARGUMENTS
>>> VERB PHRASE
>>> are (verb, non-3rd person singular present)
>>> going (verb, gerund or present participle)
>>> to (infinitival to)
>>> NOUN PHRASE
>>> school (noun, singular or mass)
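
The order in which the callback is invoked corresponds to a depth-first walk over the nested chunk lists. The sketch below reproduces that walk in plain Python on the chunked structure shown earlier. It is an illustration only: unlike sentence.traverse(), it takes an already chunked list rather than a sentence string, and it is not the library's code.

```python
def traverse(nodes, callback):
    """Call callback(chunk, token, tag) for every chunk label and token."""
    # Illustrative walk over chunk lists, not the library's traverse().
    for node in nodes:
        if isinstance(node, list):
            # A chunk: the first item is its label, the rest are children.
            callback(node[0], None, None)
            traverse(node[1:], callback)
        else:
            # A (token, tag) tuple.
            token, tag = node
            callback(None, token, tag)

chunked = [["SP",
            ["NP", ("we", "PRP")],
            ["AP",
             ["VP", ("are", "VBP"), ("going", "VBG"), ("to", "TO")],
             ["NP", ("school", "NN")]]]]

visited = []
traverse(chunked, lambda chunk, token, tag: visited.append(chunk or token))
# visited == ['SP', 'NP', 'we', 'AP', 'VP', 'are', 'going', 'to', 'NP', 'school']
```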

An even handier sentence.find(sentence, pattern) command lets you find patterns of text in a sentence:

s = "The world is full of strange and mysterious things."
print en.sentence.find(s, "JJ and JJ NN")
>>> [[('strange', 'JJ'), ('and', 'CC'),
>>>   ('mysterious', 'JJ'), ('things', 'NNS')]]

The returned list contains all chunks of text that matched the pattern. In the example above it retrieved all chunks of the form an adjective + and + an adjective + a noun. Notice that when you use something like ‘NN’ in your pattern (noun), NNS (plural nouns) are returned as well.

s = "The hairy hamsters visited the cruel dentist."
matches = en.sentence.find(s, "JJ NN")
print matches
>>> [[('hairy', 'JJ'), ('hamsters', 'NNS')],
>>>  [('cruel', 'JJ'), ('dentist', 'NN')]]
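
One simple way to get this behaviour, where NN also matches NNS, is to compare tags by prefix while sliding a window over the tagged sentence. The sketch below does that in plain Python. It is an assumption about how such matching can work, not the library's implementation: find_tags is a hypothetical name, the DT and VBD tags in the sample data are assumed, and literal words, optional parts and wildcards are left out.

```python
def find_tags(tagged, pattern):
    """Return every run of (word, tag) tuples whose tags match pattern."""
    wanted = pattern.split()
    matches = []
    for start in range(len(tagged) - len(wanted) + 1):
        window = tagged[start:start + len(wanted)]
        # Prefix comparison lets 'NN' also match 'NNS', 'NNP', ...
        if all(tag.startswith(w) for (_, tag), w in zip(window, wanted)):
            matches.append(window)
    return matches

# Sample data; the tags for 'The', 'visited' and 'the' are assumed.
tagged = [("The", "DT"), ("hairy", "JJ"), ("hamsters", "NNS"),
          ("visited", "VBD"), ("the", "DT"), ("cruel", "JJ"),
          ("dentist", "NN")]
found = find_tags(tagged, "JJ NN")
```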

An optional chunked parameter can be set to False to return strings instead of token/tag tuples. You can put pieces of the pattern between brackets to make them optional, or use wildcards:

s = "This makes the parser an extremely powerful tool."
print en.sentence.find(s, "(extreme*) (JJ) NN", chunked=False)
>>> ['parser', 'extremely powerful tool']

Finally, if you feel up to it, you can feed the following command a list of your own regular expression units to chunk. Mine are pretty basic as I’m not a linguist.

print en.sentence.chunk_rules()

Summarisation of text to keywords

PlotDevice English Linguistics is able to strip keywords from a given text.

en.content.keywords(txt, top=10, nouns=True, singularize=True, filters=[])

The content.keywords() command guesses a list of words that frequently occur in the given text. The return value is a list (length defined by top) of (count, word) tuples. When nouns is True, returns only nouns. The command furthermore ignores connectives, numbers and tags. When singularize is True, attempts to singularize nouns in the text. The optional filters parameter is a list of words which the command should ignore.

So, assuming you would want to summarise web content you can do the following:

from urllib import urlopen
html = urlopen("http://news.bbc.co.uk/").read()
meta = ["news", "health", "uk", "version", "weather",
        "video", "sport", "return", "read", "help"]
print en.content.keywords(html, filters=meta)
>>> [(6, 'funeral'), (5, 'beirut'), (3, 'war'), (3, 'service'), (3, 'radio'),
>>>  (3, 'lebanon'), (3, 'islamist'), (3, 'function'), (3, 'female')]
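
At its simplest, keyword extraction is frequency counting with a filter. The sketch below shows that core step in plain Python and returns (count, word) tuples like the command above. It is a deliberately reduced illustration, not the library's method: the stop list is a hypothetical stand-in for real connectives, and there is no noun detection, singularization or tag stripping.

```python
import re
from collections import Counter

def keywords(text, top=10, filters=()):
    """Return up to `top` (count, word) tuples for the most frequent words."""
    # Hypothetical, very short stop list standing in for real connectives.
    stop = {"the", "a", "an", "and", "of", "in", "to", "is", "was"}
    words = re.findall(r"[a-z]+", text.lower())
    ignore = stop | set(filters)
    counts = Counter(w for w in words if w not in ignore)
    return [(n, w) for w, n in counts.most_common(top)]
```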

Regressive Imagery Dictionary, psychological content analysis

PlotDevice English Linguistics is able to do psychological content analysis using John Wiseman’s Python implementation of the Regressive Imagery Dictionary. The RID assigns scores to primary, secondary and emotional process thoughts in a text.

Primary: free-form associative thinking involved in dreams and fantasy
Secondary: logical, reality-based and focused on problem solving
Emotions: expressions of fear, sadness, hate, affection, etc.

en.content.categorise(str)

The content.categorise() command returns a sorted list of categories found in the text. Each item in the list has the following properties:

item.name: the name of the category
item.count: the number of words in the text that fall into this category
item.words: a list of words from the text that fall into this category
item.type: the type of category, either ‘primary’, ‘secondary’ or ‘emotions’.

Let’s run a little test with Lucas’ Ideas from the Heart text:

txt = open("heart.txt").read()
summary = en.content.categorise(txt)
print summary.primary
print summary.secondary
print summary.emotions
>>> 0.290155440415
>>> 0.637305699482
>>> 0.0725388601036
# Lucas' text has a 64% secondary value.

# The top 5 categories in the text:
for category in summary[:5]:
    print category.name, category.count
>>> instrumental behavior 30
>>> abstraction 30
>>> social behavior 28
>>> temporal references 24
>>> concreteness 18

# Words in the top "instrumental behavior" category:
print summary[0].words
>>> ['students', 'make', 'students', 'reached', 'countless',
>>>  'student', 'workshop', 'workshop', 'students', 'finish',
>>>  'spent', 'produce', 'using', 'work', 'students', 'successful',
>>>  'workshop', 'students', 'pursue', 'skills', 'use',
>>>  'craftsmanship', 'use', 'using', 'workshops', 'workshops',
>>>  'result', 'students', 'workshops', 'student']

You can find all the categories for primary, secondary and emotional scores in the en.rid.primary, en.rid.secondary and en.rid.emotions lists.
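
The three scores printed in the example above add up to 1, which suggests that each one is the share of categorised words falling under that type. The sketch below works under that assumption in plain Python. The counts 56, 123 and 14 are not library output; they are whole numbers chosen because they reproduce the printed fractions, and type_shares is a hypothetical name.

```python
def type_shares(counts):
    """Turn per-type word counts into fractions that sum to 1."""
    # Assumes each score is simply count / total; not confirmed by the library.
    total = float(sum(counts.values()))
    return dict((name, n / total) for name, n in counts.items())

# Illustrative counts only, chosen to reproduce the fractions shown above.
shares = type_shares({"primary": 56, "secondary": 123, "emotions": 14})
```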

Ogden’s basic English words

PlotDevice English Linguistics comes bundled with Charles K. Ogden’s list of basic English words: a set of 2000 words that can express 90% of the concepts in English. The list is stored as en.basic.words. It can be filtered for nouns, verbs, adjectives and adverbs:

print en.basic.words
>>> ['a', 'able', 'about', 'account', 'acid', 'across', ... ]

print en.basic.verbs
>>> ['account', 'act', 'air', 'amount', 'angle', 'answer', ... ]