Musings, Rants and Ponderings Of A DB Architect: Summer of code 2017: Python, Day 44 Frequency of words in the book Moby Dick

Monday, July 31, 2017

Summer of code 2017: Python, Day 44 Frequency of words in the book Moby Dick

As explained in my Summer of code 2017: Python post, I decided to pick up Python.

This is officially day 44. Today I wanted to see if I could get a Python script to run and return all the words and their occurrences in the book Moby Dick.

Some interesting things you might want to know:

How many times is Moby used in the book?
How many distinct words in total?
How many words are used only once?
What are the top 20 most used words?

If you are interested in how many times Moby is in the book, here is the answer: Moby is in the book 90 times.

OK, so let's get started. If you want to follow along, you will need the Moby Dick book. Since Moby Dick is in the public domain, you can download the book for free from Project Gutenberg. The link is here: http://www.gutenberg.org/ebooks/2701

Make sure to grab the Plain Text UTF-8 version.

In Python, what do we need to get the words and their counts? First, we need some code that will read the text of the book into a variable.

It will look like the following:

with open(r'C:\Downloads\MobyDick.txt', 'r', encoding="utf8") as myfile:
    doc=myfile.read().replace('\n', ' ')

As you can see, we are also getting rid of the line feeds by replacing each \n with a space, so that words on adjacent lines do not get glued together. Make sure the encoding is set to utf8, otherwise you will get errors like the one below:

Traceback (most recent call last):
  File "c:\MapReduce.py", line 11, in <module>
    doc=myfile.read().replace('\n', ' ')
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7237: character maps to <undefined>
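
To see why the encoding matters, here is a small sketch that reproduces the same kind of error without the book. It assumes the offending character is a right curly quote (U+201D). That is only an assumption, since the traceback tells us nothing more than the byte value, but the UTF-8 encoding of that character does contain the byte 0x9d, which cp1252 cannot map.

```python
# Assumption: a right curly quote (U+201D) stands in for whatever
# character tripped up the original script.
raw = '\u201d'.encode('utf-8')  # the three bytes e2 80 9d

# Decoding the bytes as UTF-8 gives the original character back.
decoded = raw.decode('utf-8')

# Decoding the same bytes as cp1252 fails: 0x9d has no mapping there.
try:
    raw.decode('cp1252')
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(decoded == '\u201d', cp1252_failed)
```

When no encoding is passed to open on Windows, Python may fall back to a code page such as cp1252, which is why spelling out encoding="utf8" avoids the problem.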

Now that we have the file in a variable, we need to count the words. Here is what that function will look like:

def CountWords(text):
    output =''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies

The function lowercases every letter and replaces every non-alphabetic character with a space. After that, it loops through all the words produced by the split function and increments a counter for each one. The function then returns these key/value pairs as a dictionary.

In order to print the output on more than 1 line, we will use pprint. I already posted about pprint here: Summer of code 2017: Python, Pretty printing with pprint in Python
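
Here is a quick sketch of the function on a short sentence instead of the whole book. The sample sentence is made up for illustration. Note what happens to apostrophes: they become spaces too, which is why 's' and 't' show up as words of their own in the results further down.

```python
def CountWords(text):
    # Same function as in the post: lowercase letters, turn everything
    # else into a space, then count the resulting words.
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


# Hypothetical sample text, not taken from the book file.
sample = "Call me Ishmael. Ahab's whale; the WHALE, don't!"
counts = CountWords(sample)
print(counts)
```

In this sample, "whale" and "WHALE" are counted together, while "Ahab's" is split into 'ahab' and 's', and "don't" into 'don' and 't'.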

from pprint import pprint as pp

Finally, we need to sort the output so that the highest counts come first (words with the same count are sorted alphabetically) and limit the output to the first n entries. I have chosen 500 here.

pp(sorted(CountWords(doc).items(), key=lambda x: (-x[1], x[0]))[:500])
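
The sort key can be easier to follow on a tiny dictionary. The sketch below uses made-up counts, not numbers from the book. Negating the count makes the biggest counts sort first, and the word itself breaks ties alphabetically.

```python
# Made-up counts for illustration only.
counts = {'whale': 3, 'ahab': 3, 'the': 5, 'zig': 1}

# Sort by count descending (hence the minus sign), then by word ascending.
ranked = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
print(ranked)  # [('the', 5), ('ahab', 3), ('whale', 3), ('zig', 1)]

# Slicing keeps only the first n entries, here n = 2.
top_two = ranked[:2]
print(top_two)  # [('the', 5), ('ahab', 3)]
```

This tie-breaking is why 'for' comes before 'was' in the results, even though both have the same count.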

All of this together will look like this. Make sure to change the path to the file to match your computer's path.

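from pprint import pprint as pp


def CountWords(text):
    # Lowercase every letter and turn every non-letter into a space
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


# encoding="utf8" avoids the UnicodeDecodeError shown earlier
with open(r'C:\Downloads\MobyDick.txt', 'r', encoding="utf8") as myfile:
    doc = myfile.read().replace('\n', ' ')

# Sort by count (descending), then alphabetically, and keep the first 500
pp(sorted(CountWords(doc).items(), key=lambda x: (-x[1], x[0]))[:500])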
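
As an aside, and not what the script above uses, the standard library's collections.Counter can do the counting step for you. The sketch below is one possible variant. The function name count_words_counter and the sample sentence are made up for this example.

```python
from collections import Counter


def count_words_counter(text):
    # Same cleanup as CountWords, but Counter does the tallying.
    cleaned = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    return Counter(cleaned.split())


# Hypothetical sample text, not taken from the book file.
counts = count_words_counter("The whale, the ship and the sea")

# most_common(n) returns the n highest counts as (word, count) pairs.
print(counts.most_common(1))  # [('the', 3)]

# A Counter returns 0 for words it has never seen.
print(counts['moby'])  # 0
```

One difference to keep in mind: most_common does not sort ties alphabetically, so the explicit sorted call with the (-count, word) key is still the way to go if you want that ordering.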

Now let's ask those questions again, but this time we will have the answers as well.

What are the 20 most used words?
Here we go

('the', 14718),
('of', 6743),
('and', 6518),
('a', 4807),
('to', 4707),
('in', 4242),
('that', 3100),
('it', 2536),
('his', 2532),
('i', 2127),
('he', 1900),
('s', 1825),
('but', 1823),
('with', 1770),
('as', 1753),
('is', 1751),
('for', 1646),
('was', 1646),
('all', 1545),
('this', 1443)

How many distinct words in total?
17,148 distinct words (you need to remove the limit in order to get the full set back, just remove [:500])

How many words are used only once?
There are 7,416 words used only once. Here are some of them, starting with the letter z (again, remove [:500] to get the full set back):

('zag', 1),
('zay', 1),
('zealanders', 1),
('zephyr', 1),
('zeuglodon', 1),
('zig', 1),
('zip', 1),
('zogranda', 1),
('zoroaster', 1)
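
If you would rather have Python do the counting than scroll through the output, the two totals above can be computed from the dictionary directly. This is a minimal sketch on a made-up sentence. With the book loaded, you would pass doc to CountWords instead.

```python
def CountWords(text):
    # Same function as in the post.
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


# Hypothetical sample text standing in for the book.
frequencies = CountWords("the whale and the ship and the sea")

# Every key is a distinct word, so the dictionary size is the answer.
distinct = len(frequencies)

# Words used only once are the keys whose count is exactly 1.
once = sorted(word for word, n in frequencies.items() if n == 1)

print(distinct)   # 5
print(len(once))  # 3
print(once)       # ['sea', 'ship', 'whale']
```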

How many times does Captain Ahab's name appear in the book?
The word Ahab appears 517 times in the book

And I will leave you with the 100 most used words in the book
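
Looking up a single word does not need the sorted list at all. Here is a small sketch, again on a made-up sentence rather than the book. The one thing to remember is that the keys are all lowercase, so you have to ask for 'ahab', not 'Ahab'.

```python
def CountWords(text):
    # Same function as in the post.
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies


# Hypothetical sample text standing in for the book.
frequencies = CountWords("Ahab saw Moby Dick. Ahab chased the whale.")

# get() with a default of 0 avoids a KeyError for missing words.
print(frequencies.get('ahab', 0))      # 2
print(frequencies.get('Ahab', 0))      # 0, because keys are lowercase
print(frequencies.get('queequeg', 0))  # 0, not in the sample text
```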

('the', 14718),
('of', 6743),
('and', 6518),
('a', 4807),
('to', 4707),
('in', 4242),
('that', 3100),
('it', 2536),
('his', 2532),
('i', 2127),
('he', 1900),
('s', 1825),
('but', 1823),
('with', 1770),
('as', 1753),
('is', 1751),
('for', 1646),
('was', 1646),
('all', 1545),
('this', 1443),
('at', 1335),
('whale', 1244),
('by', 1227),
('not', 1172),
('from', 1105),
('on', 1073),
('him', 1068),
('so', 1066),
('be', 1064),
('you', 964),
('one', 925),
('there', 871),
('or', 798),
('now', 786),
('had', 779),
('have', 774),
('were', 684),
('they', 670),
('which', 655),
('like', 647),
('me', 633),
('then', 631),
('their', 620),
('are', 619),
('some', 619),
('what', 619),
('when', 607),
('an', 600),
('no', 596),
('my', 589),
('upon', 568),
('out', 539),
('man', 527),
('up', 526),
('into', 523),
('ship', 519),
('ahab', 517),
('more', 509),
('if', 501),
('them', 474),
('ye', 473),
('we', 470),
('sea', 455),
('old', 452),
('would', 432),
('other', 431),
('been', 415),
('over', 409),
('these', 406),
('will', 399),
('though', 384),
('its', 382),
('down', 379),
('only', 378),
('such', 376),
('who', 366),
('any', 364),
('head', 348),
('yet', 345),
('boat', 337),
('long', 334),
('time', 334),
('her', 332),
('captain', 329),
('here', 325),
('do', 324),
('very', 323),
('about', 318),
('still', 312),
('than', 311),
('chapter', 308),
('great', 307),
('those', 307),
('said', 305),
('before', 301),
('two', 298),
('has', 294),
('must', 293),
('t', 291),
('most', 285)
