As explained in my Summer of code 2017: Python post, I decided to pick up Python.
This is officially day 44. Today I wanted to see if I could get a Python script to return all the words in the book Moby Dick, along with the number of times each one occurs.
Some interesting things you might want to know:
How many times is Moby used in the book?
How many distinct words in total?
How many words are used only once?
What are the top 20 most used words?
If you are interested in how many times Moby is in the book... here is the answer
As you can see, Moby is in the book 90 times.
OK, let's get started. If you want to follow along, you will need the Moby Dick book. Since Moby Dick is in the public domain, you can download it for free from Project Gutenberg: http://www.gutenberg.org/ebooks/2701
Make sure to grab the Plain Text UTF-8 version.
In Python, what do we need to get the words and their counts? First, we need to read the text of the book into a variable.
It will look like the following:
with open(r'C:\Downloads\MobyDick.txt', 'r', encoding="utf8") as myfile:
    doc = myfile.read().replace('\n', ' ')
As you can see, we are also stripping the line feeds by replacing each \n with a space. Make sure the encoding is set to utf8, otherwise you will get an error like the one below:
Traceback (most recent call last):
  File "c:\MapReduce.py", line 11, in <module>
    doc=myfile.read().replace('\n', ' ')
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7237: character maps to <undefined>
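To see where an error like this can come from, here is a small sketch. It assumes the offending character is a curly closing quote, which is only one likely candidate; I have not checked what is actually at position 7237 in the file. On Windows, open() without an encoding argument typically falls back to cp1252, and that code page has no character for byte 0x9d.

```python
# A right double quotation mark (U+201D) takes three bytes in UTF-8,
# and the last of those bytes is 0x9d.
raw = "\u201d".encode("utf-8")
print(raw)

# Decoding the bytes as UTF-8 gives the original character back.
decoded_ok = raw.decode("utf-8")

# Decoding the same bytes as cp1252 fails, because 0x9d is unassigned there.
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError as err:
    cp1252_failed = True
    print(err)
```

Passing encoding="utf8" to open() avoids the problem because the file is then decoded with the same encoding it was saved in.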
Now that we have the file in a variable, we need to count the words. Here is what that function will look like:
def CountWords(text):
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies
The function lowercases every letter and replaces every non-alphabetic character with a space. After that, it loops through all the words created by the split function and increments the counter for each one. The function then returns a dictionary that maps each word to its count.
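As an aside, the same counting can be sketched with collections.Counter from the standard library. This is an alternative, not the version used in this post, and the name count_words_counter is made up for the example. The sample sentence also shows a side effect of the cleanup step: an apostrophe becomes a space, so a word like "Don't" is split into "don" and "t".

```python
from collections import Counter

def count_words_counter(text):
    # Same cleanup as CountWords: lowercase letters, everything else becomes a space.
    cleaned = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    # Counter tallies how many times each word appears in the list.
    return Counter(cleaned.split())

# A made-up sample sentence, not a quote from the book.
sample = "Call me Ishmael. Don't call me Ahab!"
counts = count_words_counter(sample)
print(counts)
```

That splitting is why single letters such as s and t show up as "words" in the results.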
In order to print the output on more than one line, we will use pprint. I already posted about pprint here: Summer of code 2017: Python, Pretty printing with pprint in Python
from pprint import pprint as pp
Finally, we need to sort the output by count in descending order (words with the same count are sorted alphabetically) and limit the output to the first n items. I have chosen 500 here.
pp(sorted(CountWords(doc).items(), key=lambda x: (-x[1], x[0]))[:500])
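Here is a small sketch of how that sort key behaves. The dictionary below is made up for the example and does not contain real counts from the book. Negating the count puts the highest counts first, and the word itself breaks ties alphabetically.

```python
# Made-up counts, only to show the ordering.
frequencies = {"whale": 3, "ahab": 3, "sea": 5, "ship": 1}

# Sort by count descending (-x[1]), then by word ascending (x[0]).
ranked = sorted(frequencies.items(), key=lambda x: (-x[1], x[0]))

# Slicing keeps only the first n items, here n = 2.
top_two = ranked[:2]
print(ranked)
print(top_two)
```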
All of this together will look like this. Make sure to change the path to the file to match the path on your computer.
from pprint import pprint as pp

def CountWords(text):
    # Lowercase the letters and turn everything else into a space
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies

# encoding="utf8" is needed to avoid the UnicodeDecodeError shown earlier
with open(r'C:\Downloads\MobyDick.txt', 'r', encoding="utf8") as myfile:
    doc = myfile.read().replace('\n', ' ')

pp(sorted(CountWords(doc).items(), key=lambda x: (-x[1], x[0]))[:500])
Now let's ask those questions again, but this time we will have the answers as well
What are the 20 most used words?
Here we go
('the', 14718),
('of', 6743),
('and', 6518),
('a', 4807),
('to', 4707),
('in', 4242),
('that', 3100),
('it', 2536),
('his', 2532),
('i', 2127),
('he', 1900),
('s', 1825),
('but', 1823),
('with', 1770),
('as', 1753),
('is', 1751),
('for', 1646),
('was', 1646),
('all', 1545),
('this', 1443)
How many distinct words in total?
There are 17,148 distinct words (to get the full set back, remove the [:500] limit).
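If you only want the number and not the whole list, one way to get it is to take the length of the dictionary that CountWords returns. This is a minimal sketch on a made-up sample sentence, so the numbers are not from the book.

```python
def CountWords(text):
    # Same function as in the post.
    output = ''.join(c.lower() if c.isalpha() else ' ' for c in text)
    frequencies = {}
    for word in output.split():
        frequencies[word] = frequencies.get(word, 0) + 1
    return frequencies

# A made-up sample sentence, not a quote from the book.
sample = "The sea, the ship, and the whale."
frequencies = CountWords(sample)

# Each key is one distinct word, so the number of keys is the distinct count.
distinct = len(frequencies)
# Adding up all the counts gives the total number of words.
total = sum(frequencies.values())
print(distinct, total)
```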
How many words are used only once?
There are 7,416 words used only once. Here are some of them, starting with the letter z (to get the full set back, remove the [:500] limit):
('zag', 1),
('zay', 1),
('zealanders', 1),
('zephyr', 1),
('zeuglodon', 1),
('zig', 1),
('zip', 1),
('zogranda', 1),
('zoroaster', 1)
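The words used only once can also be picked out directly instead of scrolling to the end of the sorted output. Here is a sketch, again with a made-up dictionary instead of the real counts.

```python
# Made-up counts, only to show the filtering.
frequencies = {"the": 4, "zig": 1, "zag": 1, "whale": 2, "zephyr": 1, "ahab": 1}

# Keep the words whose count is exactly 1, in alphabetical order.
once = sorted(word for word, count in frequencies.items() if count == 1)

# Narrow that list down to the words starting with z.
once_z = [word for word in once if word.startswith("z")]
print(len(once), once_z)
```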
How many times does Captain Ahab's name appear in the book?
Ahab's name appears 517 times in the book (this is the count for the word ahab).
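To look up a single word like this, you can ask the dictionary for it directly. Below is a sketch with a hypothetical helper called occurrences and made-up counts. Since CountWords lowercases everything, the lookup has to lowercase the word as well.

```python
def occurrences(frequencies, word):
    # The keys are all lowercase, so lowercase the word before looking it up.
    # A word that never appears gets a count of 0 instead of raising a KeyError.
    return frequencies.get(word.lower(), 0)

# Made-up counts, only to show the lookup.
frequencies = {"ahab": 3, "whale": 5}
ahab_count = occurrences(frequencies, "Ahab")
missing = occurrences(frequencies, "Starbuck")
print(ahab_count, missing)
```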
And I will leave you with the 100 most used words in the book:
('the', 14718),
('of', 6743),
('and', 6518),
('a', 4807),
('to', 4707),
('in', 4242),
('that', 3100),
('it', 2536),
('his', 2532),
('i', 2127),
('he', 1900),
('s', 1825),
('but', 1823),
('with', 1770),
('as', 1753),
('is', 1751),
('for', 1646),
('was', 1646),
('all', 1545),
('this', 1443),
('at', 1335),
('whale', 1244),
('by', 1227),
('not', 1172),
('from', 1105),
('on', 1073),
('him', 1068),
('so', 1066),
('be', 1064),
('you', 964),
('one', 925),
('there', 871),
('or', 798),
('now', 786),
('had', 779),
('have', 774),
('were', 684),
('they', 670),
('which', 655),
('like', 647),
('me', 633),
('then', 631),
('their', 620),
('are', 619),
('some', 619),
('what', 619),
('when', 607),
('an', 600),
('no', 596),
('my', 589),
('upon', 568),
('out', 539),
('man', 527),
('up', 526),
('into', 523),
('ship', 519),
('ahab', 517),
('more', 509),
('if', 501),
('them', 474),
('ye', 473),
('we', 470),
('sea', 455),
('old', 452),
('would', 432),
('other', 431),
('been', 415),
('over', 409),
('these', 406),
('will', 399),
('though', 384),
('its', 382),
('down', 379),
('only', 378),
('such', 376),
('who', 366),
('any', 364),
('head', 348),
('yet', 345),
('boat', 337),
('long', 334),
('time', 334),
('her', 332),
('captain', 329),
('here', 325),
('do', 324),
('very', 323),
('about', 318),
('still', 312),
('than', 311),
('chapter', 308),
('great', 307),
('those', 307),
('said', 305),
('before', 301),
('two', 298),
('has', 294),
('must', 293),
('t', 291),
('most', 285)