Week 1

An Introduction

Lior Hirschfeld

To write each of my bots in this IP, I have used a code library called “The Python Reddit API Wrapper,” or PRAW for short. It allows me to directly control a Reddit account to browse the site and even create posts or comments. Although the functionality I’m looking for would technically be possible through web scraping (i.e., accessing and navigating a website’s underlying code), PRAW saves me an enormous amount of time. Instead of slogging through Reddit’s code to figure out exactly how I must interact with a page to create a comment, I can simply call a single PRAW method, “reply,” to achieve the same result.
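As a rough sketch of what that looks like in practice, the snippet below wraps a single PRAW call in a helper function. The credentials and the comment ID are placeholders, and the function name is my own for illustration; only the PRAW calls themselves (Reddit, comment, and reply) come from the library.

```python
def reply_to_comment(comment_id, text):
    """Reply to a Reddit comment via PRAW. Credentials are placeholders."""
    import praw  # third-party library: pip install praw

    reddit = praw.Reddit(
        client_id="CLIENT_ID",          # placeholder app credentials
        client_secret="CLIENT_SECRET",
        user_agent="JargonBot sketch",
        username="USERNAME",            # placeholder account login
        password="PASSWORD",
    )
    comment = reddit.comment(id=comment_id)
    return comment.reply(text)
```

With a configured account, replying is one call: reply_to_comment("abc123", "Definition: ..."). Everything else (authentication, request formatting, rate limiting) is handled by the library, which is exactly the time savings described above.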

This week I used PRAW to make a bot called “JargonBot.” I was first inspired to make JargonBot after seeing several comments on Reddit that used particularly obscure vocabulary. Almost every time, at least one reply would explain the meaning of the difficult word. I thought it would be great if an automatic script performed this task, detecting difficult words and defining them for users with a smaller vocabulary or who speak English as a second language. This week I worked on a basic implementation of this bot.

The largest challenge in this process was deciding which words are complex enough to deserve a definition. Initially, I found an online list of a million words ordered by rate of use. This let me set a threshold, for example 100,000, beyond which I would define all words. I quickly realized that this strategy had flaws: it would end up defining rare conjugations of common words. For example, plural forms of nouns all appear less common, even though they are no less likely to be understood.

To overcome this, I used a language processing tool called a “stemmer,” which attempts to reduce each word to its most basic form. Instead of looking up the rarity of each word I found on Reddit, I stemmed it and associated it with the most common word sharing that stem. For example, if I detected the word “electricians,” positioned at 120,000 in popularity, it would be analyzed as though I had found “electric,” the ten-thousandth most common word, and therefore it would not be defined.

Finally, I had to consider that different subsections of the Reddit community speak differently. The word “anesthesiologist” might not be particularly rare on /r/medicine, but it would be on /r/books.
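The rarity check described above can be sketched as follows. This is a minimal illustration, not the bot’s actual code: the tiny rank table stands in for the million-word list, and the crude suffix-stripping function stands in for a real NLP stemmer (such as NLTK’s Porter stemmer). All names here are my own.

```python
RANKS = {                       # toy excerpt of a rank-ordered word list
    "electric": 10_000,         # rank = position in the frequency list
    "electrician": 60_000,
    "electricians": 120_000,
}
THRESHOLD = 100_000             # define any word ranked beyond this


def stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ians", "ian", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def effective_rank(word):
    """Rank a word by the most common listed word sharing its stem."""
    target = stem(word)
    candidates = [rank for w, rank in RANKS.items() if stem(w) == target]
    return min(candidates, default=None)


def should_define(word):
    """Only define words whose effective rank is beyond the threshold."""
    rank = effective_rank(word)
    return rank is not None and rank > THRESHOLD
```

Here should_define("electricians") returns False: even though “electricians” itself sits at rank 120,000, it shares a stem with “electric” at rank 10,000, so it is treated as a common word.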
To account for this discrepancy, before the bot runs on a subreddit, it analyzes the 10,000 most common words used there and avoids defining any word on that list.
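A sketch of that per-subreddit filter is below, assuming the bot has already fetched a batch of comment bodies from the subreddit (the function name and the tokenizing regex are my own choices for illustration):

```python
from collections import Counter
import re


def common_words(comment_bodies, top_n=10_000):
    """Build the set of the top_n most frequent words in a subreddit's
    comments; the bot skips defining any word in this set."""
    counts = Counter()
    for body in comment_bodies:
        counts.update(re.findall(r"[a-z']+", body.lower()))
    return {word for word, _ in counts.most_common(top_n)}


# Toy example: on a medical subreddit, "anesthesiologist" is frequent
# enough to land on the common-word list, so it would not be defined.
medical = common_words(
    ["The anesthesiologist arrived.", "Ask your anesthesiologist first."],
    top_n=3,
)
```

Since “anesthesiologist” appears twice in this toy sample while every other word appears once, it makes the top-3 list, so the bot would leave it undefined on that subreddit even though it is rare sitewide.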