Week 4

Creating a Bot Class

Lior Hirschfeld

This week I spent the first half of my time developing machine learning functionality for repostBot. The first step in this process was to determine which variables might allow a model to predict whether a post is a repost. As Reddit submissions can be of two types (link and text), I needed to treat each format differently. I settled on the following approach:

Link posts will only be considered a repost of an older post if they share the exact same URL and a similar title. The URL must be identical, not just similar, because a similar URL does not always represent similar content. For example, two articles on the New York Times’ website might cover completely different subject matter, and yet their links might be nearly identical. Because this rule is a direct comparison, the bot doesn’t even need to use a model to detect link reposts.
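
The rule above can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the dictionary keys ("url", "title"), the function names, and the 0.8 title threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Return a similarity ratio between 0.0 and 1.0 for two titles."""
    # Lowercasing is an assumption here, so capitalization alone
    # does not make two titles look different.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_link_repost(new_post, old_post, title_threshold=0.8):
    """Apply the link-post rule: identical URL and a similar title.

    The posts are plain dicts with hypothetical "url" and "title" keys,
    and the threshold value is an illustrative guess.
    """
    # The URL must match exactly; a merely similar URL is not enough.
    if new_post["url"] != old_post["url"]:
        return False
    return title_similarity(new_post["title"], old_post["title"]) >= title_threshold
```

Two posts that link to the same URL with the same title would be flagged, while two posts whose URLs differ by even a single character would not, no matter how close their titles are.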

Text posts are more complex, and there are a few variables to consider. As I mentioned in the previous week’s blog post, Python’s difflib module provides a class called SequenceMatcher, which can measure the similarity between two strings (pieces of text). However, just because the text of two posts is nearly identical does not necessarily mean that I should label one a repost of the other. Consider the subreddit /r/jokes. It’s possible that one person writes a short, two-line joke, and the next day someone writes the same joke but slightly changes the punchline so that it responds to, or makes fun of, the original joke (this is quite common on Reddit). The second post should not be thought of as a repost, but rather as the continuation of a conversation. It is also possible for two short posts to have similar text by random chance. For example, on the subreddit /r/test, there are thousands of posts with just the text “test,” and yet they are entirely unrelated to one another. Therefore, I also have the model consider post length, to discourage it from labeling short posts as reposts.
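
As a rough sketch of what those two inputs might look like, the snippet below computes a similarity score with SequenceMatcher and pairs it with a length value. The function name and the choice to use the length of the shorter post are assumptions for illustration; the model that would consume these features is not shown.

```python
from difflib import SequenceMatcher


def text_features(new_text, old_text):
    """Build a small feature list for a pair of text posts.

    Returns [similarity, shorter_length]. Using the shorter post's
    length is an assumption; any length measure could be substituted.
    """
    # ratio() is 1.0 for identical strings and approaches 0.0
    # as the strings share less and less in common.
    similarity = SequenceMatcher(None, new_text, old_text).ratio()
    shorter_length = min(len(new_text), len(old_text))
    return [similarity, shorter_length]


# Two unrelated /r/test posts look identical, but the tiny length
# gives a model a reason not to call them reposts.
print(text_features("test", "test"))
```

The point of the second feature is that a perfect similarity score on a four-character post carries far less evidence than the same score on a post of several paragraphs.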

My second task of the week was to create a class that performs common bot functionality. Ideally, I would have done this from the start, as it would have saved me from duplicating work on repostBot. The class currently has three methods:

  • A constructor, which initializes praw, id, and model variables.
  • updateIds, which saves the list of comment ids the bot has responded to into a pickle file.
  • updateModels, which saves each subreddit’s model into a pickle file.
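
A minimal sketch of such a class is shown below. It is not the actual implementation: the class name, file names, and the _load helper are hypothetical, and the praw client is passed in as an argument (rather than created inside the constructor) so that the example runs without Reddit credentials.

```python
import os
import pickle


class Bot:
    """Shared functionality for bots (illustrative sketch only)."""

    def __init__(self, reddit=None, ids_path="ids.pkl", models_path="models.pkl"):
        # `reddit` would normally be a praw.Reddit instance; it is
        # optional here so the sketch works without credentials.
        self.reddit = reddit
        self.ids_path = ids_path        # hypothetical file name
        self.models_path = models_path  # hypothetical file name
        # Restore saved state if the pickle files already exist.
        self.ids = self._load(ids_path, default=[])
        self.models = self._load(models_path, default={})

    @staticmethod
    def _load(path, default):
        """Return the unpickled object at `path`, or `default` if absent."""
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        return default

    def updateIds(self):
        """Save the list of comment ids the bot has responded to."""
        with open(self.ids_path, "wb") as f:
            pickle.dump(self.ids, f)

    def updateModels(self):
        """Save the per-subreddit models (a dict keyed by subreddit name)."""
        with open(self.models_path, "wb") as f:
            pickle.dump(self.models, f)
```

A bot built on this sketch would append to self.ids after each reply and call updateIds(), so that a restart does not cause it to respond to the same comment twice.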

Now that I have this class, both jargonBot and repostBot can import it and make use of its functionality. This makes the code cleaner, and it will make it easier for me to create more bots with a similar structure in the future.