This week I spent my time developing repostBot. Understanding the purpose of this bot requires some knowledge of Reddit. On Reddit, the upvotes and downvotes on a post affect the total “karma” of whoever originally made it. Many users are interested in increasing the total karma associated with their account over time. This has given rise to a practice called “reposting.” After one user submits an interesting post, be it a piece of artwork, a funny joke, or a link to a well-informed article, another user will copy the content and post the exact same thing a week later in order to reap the karma rewards. This practice is generally frowned upon in the Reddit community, as it is considered lazy and reposters rarely give credit to the original post.
repostBot attempts to automatically detect these reposts and creates a comment that links back to the original post. To do this effectively, I needed a way to measure the similarity between two posts. For titles, this is easy, as Reddit has a built-in search feature that I can make use of, but I only want to label a submission as a repost if the post’s content is also the same (for example, if I detect two submissions with the title “Funny Cat” that link to different images, I do not want to label one a repost of the other). Luckily, Python’s difflib module includes a class called SequenceMatcher which does exactly this. Given two pieces of text, it returns a number from 0 to 1 representing their similarity: 1 means that the snippets are identical, and 0 means that they have nothing in common. As of right now, I label a post a repost of another if they share the same title and their content has a similarity rating of at least 0.95. This is just a temporary measure. Next week, I want to work on a model that uses reinforcement learning to adjust this threshold and find the optimal value for detecting reposts.
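To make the current rule concrete, here is a minimal sketch of the check, assuming SequenceMatcher is the one from Python's standard difflib module. The names `content_similarity` and `is_repost`, and the dictionary layout with `title` and `content` keys, are made up for illustration and are not the bot's actual code. The exact-match title comparison is also an assumption standing in for the Reddit search step.

```python
from difflib import SequenceMatcher

# Cutoff quoted in the post; a temporary value, not a tuned one.
SIMILARITY_THRESHOLD = 0.95


def content_similarity(a: str, b: str) -> float:
    """Return a similarity score between 0 and 1 for two pieces of text."""
    # The first argument is the optional "junk" filter; None disables it.
    return SequenceMatcher(None, a, b).ratio()


def is_repost(candidate: dict, original: dict,
              threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Hypothetical check: same title AND content similarity >= threshold.

    Both arguments are plain dicts with "title" and "content" keys
    (an assumed layout used only for this sketch).
    """
    # Assumption: titles are compared by exact equality after trimming.
    if candidate["title"].strip() != original["title"].strip():
        return False
    score = content_similarity(candidate["content"], original["content"])
    return score >= threshold


if __name__ == "__main__":
    first = {"title": "Funny Cat", "content": "My cat fell off the couch."}
    copy = {"title": "Funny Cat", "content": "My cat fell off the couch."}
    print(is_repost(copy, first))
```

Keeping the threshold as a parameter means the value can later be swapped out by whatever model ends up tuning it, without touching the comparison logic.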