today-AI-learned

https://github.com/thoppe/today-AI-learned



Travis Hoppe

Scientist + DC Hack && Tell


/u/possible_urban_king

Artificial Redditor


The goal


Train a machine to find

new & interesting things


Requires a corpus of interesting things...

Supervised learning

r/TIL, a subreddit short for Today I Learned

Keep only Wikipedia data

Filter for consistent writing style...

Data collection


Download all popular posts (score > 1000) from 2013 and 2014 (~5000 posts)


Download Wikipedia
Cross-reference each post to the correct Wikipedia paragraph
Build true positives (known TILs)
Build decoys (other paragraphs in the same articles as the TILs)
Build unknown samples (rest of Wikipedia*)
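The three sample types above can be sketched as a labeling pass over paragraphs. This is a minimal illustration, not the project's code: `label_paragraphs`, the `articles` mapping, and the `til_hits` mapping are hypothetical names, and it assumes the cross-referencing step has already produced paragraph indices.

```python
def label_paragraphs(articles, til_hits):
    """Label every paragraph as 'positive', 'decoy', or 'unknown'.

    articles: {title: [paragraph, ...]}          (hypothetical layout)
    til_hits: {title: {paragraph index, ...}}    (paragraphs matched to a TIL post)
    """
    labels = {}
    for title, paragraphs in articles.items():
        hits = til_hits.get(title, set())
        for i, _ in enumerate(paragraphs):
            if i in hits:
                labels[(title, i)] = "positive"   # known TIL
            elif hits:
                labels[(title, i)] = "decoy"      # same article, not the TIL paragraph
            else:
                labels[(title, i)] = "unknown"    # rest of Wikipedia
    return labels


articles = {"A": ["p0", "p1"], "B": ["q0"]}
til_hits = {"A": {1}}
labels = label_paragraphs(articles, til_hits)
```

Decoys are useful because they share topic and writing style with a true positive, so the classifier has to learn what makes one paragraph interesting rather than what makes one article popular.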



from python import science

sqlite3, requests, bs4, pandas, numpy, scikit-learn,
gensim, praw, wikipedia, nltk, stemming.porter2


*Assume that most of Wikipedia isn't interesting...

Data Wrangling

Tokenize

>> "Good muffins cost $3.88\n in New York"
['Good', 'muffins', 'cost', 'TOKEN_MONEY', 'in', 'New', 'York', 'TOKEN_EOS']
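A tokenizer with this behavior could look roughly like the sketch below. The regular expressions and the `tokenize` name are assumptions for illustration; the slide only shows the input and output, not how the project's tokenizer is written.

```python
import re

# Assumed patterns: dollar amounts become a placeholder token,
# and everything else is split into simple word tokens.
MONEY = re.compile(r"\$\d+(?:\.\d+)?")
WORD = re.compile(r"TOKEN_[A-Z]+|[A-Za-z0-9']+")


def tokenize(text):
    """Split text into word tokens, masking money and marking the end."""
    text = MONEY.sub(" TOKEN_MONEY ", text)
    return WORD.findall(text) + ["TOKEN_EOS"]


tokens = tokenize("Good muffins cost $3.88\n in New York")
print(tokens)
```

Replacing specific values such as prices with a placeholder keeps the vocabulary small, since the exact amount rarely matters for whether a fact is interesting.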

Remove "stop words"

>> "I sat on the rock"
['I', 'sat', 'on', 'rock']
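Stop-word removal is a plain filter over the token list. The stop list below is a tiny hypothetical one chosen to reproduce the example; a real list (for instance the one shipped with nltk) is much longer and would drop more words.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "a", "an"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


kept = remove_stop_words("I sat on the rock".split())
print(kept)
```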

Stem words

>> stem("factionally")
'faction'

"Entropy" vectors

Measures how unique each word is relative to the rest of the entry:
a local TF-IDF (term frequency-inverse document frequency)
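A minimal sketch of a local TF-IDF, assuming "local" means each paragraph is scored against the other paragraphs of the same article. The function name and the unsmoothed formula are assumptions; the slides do not give the exact weighting used.

```python
import math
from collections import Counter


def local_tfidf(paragraphs):
    """Score tokens per paragraph against the other paragraphs of ONE article.

    paragraphs: list of token lists. Returns one {token: weight} dict per paragraph.
    """
    n = len(paragraphs)
    doc_freq = Counter()
    for tokens in paragraphs:
        doc_freq.update(set(tokens))      # count paragraphs containing each token

    weights = []
    for tokens in paragraphs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            t: (c / total) * math.log(n / doc_freq[t])
            for t, c in counts.items()
        })
    return weights


weights = local_tfidf([["cat", "sat"], ["cat", "ran"]])
```

With this unsmoothed form, a word that appears in every paragraph of the article gets weight zero, and words confined to one paragraph get the highest weight.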

Feature generation

Used Word2Vec (developed by Google), weighted by local article TF-IDF


>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
>>> model['computer']  # raw numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

Uses far fewer features to store relationships between words!
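One plausible reading of "Word2Vec, weighted by local article TF-IDF" is a weighted average of word vectors per paragraph. The sketch below assumes that reading; `paragraph_vector` is a hypothetical helper, and the two-dimensional vectors are toy values standing in for a trained model.

```python
def paragraph_vector(tokens, vectors, weights):
    """Weighted average of word vectors; tokens without a vector are skipped.

    vectors: {token: [float, ...]}   (toy stand-in for a Word2Vec model)
    weights: {token: float}          (for example, local TF-IDF scores)
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for t in tokens:
        if t not in vectors:
            continue
        w = weights.get(t, 0.0)
        for i, x in enumerate(vectors[t]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total                      # all-zero vector when nothing is weighted
    return [x / weight_sum for x in total]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
toy_weights = {"a": 3.0, "b": 1.0}
features = paragraph_vector(["a", "b", "missing"], toy_vectors, toy_weights)
```

The result is a fixed-length feature vector per paragraph regardless of paragraph length, which is what a tree-based classifier needs as input.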


Also, TF-IDF shows Reddit is preoccupied with Hitler and Pokemon... funny story about that

Model training

Used Extremely Randomized Trees, a variant of the Random Forest classifier.
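In scikit-learn, Extremely Randomized Trees are available as `ExtraTreesClassifier`. The snippet below is a toy sketch on made-up two-dimensional points, not the project's training script; the data, `n_estimators=50`, and `random_state=0` are illustrative assumptions.

```python
from sklearn.ensemble import ExtraTreesClassifier

# Toy feature vectors (stand-ins for paragraph vectors) and labels:
# 1 = interesting (known TIL), 0 = not interesting.
X = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 0.9]]
y = [0, 0, 1, 1]

clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

predictions = clf.predict(X)          # hard class labels
probabilities = clf.predict_proba(X)  # per-class scores, useful for ranking candidates
```

Ranking unknown paragraphs by the positive-class probability, rather than using the hard label, is one way to pick the most promising candidates to post.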

Training classifier
Test Accuracy: 0.878;    Test Accuracy on TP: 0.116;   Test Accuracy on TN: 0.998
Receiver Operating Characteristic (ROC) curve
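The gap between the overall accuracy and the accuracy on true positives is what class imbalance looks like. The sketch below computes the three figures from labels and predictions, and, taking the reported numbers at face value, backs out the share of positives they imply. Both helper names are hypothetical.

```python
def class_accuracies(y_true, y_pred):
    """Return (overall accuracy, accuracy on positives, accuracy on negatives)."""
    pairs = list(zip(y_true, y_pred))
    pos = [p for t, p in pairs if t == 1]
    neg = [p for t, p in pairs if t == 0]
    overall = sum(t == p for t, p in pairs) / len(pairs)
    acc_pos = sum(p == 1 for p in pos) / len(pos)
    acc_neg = sum(p == 0 for p in neg) / len(neg)
    return overall, acc_pos, acc_neg


def implied_positive_share(overall, acc_pos, acc_neg):
    """Solve overall = s * acc_pos + (1 - s) * acc_neg for the positive share s."""
    return (acc_neg - overall) / (acc_neg - acc_pos)


toy = class_accuracies([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0])
share = implied_positive_share(0.878, 0.116, 0.998)
print(round(share, 3))
```

Under that reading, roughly one test sample in seven was a true positive, and the classifier is very conservative: it rarely flags anything, but it also rarely flags something that is not a known TIL.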

Does it work?

yes! look at all that sweet front-page karma...


TIL The Founder Of Japans Mcdonalds Stated | 4726
TIL Mike Kurtz An American Burglar Found Out That | 4123
TIL A Woman That Reported 100 Incidents Of | 2899
TIL During The Sentencing Of His War Crimes Trial | 1551
TIL That Art Spiegelman The Creator Of Maus A | 1144
TIL That Once Officially Labeled As Retarded | 640
TIL Before World War Ii It Was Very Rare For | 498
TIL That A Study Showed Those With A Distressed | 142
TIL Frankie Fraser A Notorious English Gangster | 135
TIL Rafael Quintero A Mexican Drug Trafficker | 68
...

AI vs Human (Turing test pt. 1)

I can do (almost) anything you can do better...

Turing test pt. 2

After three months and 60 submissions, I revealed to Reddit
the true nature of /u/possible_urban_king.
At which point the account was promptly banned from posting in r/todayIlearned ...
Thanks, anonymous moderator, for helping prove the test!

The Turing test is a necessary but not sufficient test for artificial intelligence.

Artificial Intelligence
Machine Learning

Thanks, you!


Meet The Man Who Gamed Reddit With A Bot