Download all popular posts with score > 1000 for 2013 and 2014 (~5000)
Download Wikipedia
Cross-reference each post to the correct Wikipedia paragraph
Build true positives (known TILs)
Build decoys (other paragraphs in TILs)
Build unknown samples (rest of Wikipedia*)
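The three sample groups above can be sketched as a labeling pass over Wikipedia paragraphs. This is only an illustration: the function name `label_paragraphs`, the `(title, paragraph_index)` keys, and the reading of "decoys" as the other paragraphs of an article that a TIL points to are assumptions, not the talk's actual code.

```python
def label_paragraphs(articles, til_matches):
    """Label every paragraph as true positive, decoy, or unknown.

    articles:    dict mapping article title -> list of paragraph strings
    til_matches: set of (title, paragraph_index) pairs cited by a TIL post
    (both structures are hypothetical stand-ins for the real data)
    """
    # Articles that contain at least one paragraph cited by a TIL post.
    til_titles = {title for title, _ in til_matches}
    labels = {}
    for title, paragraphs in articles.items():
        for idx in range(len(paragraphs)):
            if (title, idx) in til_matches:
                labels[(title, idx)] = "true_positive"
            elif title in til_titles:
                # Same article as a TIL, but not the cited paragraph.
                labels[(title, idx)] = "decoy"
            else:
                labels[(title, idx)] = "unknown"
    return labels
```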
sqlite3, requests, bs4, pandas, numpy, scikit-learn, gensim, praw, wikipedia, nltk, stemming.porter2
>> "Good muffins cost $3.88\n in New York"
['Good', 'muffins', 'cost', 'TOKEN_MONEY', 'in', 'New', 'York', 'TOKEN_EOS']
>> "I sat on the rock"
['I', 'sat', 'on', 'rock']
>> stem("factionally")
'faction'
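A minimal sketch of the tokenizing and stopword steps shown above, using only the standard library. The regular expressions, the `TOKEN_MONEY` / `TOKEN_EOS` handling, and the tiny stopword set are assumptions chosen to reproduce the two examples, not the talk's actual preprocessing code. Stemming is left to the porter2 stemmer from the library list.

```python
import re

# Assumed pattern for dollar amounts such as "$3.88".
MONEY = re.compile(r"\$\d+(?:\.\d+)?")
# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an"}

def tokenize(text):
    """Replace money amounts with a placeholder and mark the end of the text."""
    text = MONEY.sub(" TOKEN_MONEY ", text)
    tokens = re.findall(r"[A-Za-z_]+", text)
    return tokens + ["TOKEN_EOS"]

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(tokenize("Good muffins cost $3.88\n in New York"))
print(remove_stopwords("I sat on the rock".split()))
```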
TF-IDF (term frequency-inverse document frequency)
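As a rough illustration of the weighting, the sketch below computes the textbook form, tf(t, d) * log(N / df(t)), on tokenized documents. The function name `tf_idf` and the toy documents are hypothetical. Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ from this version.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus: "cat" occurs in every document, so its weight is zero.
toy_docs = [["cat", "sat"], ["cat", "ran"]]
for doc_weights in tf_idf(toy_docs):
    print(doc_weights)
```

A term that appears in every document carries no information for telling documents apart, which is why its weight drops to zero in this unsmoothed form.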
>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
>>> model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
Uses far fewer features to store relationships between words!
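The `model.similarity` call above returns the cosine similarity of the two word vectors. A minimal sketch of that measure on plain Python lists follows. The two-dimensional vectors are toy values for illustration, not real embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (hypothetical): same direction, orthogonal, and in between.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```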
Training classifier
Test Accuracy: 0.878; Test Accuracy on TP: 0.116; Test Accuracy on TN: 0.998
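The gap between the three numbers above is what class imbalance tends to produce: when negatives dominate the test set, overall accuracy stays high even if few positives are found. The sketch below shows how such a breakdown can be computed, assuming "accuracy on TP" means the share of positive samples predicted positive and "accuracy on TN" the share of negative samples predicted negative. The labels are made-up toy data, not the talk's results.

```python
def accuracy_report(y_true, y_pred):
    """Return (overall accuracy, accuracy on positives, accuracy on negatives).

    Labels are 1 for a known TIL paragraph and 0 otherwise (assumed encoding).
    Both classes must be present in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    pos_preds = [p for t, p in pairs if t == 1]
    neg_preds = [p for t, p in pairs if t == 0]
    overall = sum(t == p for t, p in pairs) / len(pairs)
    acc_pos = sum(p == 1 for p in pos_preds) / len(pos_preds)
    acc_neg = sum(p == 0 for p in neg_preds) / len(neg_preds)
    return overall, acc_pos, acc_neg

# Toy data: 2 positives, 8 negatives; the classifier finds one positive.
y_true = [1, 1] + [0] * 8
y_pred = [1, 0] + [0] * 8
print(accuracy_report(y_true, y_pred))
```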
TIL The Founder Of Japans Mcdonalds Stated | 4726
TIL Mike Kurtz An American Burglar Found Out That | 4123
TIL A Woman That Reported 100 Incidents Of | 2899
TIL During The Sentencing Of His War Crimes Trial | 1551
TIL That Art Spiegelman The Creator Of Maus A | 1144
TIL That Once Officially Labeled As Retarded | 640
TIL Before World War Ii It Was Very Rare For | 498
TIL That A Study Showed Those With A Distressed | 142
TIL Frankie Fraser A Notorious English Gangster | 135
TIL Rafael Quintero A Mexican Drug Trafficker | 68
...
/u/possible_urban_king