How I Computer Science



Travis Hoppe, PhD

@metasemantic



Postdoctoral Fellow, National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD

C.S. (mis)conceptions

What I thought it was...
What it actually is...

Why I do it...

Outline


Science


Machine Learning


Public Relations

Science

My Background: PhD, MS Physics, BS Mathematics



Protein "Physics"

Folding, interfaces, aggregation, electrostatics, statistical mechanics, ...


Protein Structure

Primary structure (sequence)

GSIGAASMEF CFDVFKELKV HHANENIFYC PIAIMSALAM VYLGAKDSTR TQINKVVRFD KLPGFGDEIE AQCGTSVNVH 
SSLRDILNQI TKPNDVYSFS LASRLYAEER YPILPEYLQC VKELYRGGLE PINFQTAADQ ARELINSWVE SQTNGIIRNV 
LQPSSVDSQT AMVLVNAIVF KGLWEKAFKD EDTQAMPFRV TEQESKPVQM MYQIGLFRVA SMASEKMKIL ELPFASGTMS 
MLVLLPDEVS GLEQLESIIN FEKLTEWTSS NVMEERKIKV YLPRMKMEEK YNLTSVLMAM GITDVFSSSA NLSGISSAES 
LKISQAVHAA HAEINEAGRE VVGGAEAGVD AASVSEEFRA DHPFLFCIKH IATNAVLFFG RCVSP
Secondary structure
helices [red], sheets [blue]
Tertiary structure
3D structure
Higher-order structure
complexes, aggregation


Ovalbumin, Egg white protein PDB:1OVA, Crystal Structure, Carrell et al., J. Mol. Biol. (1991)
SEM Aggregate structure, Zabik et al., J. Poul. Sci. (1980)

Protein interactions


Folding
Binding, dimerization
Aggregation, Fibril formation

Protein interactions in a
crowded environment


How do we do it?

Statistical Potentials: Residue-residue interactions




Potentials constructed from Top 8000 Protein Database, Richardson Group

Residue-residue interaction matrix, MJ


Other statistical potentials: Tanaka and Scheraga (1976), Sippl (1990), Miyazawa and Jernigan (1996),
Betancourt and Thirumalai (1999), Skolnick, Kolinski and Ortiz (2000)
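
One common way to build a residue-residue potential is inverse-Boltzmann counting: compare how often two residue types are seen in contact with how often they would be if contacts were random, and take the negative log of the ratio. The sketch below assumes a hypothetical list of contacting residue pairs as input; `statistical_potential`, `contacts`, and `kT` are illustrative names, and this is not the exact Miyazawa-Jernigan procedure.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def statistical_potential(contacts, kT=1.0):
    """Inverse-Boltzmann contact energies from a list of (res_a, res_b) pairs.

    Negative energy: the pair is in contact more often than chance.
    Positive energy: the pair is in contact less often than chance.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter(res for pair in contacts for res in pair)
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())

    energies = {}
    for a, b in combinations_with_replacement(sorted(residue_counts), 2):
        freq_a = residue_counts[a] / total_residues
        freq_b = residue_counts[b] / total_residues
        # Unordered pairs of two different types can occur two ways.
        multiplicity = 1 if a == b else 2
        expected = multiplicity * freq_a * freq_b * total_pairs
        observed = pair_counts.get((a, b), 0)
        if observed == 0:
            continue  # no data for this pair, leave it undefined
        energies[(a, b)] = -kT * math.log(observed / expected)
    return energies


# Toy data (made up): hydrophobic L pairs with L, charged K pairs with E.
toy_contacts = [("L", "L")] * 4 + [("K", "E")] * 4 + [("L", "K"), ("L", "E")]
toy_energies = statistical_potential(toy_contacts)
```

On the toy data the L-L and E-K entries come out negative (favorable) and the L-K entry positive, the same kind of hydrophobic/polar/charged pattern the MJ matrix shows on real data.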

MJ matrix reveals biophysical structure

H (hydrophobic), P (polar), C (charged)

Higher order structure

Phase separations lead to sudden changes in liquid structure.

Leibler, Nature 2004
Tanaka, Phys. Rev. E 2005

How do we model many protein-protein interactions?
Can we predict aggregates from experimental structure?

Human serum albumin
PDB:1AO6
Ovalbumin
PDB:1OVA
Lysozyme
PDB:1W6Z
Bovine Serum Albumin
PDB:3V03

Where does CS come into play?


Be able to say what is possible, and what isn't!


Algorithmic design, ex. linear algebra, molecular dynamics...


Hardware design, specialized hardware, ex. Anton, GRAPE.


Predicting run-time (non-trivial at model stage!).


Scaling up!

Machine Learning



Meet The Man Who Gamed Reddit With A Bot



The goal


Train a machine to find

new & interesting things


Requires a corpus of interesting things...

Supervised learning

r/TIL, a subreddit short for Today I Learned

Keep only Wikipedia data

Filter for consistent writing style...

Data collection


Download Wikipedia
Download all posts with score>1000 for 2013 and 2014 (~5000)
Cross-reference each post to the correct Wikipedia paragraph
Build True positives (known TIL's)
Build Decoys (other paragraphs in TIL's)
Build unknown samples (rest of Wikipedia*)
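
The cross-referencing and labeling steps above can be sketched as follows. The slides do not say how a post was matched to its paragraph, so this assumes a simple word-overlap (Jaccard) score; `label_paragraphs` and the example text are hypothetical.

```python
import re


def words(text):
    """Lowercased set of alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two word sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_paragraphs(post_title, paragraphs):
    """Split an article's paragraphs into one true positive and the decoys.

    Assumption: the paragraph sharing the most words with the post title
    is the one the post was written from.
    """
    title_words = words(post_title) - {"til"}
    scores = [jaccard(title_words, words(p)) for p in paragraphs]
    best = max(range(len(paragraphs)), key=scores.__getitem__)
    decoys = [p for i, p in enumerate(paragraphs) if i != best]
    return paragraphs[best], decoys


article = [
    "Ovalbumin is the main protein found in egg white.",
    "It was one of the first proteins to be isolated in pure form.",
    "The function of ovalbumin is unknown.",
]
positive, decoys = label_paragraphs(
    "TIL that ovalbumin is the main protein in egg white", article
)
```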


from python import science

sqlite3, requests, bs4, pandas, numpy, scikit-learn,
gensim, praw, wikipedia, nltk, stemming.porter2


*Assume that most of Wikipedia isn't interesting...

Data Wrangling

Tokenize

>> "Good muffins cost $3.88\n in New York"
['Good', 'muffins', 'cost', 'TOKEN_MONEY', 'in', 'New', 'York', 'TOKEN_EOS']
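
A minimal sketch of a tokenizer that reproduces the output above, assuming dollar amounts are replaced by a placeholder and an end-of-sentence marker is appended. The regular expressions and the `tokenize` name are illustrative, not the code used for the project; note that this version drops any other numbers and punctuation.

```python
import re

# A dollar sign followed by digits, with an optional decimal part.
MONEY = re.compile(r"\$\d+(?:\.\d+)?")


def tokenize(sentence):
    """Split a sentence into word tokens with special placeholder tokens."""
    text = MONEY.sub(" TOKEN_MONEY ", sentence)
    tokens = re.findall(r"[A-Za-z_]+", text)
    return tokens + ["TOKEN_EOS"]


tokens = tokenize("Good muffins cost $3.88\n in New York")
```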

Remove "stop words"

>> "I sat on the rock"
['I', 'sat', 'on', 'rock']
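
Stop-word removal is a simple filter against a fixed word list. The sketch below uses a tiny hypothetical list chosen to match the example above; a full list such as the one shipped with nltk is much longer and would also drop words like "I" and "on".

```python
# Hypothetical, deliberately short stop-word list (articles only).
STOP_WORDS = {"the", "a", "an"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [t for t in tokens if t.lower() not in stop_words]


filtered = remove_stop_words("I sat on the rock".split())
```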

Stem words

>> stem("factionally")
'faction'
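
To show the idea behind stemming, here is a naive suffix stripper. It is only a toy: the Porter2 algorithm in stemming.porter2 applies many more rules and exceptions, and the suffix list and `naive_stem` name below are made up for illustration.

```python
# Hypothetical suffix list; a real stemmer uses ordered rule sets instead.
SUFFIXES = ("ally", "ing", "ly", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

With this list, `naive_stem("factionally")` agrees with the Porter2 result above, but the two will disagree on plenty of other words.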

"Entropy" vectors

measures how unique each word is relative to the rest of the entry:
a local TF-IDF (term frequency-inverse document frequency)
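
A minimal sketch of a "local" TF-IDF, assuming each paragraph of one entry is treated as a document and the entry as the corpus. The exact weighting used in the project is not given in the slides; this uses the plain tf * log(N / df) form, and `local_tfidf` is an illustrative name.

```python
import math
from collections import Counter


def local_tfidf(paragraphs):
    """TF-IDF weights per paragraph, computed within a single entry.

    paragraphs: list of token lists, one per paragraph of the entry.
    Returns one {word: weight} dict per paragraph.
    """
    n_paragraphs = len(paragraphs)
    # Document frequency: in how many paragraphs does each word appear?
    doc_freq = Counter(word for tokens in paragraphs for word in set(tokens))
    weights = []
    for tokens in paragraphs:
        term_freq = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_paragraphs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


entry = [
    ["egg", "protein", "fold"],
    ["egg", "white", "egg"],
    ["egg", "shell"],
]
entry_weights = local_tfidf(entry)
```

A word that appears in every paragraph ("egg" here) gets weight zero, while a word confined to one paragraph gets the highest weight.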

Feature generation

Used Word2Vec (developed by Google),
weighted by local article TF-IDF


>>> model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
>>> model['computer']  # raw numpy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

Uses far fewer features to store relationships between words!
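
One way to turn word vectors into a single feature vector per paragraph is a TF-IDF-weighted average. The sketch below assumes that form (the slides only say "weighted by local article TF-IDF") and uses made-up 2-dimensional vectors in place of a trained Word2Vec model.

```python
def weighted_paragraph_vector(tokens, vectors, weights):
    """Weighted average of word vectors for one paragraph.

    vectors: {word: list of floats}, e.g. looked up from a trained model.
    weights: {word: float}, e.g. local TF-IDF weights.
    Words missing from either mapping are skipped.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token in vectors and token in weights:
            w = weights[token]
            total = [t + w * v for t, v in zip(total, vectors[token])]
            weight_sum += w
    if weight_sum == 0.0:
        return total  # nothing matched: return the zero vector
    return [t / weight_sum for t in total]


# Toy inputs (hypothetical): "queen" counts three times as much as "king".
toy_vectors = {"king": [1.0, 0.0], "queen": [0.0, 1.0]}
toy_weights = {"king": 1.0, "queen": 3.0}
feature = weighted_paragraph_vector(["king", "queen", "unknown"], toy_vectors, toy_weights)
```

The result has a fixed length (the embedding dimension) no matter how long the paragraph is, which is what makes it usable as classifier input.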

Model training

Used Extremely Randomized Trees, a variant of the Random Forest classifier.
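
A minimal sketch of training this kind of classifier with scikit-learn's ExtraTreesClassifier. The data here is synthetic and the class balance, feature count, and hyperparameters are assumptions for illustration, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paragraph features: few positives, many negatives.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = ExtraTreesClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)             # fraction classified correctly
positive_scores = clf.predict_proba(X_test)[:, 1]  # estimated P(interesting)
```

Ranking the unknown samples by `positive_scores` rather than using the hard prediction is one way to pick the most promising candidates.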

Training classifier
Test Accuracy: 0.878;    Test Accuracy on TP: 0.116;   Test Accuracy on TN: 0.998
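
These three numbers are tied together by the class balance. Assuming "accuracy on TP" and "accuracy on TN" mean the fraction of actual positives and actual negatives classified correctly, overall accuracy is acc = f * tpr + (1 - f) * tnr, where f is the fraction of positives in the test set. The sketch below computes the per-class numbers and solves for f; the function names are illustrative.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, accuracy on positives, accuracy on negatives)."""
    pos_preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg_preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    tpr = sum(p == 1 for p in pos_preds) / len(pos_preds)
    tnr = sum(p == 0 for p in neg_preds) / len(neg_preds)
    return accuracy, tpr, tnr


def implied_positive_fraction(accuracy, tpr, tnr):
    """Solve accuracy = f * tpr + (1 - f) * tnr for the positive fraction f."""
    return (tnr - accuracy) / (tnr - tpr)


# Numbers from the slide; the result is about 0.136.
positive_fraction = implied_positive_fraction(0.878, 0.116, 0.998)
```

Under that reading, roughly 14% of the test samples were positives, and the high overall accuracy mostly reflects correctly rejected negatives.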
Receiver Operating Characteristic
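
An ROC curve is built by sweeping the decision threshold over the classifier scores and recording the false positive rate against the true positive rate. A minimal sketch on made-up labels and scores (scikit-learn provides equivalent routines; the function names here are illustrative):

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs, highest threshold first."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples tied at this score together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = area_under_curve(roc_points(toy_labels, toy_scores))
```

An area of 1.0 means every positive outranks every negative; 0.5 is what random scores give.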

Does it work?

yes! look at all that sweet front-page karma...


TIL The Founder Of Japans Mcdonalds Stated | 4726
TIL Mike Kurtz An American Burglar Found Out That | 4123
TIL A Woman That Reported 100 Incidents Of | 2899
TIL During The Sentencing Of His War Crimes Trial | 1551
TIL That Art Spiegelman The Creator Of Maus A | 1144
TIL That Once Officially Labeled As Retarded | 640
TIL Before World War Ii It Was Very Rare For | 498
TIL That A Study Showed Those With A Distressed | 142
TIL Frankie Fraser A Notorious English Gangster | 135
TIL Rafael Quintero A Mexican Drug Trafficker | 68
...

AI vs. Human (Turing test pt. 1)

I can do (almost) anything you can do better...

Turing test pt. 2

After three months and 60 submissions, I revealed to Reddit
the true nature of /u/possible_urban_king.
The account was promptly banned from r/todayIlearned ...
thanks anonymous moderator for helping prove the test!

The Turing test is a necessary but not

sufficient test for artificial intelligence.

Artificial Intelligence
Machine Learning

Where does CS come into play?


Natural language processing (NLP).


Supervised and unsupervised learning.


Knowing the right algorithm and its limitations...


Validation and statistics.

Public Relations

Build a portfolio

Network with others

Before you start ...
and once you get out there...

Advertise yourself!


Learn from others!


... computer science is more than just code ...

Learn from others / help others!

Stack Overflow

Challenge yourself!


Project Euler (PE): Math challenges that require coding.
Kaggle: Machine learning for profit!

TopCoder (TC): Mini-Hackathons and prizes!
HackerRank (HR): Used in interviews.

Share your code!

github

Meetups and Hackathons

Meetup

Shameless plug and Extra Credit!

DC Hack && Tell

Next event October 13th!

Thanks, you!



For class participation credit, fill out this questionnaire:


Presentation Review

http://bit.ly/1KVprYC
