Gathered all tweets that contained any of the top 200 emoji.
Approximately 80,000 tweets per hour, 13,000,000 in total.
Removed exactly identical tweets.
Removed tweets that differ only by a trailing index:
"Hello baby @justinbieber Can u follow me? ♥ x37"
"Hello baby @justinbieber Can u follow me? ♥ x38"
"Hello baby @justinbieber Can u follow me? ♥ x39"
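A minimal sketch of how this near-duplicate filter might look. The regular expression and the helper names (dedupe_key, drop_indexed_duplicates) are assumptions for illustration, not the code used in the talk: each tweet is normalized by stripping a trailing counter such as "x37", and only the first tweet per normalized key is kept.

```python
import re

# Hypothetical pattern: a trailing counter such as "x37" or "37".
_TRAILING_INDEX = re.compile(r"\s*x?\d+\s*$", re.IGNORECASE)

def dedupe_key(tweet):
    """Normalize a tweet so copies that differ only by a trailing index collide."""
    return _TRAILING_INDEX.sub("", tweet).strip().lower()

def drop_indexed_duplicates(tweets):
    """Keep the first tweet seen for each normalized key, preserving order."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = dedupe_key(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept

tweets = [
    "Hello baby @justinbieber Can u follow me? ♥ x37",
    "Hello baby @justinbieber Can u follow me? ♥ x38",
    "Hello baby @justinbieber Can u follow me? ♥ x39",
    "Good morning ☀",
]
unique = drop_indexed_duplicates(tweets)
```

Caveat: this pattern also strips a legitimate trailing number (for example a tweet ending in "I am 21"), so two tweets differing only in that number would be merged. A stricter pattern may be preferable on real data.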
Built a pipeline for repeatable processing:
remove_mentions, remove_urls, HTML_symbols, remove_apostrophe, space_symbols, special_lowercase, replace_emoji, limit_character_subset, remove_repeated_tokens, remove_twitter_mentions_hashtags, remove_emoji_modifier
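One way such a pipeline could be wired up, shown for four of the steps only. The step names come from the list above, but every function body here is a guess at the behaviour, and run_pipeline is a hypothetical helper. The point is only that each step is a plain string-to-string function, so the order is explicit and the whole run is repeatable.

```python
import html
import re
from functools import reduce

def remove_mentions(text):
    # Assumed behaviour: drop @username tokens.
    return re.sub(r"@\w+", "", text)

def remove_urls(text):
    # Assumed behaviour: drop http(s) links.
    return re.sub(r"https?://\S+", "", text)

def html_symbols(text):
    # Assumed behaviour: decode HTML entities such as "&amp;".
    return html.unescape(text)

def remove_repeated_tokens(text):
    # Assumed behaviour: collapse immediately repeated tokens;
    # as a side effect this also normalizes whitespace.
    out = []
    for token in text.split():
        if not out or out[-1] != token:
            out.append(token)
    return " ".join(out)

# Steps are applied in list order.
PIPELINE = [remove_mentions, remove_urls, html_symbols, remove_repeated_tokens]

def run_pipeline(text, steps=PIPELINE):
    """Feed the text through each step in turn."""
    return reduce(lambda acc, step: step(acc), steps, text)

cleaned = run_pipeline(
    "Hello baby @justinbieber check https://t.co/abc &amp; follow follow me ♥ ♥"
)
```

Keeping the steps in a list makes it easy to reorder, drop, or unit-test individual stages.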
Special care: Emoji can carry a skin tone modifier, which counts as an extra character.
TIL: Fitzpatrick is the name of the skin tone scale.
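A small sketch of why this matters and how a step like remove_emoji_modifier might handle it. The implementation is an assumption. It relies on the fact that Unicode encodes the Fitzpatrick skin tone modifiers as the five code points U+1F3FB through U+1F3FF, which follow the base emoji as separate characters.

```python
# The five Fitzpatrick modifier code points, U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}

def remove_emoji_modifier(text):
    """Drop skin tone modifiers so every variant maps to the base emoji."""
    return "".join(ch for ch in text if ch not in SKIN_TONE_MODIFIERS)

waving = "\U0001F44B\U0001F3FD"  # waving hand + medium skin tone modifier
plain = remove_emoji_modifier(waving)

print(len(waving), len(plain))  # the modifier is a separate code point
```

Without this step, the same emoji in different skin tones would be tokenized as different strings.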
word2vec
Trained over tweets, treating each emoji as a "word" in its own right.
word2vec
spreads vectors across the hypersphere
Sample tweets containing the target emoji, compute the mean word2vec vector of each tweet, run low order affinity propagation to cluster them, and interpret the vectors near each cluster.
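A rough sketch of the "mean vector per tweet" step only, in plain Python. The embedding table holds made-up toy values, and mean_vector and unit are hypothetical helpers. In practice the vectors would come from the trained word2vec model, and the resulting tweet vectors would then go to a clustering routine such as scikit-learn's AffinityPropagation.

```python
import math

# Toy embedding table with made-up values, for illustration only.
EMBEDDINGS = {
    "good": [1.0, 0.0],
    "morning": [0.0, 1.0],
    "☀": [1.0, 1.0],
}

def mean_vector(tokens, embeddings):
    """Average the vectors of all known tokens; None if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def unit(vector):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

tokens = "good morning ☀ unknownword".split()
tweet_vec = mean_vector(tokens, EMBEDDINGS)   # unknown tokens are skipped
tweet_dir = unit(tweet_vec)                   # direction only, length 1
```

Normalizing to unit length is optional. It is included here because comparing directions (cosine similarity) is a common choice for word2vec vectors.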