But what if you don't want people to have your data...?
What kind of data?
Corporate finance / Political Donations / Regulations /
Anti-FOIA / Digital Marketer / ...
implementation : EASY : MEDIUM : HARD : WTF
defense : WEAK : REASONABLE : STRONG : RIDICULOUS
hack-level : SCRIPT-KIDDIE : CORPORATE : HOLLYWOOD
BlackHat1
: Disable right-click
<script language="javascript">
document.onmousedown = disableclick;
status = "Right Click Disabled";
function disableclick(event) {
  if (event.button == 2) {
    alert(status);
    return false;
  }
}
</script>
Also in this category, CSS overlays.
WhiteHat1
: Disable right-click
Find `disableclick` in the page source and remove it.
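For a saved copy of the page, the removal can be automated. The following is a minimal sketch, assuming the handler lives in an inline script block as in the BlackHat1 example; `strip_disableclick` is a hypothetical helper name, not part of any library.

```python
import re

# Matches one inline <script>...</script> block, including multi-line bodies.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script>", re.DOTALL | re.IGNORECASE)

def strip_disableclick(page):
    """Remove every <script> block that mentions the disableclick handler."""
    def drop_if_offending(match):
        block = match.group(0)
        return "" if "disableclick" in block else block
    return SCRIPT_BLOCK.sub(drop_if_offending, page)
```

Other scripts on the page are left alone, so the rest of the page keeps working.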
BlackHat2
: Minification
Kangax HTML Minifier: removes comments, whitespace, empty elements, and much more. Also minifies JavaScript and CSS. Ruby wrapper: html_minifier
<div class="reveal">
<div class="slides">
<section class="vertical-stack">
<section class="vertical-slide">
<h1>Black Hat</h1>
<h1>Data Wrangling</h1>
<hr>
<h3><a href="http://thoppe.github.io/">Travis Hoppe</a> /
<a href=
"http://robertwdempsey.com/about-robert-dempsey/">Robert
Dempsey</a></h3><a href=
"https://twitter.com/metasemantic">@metasemantic</a> /
<a href="https://twitter.com/rdempsey">@rdempsey</a>
<br>
</section>
</section>
</div>
</div>
<div class=reveal><div class=slides><section class=vertical-stack><section class=vertical-slide><h1>Black Hat</h1><h1>Data Wrangling</h1><hr><h3><a href="http://thoppe.github.io/">Travis Hoppe</a> / <a href="http://robertwdempsey.com/about-robert-dempsey/">Robert Dempsey</a></h3><a href=https://twitter.com/metasemantic>@metasemantic</a> / <a href=https://twitter.com/rdempsey>@rdempsey</a><p></p><br></section></section></div></div>
WhiteHat2
: Minification
Online tools: Unminify, JS Beautifier
or
Text editor: HTML Tidy (Sublime Text)
or
Automate it: JS Beautifier
$ pip install jsbeautifier
$ js-beautify file.js
BlackHat3
: Authentication
Use sessions (`$SESSIONS`): give every new visitor to the site a unique ID that you control and use to limit access. Bonus: restrict by user-agent.
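A minimal sketch of the session idea, assuming an in-memory store. The names `SessionGate` and `ALLOWED_AGENT_PREFIXES`, and the request cap, are hypothetical choices for illustration only.

```python
import secrets

# Hypothetical policy: only user agents with these prefixes get a session.
ALLOWED_AGENT_PREFIXES = ("Mozilla/",)

class SessionGate:
    """Issue server-controlled session IDs and cap requests per session."""

    def __init__(self, max_requests=100):
        self.max_requests = max_requests
        self.counts = {}  # session ID -> requests served so far

    def new_session(self, user_agent):
        """Return a fresh session ID, or None if the user agent is rejected."""
        if not user_agent.startswith(ALLOWED_AGENT_PREFIXES):
            return None
        session_id = secrets.token_urlsafe(16)
        self.counts[session_id] = 0
        return session_id

    def allow(self, session_id):
        """Count one request; refuse unknown or exhausted sessions."""
        if session_id not in self.counts:
            return False
        if self.counts[session_id] >= self.max_requests:
            return False
        self.counts[session_id] += 1
        return True
```

Note that this store never evicts old sessions, so on a real site it would grow without bound.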
WhiteHat3
: Authentication
Create session IDs with headless browsers
and
simulate user-agents
Black Hat Warning: Poorly designed session states
(that don't clear and hold large internal variables) can DoS your server!
BlackHat4
: Data & time limits
Detection: high download rates or unusual traffic within a given timespan;
all traffic from a single client or IP address.
Rate limit individual IP addresses or a specific id.
Delay content delivery.
Return HTTP 301 redirects, or 40x or 50x errors.
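Per-IP rate limiting can be sketched with a sliding window. This is an illustration under stated assumptions, not a production limiter: `RateLimiter`, its defaults, and the choice of status 429 (Too Many Requests) are all hypothetical.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key (IP or ID) -> recent timestamps

    def status_for(self, key, now=None):
        """Return the HTTP status to send: 200 if allowed, 429 if throttled."""
        now = time.monotonic() if now is None else now
        recent = self.hits[key]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return 429  # one of the 40x codes
        recent.append(now)
        return 200
```

The key can be an IP address or a session ID, so the same class covers both cases listed above.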
WhiteHat4
: Data & time limits
Cycle your IP address using VPN/proxy services or TOR (see TOR spiders).
and
Slow down your scraper: Scrapy autothrottle, custom timing code
and
Change your user agent: Scrapy random user agent, custom Python code
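The "custom timing code" and "custom Python code" options might look something like the sketch below, which uses only the standard library. The user-agent strings are placeholders and `polite_request` is a hypothetical helper name.

```python
import random
import time
import urllib.request

# Placeholder user-agent strings; substitute ones that match real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh) ExampleBrowser/2.0",
]

def polite_request(url, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Wait a random interval, then build a request with a random user agent."""
    sleep(random.uniform(min_delay, max_delay))
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)

# To actually fetch a page:
#   urllib.request.urlopen(polite_request("http://example.com/"))
```

The `sleep` parameter exists only so the delay can be swapped out when testing.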
BlackHat5
: Rendering to images
WhiteHat5
: Rendering to images
Server or desktop-based OCR software
or
Adobe Acrobat: Image -> PDF -> OCR (manual)
or
Python: OCRopus
or
Tesseract Open Source OCR Engine
BlackHat6
: JavaScript page links
Forces the user to simulate AJAX (stops headless browsers).
Combine with user sessions and data limits!
Psychology in Human-Computer Interaction by David Kieras
shows this frustrates the user with lack of control.
WhiteHat6
: JavaScript page links
Don't emulate a browser, be the browser! Selenium example:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.google.com')
q = driver.find_element(By.NAME, 'q')
q.send_keys('Black Hat Data Wrangling')
q.submit()
BlackHat7
: Watermarking
Can watermark non-images too!
WhiteHat7
: Watermarking
BlackHat8
: Honeypots & Steganography
A legal strong-arm strategy: freely give data but track its distribution.
Useful to determine ToS violations.
Poison the well! Leave fake data buried deep within the dataset.
$ identify -verbose panda.jpg
Image: panda.jpg
Format: JPEG (Joint Photographic Experts Group JFIF format)
...
Properties:
date:create: 2016-01-10T11:58:10-05:00
exif:ApertureValue: 327680/65536
exif:ColorSpace: 1
exif:DateTime: 2009:08:01 08:59:44
exif:DateTimeOriginal: 2009:07:24 04:17:22
...
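import numpy as np
from PIL import Image
# Load as int16 so small negative nudges cannot wrap around
jpg = np.asarray(Image.open("panda.jpg")).astype(np.int16)
# Pick roughly 0.1% of the values and shift each by -2..2
idx = np.random.uniform(size=jpg.shape) < 0.001
jpg[idx] += np.random.randint(-2, 3, size=idx.sum(), dtype=np.int16)
# Clip back into the valid 8-bit range before saving
jpg = np.clip(jpg, 0, 255).astype(np.uint8)
Image.fromarray(jpg).save("panda_new.jpg")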
# Test on command line
# $ md5sum *.jpg
# bd1a44ba2111eb675e78935d4d5cc186 panda.jpg
# 672c6dbf03828ea50a70bc81e19bfd69 panda_new.jpg
Works for any lossy format (mp3, gif, etc...)
For tabular data, hide identification in NULL fields that can be easily removed.
Perturb date-times by seconds in data records and save the offset.
If a bot or persistent downloader is identified, feed them faulty data.
Continually degrade the quality of the images sent as a function of downloads.
Remove rows, or return records not found with increasing frequency.
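The date-time perturbation above can be sketched as follows. One swapped-in detail: instead of saving each offset, this sketch re-derives it from a secret key with HMAC, so nothing has to be stored per record. `SECRET`, `offset_seconds`, `watermark`, and `matches` are hypothetical names.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

SECRET = b"server-side secret"  # hypothetical key; keep it private

def offset_seconds(client_id, record_id, max_shift=5):
    """Derive a repeatable shift in [-max_shift, max_shift] seconds."""
    message = f"{client_id}:{record_id}".encode()
    digest = hmac.new(SECRET, message, hashlib.sha256).digest()
    return digest[0] % (2 * max_shift + 1) - max_shift

def watermark(records, client_id):
    """Return (record_id, timestamp) rows with per-client shifts applied."""
    return [
        (rid, ts + timedelta(seconds=offset_seconds(client_id, rid)))
        for rid, ts in records
    ]

def matches(leaked, originals, client_id):
    """Check whether leaked rows carry the shifts issued to client_id."""
    expected = dict(watermark(originals, client_id))
    return all(expected.get(rid) == ts for rid, ts in leaked)
```

If a dataset turns up elsewhere, running `matches` against each known client ID points to whoever received that exact copy.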
WhiteHat8
: Honeypots & Steganography
Download data multiple times from different origins.
Run `diff` commands to suss out data that changes by IP and user.
Sanitize data by rejecting fields and entries that change between downloads.
Modify images to remove steganography (apply the same trick twice!)
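The sanitizing step can be sketched as a comparison across downloads. This assumes each download has already been parsed into a dict of records; `stable_fields` is a hypothetical helper name.

```python
def stable_fields(downloads):
    """Keep only the fields whose values agree across every download.

    `downloads` is a list of dicts mapping record ID -> {field: value},
    one dict per origin (IP address, account, and so on).
    """
    first, rest = downloads[0], downloads[1:]
    clean = {}
    for rid, row in first.items():
        # Reject whole records that some origin did not receive.
        if any(rid not in other for other in rest):
            continue
        clean[rid] = {
            field: value
            for field, value in row.items()
            if all(other[rid].get(field) == value for other in rest)
        }
    return clean
```

More origins give more confidence: a poisoned value only survives if every download happened to receive the same one.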
BlackHat9
: Remove markup metadata
Organized webpage = Organized data = Easy rip
Eschew all user design and layer components dynamically.
Example: http://arngren.net/
Remove markup. You can't rip what you can't see.
<div class="author">
<div class="firstname">Preston </div>
<div class="lastname"> Garvey </div>
</div>
<div class="author">
<div class="firstname">Piper </div>
<div class="lastname"> Wright </div>
</div>
<!-- Remove all class and id labels, like this -->
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>
WhiteHat9
: Remove markup
Rare in the wild, as this makes web development a nightmare.
Often found when devs use a lazy CMS...
Removing metadata slows users down, but syntax rules can be written per item:
html = '''
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>'''
import bs4
soup = bs4.BeautifulSoup(html,'lxml')
text = soup.div.text
names = text.strip().split('\n')
keys = "firstname", "lastname"
data = [dict(zip(keys,x.split())) for x in names]
print(data)
# [{'firstname': 'Preston', 'lastname': 'Garvey'}, {'firstname': 'Piper', 'lastname': 'Wright'}]
BlackHat10
: HTML obfuscation
Encode everything with HTML character codes and insert random benign HTML.
Start with this:
This is a string of text
Encode to this:
This is a<u></u> s
<i></i>tri<u></u>ng<i></i> <u></u>of text
'View Source' shows this:
<p>This page
 is<i></i> <u></u>mean<b></b>
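A minimal sketch of this obfuscation, assuming decimal character codes and a small set of empty tags. `BENIGN_TAGS`, `obfuscate`, and `deobfuscate` are hypothetical names. The second function shows that the transformation is mechanical to reverse once the pattern is known.

```python
import html
import random
import re

BENIGN_TAGS = ["<i></i>", "<u></u>", "<b></b>"]  # empty tags render as nothing

def obfuscate(text, tag_rate=0.2, rng=random):
    """Encode each character as a numeric entity and sprinkle in empty tags."""
    pieces = []
    for char in text:
        pieces.append(f"&#{ord(char)};")
        if rng.random() < tag_rate:
            pieces.append(rng.choice(BENIGN_TAGS))
    return "".join(pieces)

def deobfuscate(markup):
    """Undo obfuscate(): drop the empty tags, then decode the entities."""
    without_tags = re.sub(r"<([iub])></\1>", "", markup)
    return html.unescape(without_tags)
```

A browser renders the output of `obfuscate` as the original sentence, while 'View Source' shows only entity codes and stray tags.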
WhiteHat10
: HTML obfuscation Use the Selenium Web Driver
or
BlackHat11
: Serving HTML as PDF
Eschew style conventions and use multi-columns!
WhiteHat11
: Serving HTML as PDF
Use OCR to extract text and images from the PDF
or
Tabula to extract tabular data
BlackHat12
: Text remapping
Make the underlying text differ from its visual display:
Alter the text as it is copied. JSfiddle example
function addLink() {
  // Get the selected text and append the extra info
  var selection = window.getSelection(),
      pagelink = '<br /><br /> Read more at: ' + document.location.href,
      copytext = selection + pagelink,
      newdiv = document.createElement('div');

  // Hide the newly created container
  newdiv.style.position = 'absolute';
  newdiv.style.left = '-99999px';

  // Insert the container, fill it with the extended text, and define the new selection
  document.body.appendChild(newdiv);
  newdiv.innerHTML = copytext;
  selection.selectAllChildren(newdiv);

  window.setTimeout(function () {
    document.body.removeChild(newdiv);
  }, 100);
}

document.addEventListener('copy', addLink);
Simple text below, right?
TRAVIS
It renders as TRAVIS, but the text in the markup reads TgCRT3Qg3RT7SQNdsFATBsh8T3TVWKaKeTMgIayRwzhurStNVKkXZV:
<p class="codeblock">
T
<span style="position: absolute; left: -100px; top: -100px">gCRT3Qg3</span>
R
<span style="position: absolute; left: -100px; top: -100px">T7SQNdsF</span>
A
<span style="position: absolute; left: -100px; top: -100px">TBsh8T3T</span>
V
<span style="position: absolute; left: -100px; top: -100px">WKaKeTMg</span>
I
<span style="position: absolute; left: -100px; top: -100px">ayRwzhur</span>
S
<span style="position: absolute; left: -100px; top: -100px">tNVKkXZV</span>
</p>
Any data payload can be inserted here (e.g. copyright claims, point of origin, etc...)
Render document to PDF and remap fonts per document for protected data.
WTH? How does it work?
A PDF is a collection of symbols drawn on a page. Draw `c` here, draw `a` there, etc. A PDF reader only knows what a letter is because it maps to a specific character code in the font. Simply create a new font that lies about its mapping.
Multiple fonts can be used to improve the "encryption" process,
one font per character gives a one-time pad!
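The idea can be illustrated without building a real font file. In this sketch, two plain dictionaries stand in for the font: `stored` decides which character code is written into the document, and `glyphs` is the font's table from code to the shape drawn. All names here are hypothetical, and only the letters a-z are remapped.

```python
import random
import string

def make_lying_font(seed):
    """Build a per-document mapping pair for the letters a-z."""
    letters = list(string.ascii_lowercase)
    codes = letters[:]
    random.Random(seed).shuffle(codes)
    stored = dict(zip(letters, codes))   # real letter -> code saved in the file
    glyphs = dict(zip(codes, letters))   # code in the file -> shape drawn
    return stored, glyphs

def encode(text, stored):
    """What copy-paste or a text extractor sees."""
    return "".join(stored.get(ch, ch) for ch in text)

def render(codes, glyphs):
    """What a human reading the page sees."""
    return "".join(glyphs.get(ch, ch) for ch in codes)
```

Using a different seed per document, or per character as suggested above, changes the mapping each time, so a table learned from one document does not decode the next.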
WhiteHat12
: Text remapping
For JavaScript remapping, use a headless browser. For hidden spans, learn and write custom rules to remove the offending page elements. For font remapping...
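For the hidden-span case, one such custom rule might look like the sketch below. It assumes the decoy spans are marked by `position: absolute` in an inline style, as in the BlackHat12 markup; `VisibleText` and `visible_text` are hypothetical names, and a different page would need a different rule.

```python
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text, skipping <span> elements positioned off-screen."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.stack = []  # one bool per open span: is it hidden?

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            # Page-specific rule: absolutely positioned spans are decoys.
            self.stack.append("position: absolute" in style)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when no enclosing span is hidden.
        if not any(self.stack):
            self.chunks.append(data.strip())

def visible_text(markup):
    """Return the text a reader would actually see on the page."""
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Run on the example markup, this keeps the single letters between the spans and discards the filler inside them.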
Got any more Black Hat Hacks? Let us know!
#blackhatdata
/ @metasemantic / @rdempsey