Black Hat

Data Wrangling

Travis Hoppe / Robert Dempsey

@metasemantic / @rdempsey

Data Wranglers

Want your data!

How can you stop (or at least slow) them?

White-hat data wrangler

Working hard to make your data accessible to others.

But what if you don't want people to have your data...?

Black-hat data wrangler

Working hard to make your data as inaccessible as possible.

What kind of data?

Corporate finance / Political Donations / Regulations /
Anti-FOIA / Digital Marketer / ...

Why not disconnect from the net?

When do you need a

Black Hat Data Wrangler?

You have a large amount of data

The data must be made public

Should be human readable but computer-unfriendly

Your actions should be hidden to a casual user

Presentation format

Hack quantification

implementation : EASY : MEDIUM : HARD : WTF



Table of Contents

Disable right-click [T]

Minification [R]

Authentication [T]

Data limits [R]

Rendering to images [R]

JavaScript page links [T]

Watermarking [R]

Honeypots & Steganography [T]

Remove markup metadata [T]

HTML obfuscation [R]

Serving HTML as PDF [R]

Text remapping [T]

BlackHat1: Disable right-click

implementation EASY : defense WEAK : hack-level SCRIPT-KIDDIE

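<script language="javascript">
// Swallow the context-menu event so the right-click menu never opens.
status="Right Click Disabled";
function disableclick(event) {
  if(event.button==2) {
     event.preventDefault();
     return false;
  }
}
document.addEventListener("contextmenu", disableclick);
</script>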

Also in this category, CSS overlays.

WhiteHat1: Disable right-click

Open developers console (F12), search for disableclick and remove.

Turn off JavaScript.

Use a headless (or mobile) browser.

BlackHat2: Minification

implementation EASY : defense WEAK : hack-level SCRIPT-KIDDIE

Kangax HTML Minifier: removes comments, whitespace, empty elements, and much more. Also minifies javascript and CSS. Ruby wrapper: html_minifier


<div class="reveal">
    <div class="slides">
        <section class="vertical-stack">
            <section class="vertical-slide">
                <h1>Black Hat</h1>
                <h1>Data Wrangling</h1>
                <h3><a href="">Travis Hoppe</a> /
                <a href="">Robert
                Dempsey</a></h3><a href=
                "">@metasemantic</a> /
                <a href="">@rdempsey</a>


<div class=reveal><div class=slides><section class=vertical-stack><section class=vertical-slide><h1>Black Hat</h1><h1>Data Wrangling</h1><hr><h3><a href="">Travis Hoppe</a> / <a href="">Robert Dempsey</a></h3><a href=>@metasemantic</a> / <a href=>@rdempsey</a><p></p><br></section></section></div></div>

WhiteHat2: Minification

De-minify the HTML using freely available tools.

Online tools: Unminify, JS Beautifier
Text editor: HTML Tidy (Sublime Text)
Automate it: JS Beautifier

$ pip install jsbeautifier
$ js-beautify file.js

BlackHat3: Authentication

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE

not RESTful?

Implement visitor control via sessions (e.g. PHP's $_SESSION). Give every new visitor to the site a unique ID that you control and use to limit access. Bonus: restrict by user-agent.


Require all meaningful data requests to go through OAuth2: cumbersome for newcomers, and it gives you direct control over data distribution.
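A minimal sketch of the session idea above, written in Python for illustration. The names `SessionGate`, `max_requests`, and `blocked_agents` are hypothetical, and the store is a plain in-memory dict; a real site would presumably lean on its framework's session machinery instead.

```python
import secrets


class SessionGate:
    """Hand out server-controlled IDs and cap requests per ID (illustrative only)."""

    def __init__(self, max_requests=100,
                 blocked_agents=("curl", "python-requests", "scrapy")):
        self.max_requests = max_requests
        self.blocked_agents = tuple(a.lower() for a in blocked_agents)
        self.counts = {}  # session id -> number of requests served

    def new_session(self, user_agent):
        """Return a fresh session ID, or None if the user-agent looks like a bot."""
        ua = user_agent.lower()
        if any(bad in ua for bad in self.blocked_agents):
            return None
        sid = secrets.token_hex(16)
        self.counts[sid] = 0
        return sid

    def allow(self, sid):
        """Count a request; refuse unknown IDs and IDs over their quota."""
        if sid not in self.counts or self.counts[sid] >= self.max_requests:
            return False
        self.counts[sid] += 1
        return True


gate = SessionGate(max_requests=2)
sid = gate.new_session("Mozilla/5.0 (X11; Linux x86_64)")
results = [gate.allow(sid) for _ in range(3)]  # third request exceeds the quota
bot_sid = gate.new_session("python-requests/2.31")
```

Because the counter lives on the server, the visitor cannot reset it without obtaining a new ID, which is exactly what the white-hat counter below automates.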

WhiteHat3: Authentication

Create session IDs with headless browsers.
Simulate user-agents.

Black Hat Warning: Poorly designed session states
(that don't clear and hold large internal variables) can DoS your server!

BlackHat4: Data & time limits

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE

Detection: high download rates or unusual traffic within a given timespan;
all traffic from a single client or IP address.

Rate limit individual IP addresses or a specific id.
Delay content delivery.
Return HTTP 301 redirects or 40x/50x errors.
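The rate-limiting bullet could be sketched as a sliding-window counter. This is an illustrative Python sketch, not a production limiter: `RateLimiter`, `limit`, and `window` are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and 429 is used here as one of the 40x responses.

```python
from collections import defaultdict, deque


class RateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per client."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # client id -> timestamps of recent requests

    def status_for(self, client, now):
        """Return the HTTP status to send for a request arriving at time `now`."""
        recent = self.hits[client]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return 429  # Too Many Requests
        recent.append(now)
        return 200


limiter = RateLimiter(limit=3, window=10.0)
statuses = [limiter.status_for("203.0.113.7", t) for t in (0, 1, 2, 3, 11)]
# statuses == [200, 200, 200, 429, 200]
```

Keying on a session ID instead of the IP address gives the "specific id" variant from the same bullet.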

WhiteHat4: Data & time limits

Cycle your IP address using VPN/proxy services or TOR (see TOR spiders).
Slow down your scraper: Scrapy autothrottle, custom timing code
Change your user agent: Scrapy random user agent, custom Python code
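The "custom timing code" and "custom Python code" mentioned above might look like the following sketch. The user-agent strings are made-up placeholders and `next_delay` / `build_request` are hypothetical helper names; the request is only prepared, never sent.

```python
import random
import urllib.request

# Hypothetical pool of user-agent strings; substitute current real ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_0) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def next_delay(base=2.0, jitter=1.0, rng=random):
    """Seconds to sleep before the next request: base plus or minus random jitter."""
    return base + rng.uniform(-jitter, jitter)


def build_request(url, rng=random):
    """Prepare (but do not send) a request carrying a randomly chosen user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": rng.choice(USER_AGENTS)})


req = build_request("https://example.com/data?page=1")
delay = next_delay()
# In a real loop: time.sleep(delay), then urllib.request.urlopen(req)
```

Randomizing the delay matters as much as lengthening it, since perfectly regular request spacing is itself an unusual traffic pattern.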

BlackHat5: Rendering to images

implementation MEDIUM : defense STRONG : hack-level CORPORATE

Text to Image

PHP Text to Image / ImageMagick
Draw text onto an HTML5 canvas using JavaScript / use the HTML5 canvasElement.toDataURL method

WhiteHat5: Rendering to images

Server or desktop-based OCR software
Adobe Acrobat: Image -> PDF -> OCR (manual)
Python: OCRopus
Tesseract Open Source OCR Engine

BlackHat6: JavaScript page links

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE
Infinite pagination/scroll. Ex. Dribbble

Forces the user to simulate AJAX (stops headless browsers).
Combine with user sessions and data limits!

Psychology in Human-Computer Interaction by David Kieras
shows this frustrates the user with lack of control.

Image from visualhierarchy

WhiteHat6: JavaScript page links

Don't emulate a browser, be the browser! Selenium ex.

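from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get("https://example.com/search")  # placeholder URL for a page with a search box named 'q'

q = driver.find_element(By.NAME, 'q')
q.send_keys('Black Hat Data Wrangling')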

BlackHat7: Watermarking

implementation EASY : defense STRONG : hack-level SCRIPT-KIDDIE

Easy mode, simple IP protection

Easy to remove by hand.
Removal takes time and is not easily automated.

Can watermark non-images too!

WhiteHat7: Watermarking

Simple removal

Crop the picture in any photo editor
Use the restoration function in Inpaint: $20

More complex removal

"Content Aware Fill" in Photoshop



BlackHat8: Honeypots & Steganography

implementation HARD : defense RIDICULOUS : hack-level HOLLYWOOD
Steganography: embed data to identify and track IP/credentials.

A legal strong-arm strategy, freely give data but track its distribution.

Useful to determine ToS violations.

Poison the well! Leave fake data buried deep within the dataset.

Image steganography

Hide data in the EXIF header (obvious place, easy to remove), ExifTool

Kevin Dooley, Flickr

$ identify -verbose panda.jpg 

 Image: panda.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
    date:create: 2016-01-10T11:58:10-05:00
    exif:ApertureValue: 327680/65536
    exif:ColorSpace: 1
    exif:DateTime: 2009:08:01 08:59:44
    exif:DateTimeOriginal: 2009:07:24 04:17:22

Image steganography

Map post-filter md5sum to user data (not resistant to image changes).
Impossible for user to know what is being stored!

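import numpy as np
from PIL import Image  # Pillow stands in for scipy's imread/imsave, which recent SciPy no longer ships

# Work in a signed type so the +/- noise cannot wrap around in uint8.
jpg = np.asarray(Image.open("panda.jpg")).astype(np.int16)
# Nudge roughly 0.1% of the channel values by a small random amount.
idx = np.random.uniform(size=jpg.shape) < 0.001
jpg[idx] += np.random.randint(-2, 3, size=idx.sum()).astype(np.int16)
jpg = np.clip(jpg, 0, 255).astype(np.uint8)
Image.fromarray(jpg).save("panda_new.jpg")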
# Test on command line
# $ md5sum *.jpg
# bd1a44ba2111eb675e78935d4d5cc186  panda.jpg
# 672c6dbf03828ea50a70bc81e19bfd69  panda_new.jpg
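The "map post-filter md5sum to user data" step can be sketched with the standard library alone. This is an assumption-laden illustration: it perturbs raw bytes rather than decoded pixels, and `fingerprint_copy`, `registry`, and the user labels are hypothetical.

```python
import hashlib
import random


def fingerprint_copy(payload, user, registry, rng):
    """Perturb a few bytes of `payload`, then record md5(copy) -> user in `registry`."""
    data = bytearray(payload)
    for pos in rng.sample(range(len(data)), k=max(1, len(data) // 1000)):
        data[pos] ^= 1  # flip the lowest bit; a stand-in for the pixel noise above
    copy = bytes(data)
    registry[hashlib.md5(copy).hexdigest()] = user
    return copy


registry = {}
original = bytes(range(256)) * 40  # stand-in for the image's pixel data
rng = random.Random(0)
copy_a = fingerprint_copy(original, "alice@198.51.100.4", registry, rng)
copy_b = fingerprint_copy(original, "bob@203.0.113.9", registry, rng)

# A leaked file is traced back by hashing it and looking the digest up.
leaker = registry.get(hashlib.md5(copy_b).hexdigest())
```

As the slide notes, the lookup only works on an untouched copy: any re-save or edit changes the digest.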

General steganography

Works for any lossy format (mp3, gif, etc...)
For tabular data, hide identification in NULL fields that can be easily removed.
Perturb date-times by seconds in data records and save the offset.
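A sketch of the date-time perturbation for tabular data, assuming records are dicts with a `when` field (a hypothetical schema). `watermark_timestamps` and `matches` are illustrative names; the saved offsets act as the per-recipient fingerprint.

```python
import random
from datetime import datetime, timedelta


def watermark_timestamps(records, rng, max_shift=3):
    """Shift each record's timestamp by a few seconds; return (marked, offsets)."""
    marked, offsets = [], []
    for rec in records:
        shift = rng.randint(-max_shift, max_shift)
        marked.append({**rec, "when": rec["when"] + timedelta(seconds=shift)})
        offsets.append(shift)
    return marked, offsets


def matches(leaked, originals, offsets):
    """True if `leaked` carries exactly the offsets issued to one recipient."""
    return all(
        (leak["when"] - orig["when"]).total_seconds() == shift
        for leak, orig, shift in zip(leaked, originals, offsets)
    )


originals = [{"id": i, "when": datetime(2016, 1, 10, 12, 0, i)} for i in range(20)]
for_alice, alice_offsets = watermark_timestamps(originals, random.Random(1))
for_bob, bob_offsets = watermark_timestamps(originals, random.Random(2))
```

Here `matches(for_alice, originals, alice_offsets)` holds while the same rows checked against `bob_offsets` do not, which is what identifies the source of a leak.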


If a bot or persistent downloader is identified, feed them faulty data.
Progressively degrade the quality of the images sent as a function of download count.
Remove rows, or return records not found with increasing frequency.
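One way to return "not found" with increasing frequency is to tie the failure probability to the client's download count. The linear ramp, the `ramp` and `ceiling` parameters, and the function names below are all assumptions made for illustration.

```python
import random


def not_found_probability(downloads, ramp=1000, ceiling=0.5):
    """Chance of faking a 'not found' grows linearly with downloads, up to `ceiling`."""
    return min(ceiling, downloads / ramp * ceiling)


def serve(record, downloads, rng):
    """Return the record, or None (a fake 'not found') with increasing frequency."""
    if rng.random() < not_found_probability(downloads):
        return None
    return record


rng = random.Random(7)
fresh = [serve({"id": 1}, downloads=0, rng=rng) for _ in range(1000)]
heavy = [serve({"id": 1}, downloads=5000, rng=rng) for _ in range(1000)]
missing_fresh = fresh.count(None)  # new visitors always get the data
missing_heavy = heavy.count(None)  # about half of a heavy downloader's requests fail
```

Capping the probability below 1 keeps the failures looking like flakiness rather than a ban, so the downloader is less likely to notice and switch identities.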

WhiteHat8: Honeypots & Steganography

Download data multiple times from different origins.

Run diff commands to suss out data that changes by IP and user.

Sanitize data by rejecting fields and entries that change between downloads.

Modify image to remove steganography (apply same trick twice!)
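The download-and-diff approach above could be sketched as follows, assuming each download has already been parsed into a dict of record id to fields. `stable_fields` and the sample records are hypothetical.

```python
def stable_fields(downloads):
    """Keep only the (record id, field) values that agree across every download.

    `downloads` is a list of dicts mapping record id -> {field: value}.
    Records missing from any download are dropped as suspect.
    """
    first, rest = downloads[0], downloads[1:]
    clean = {}
    for rec_id, fields in first.items():
        if not all(rec_id in other for other in rest):
            continue  # a row that only some origins receive may be a honeypot
        clean[rec_id] = {
            name: value
            for name, value in fields.items()
            if all(other[rec_id].get(name) == value for other in rest)
        }
    return clean


# Hypothetical downloads of the same dataset from two different origins.
via_vpn = {1: {"name": "Preston Garvey", "seen": "12:00:01"},
           2: {"name": "Piper Wright", "seen": "12:05:00"},
           3: {"name": "Fake Person", "seen": "12:09:00"}}
via_tor = {1: {"name": "Preston Garvey", "seen": "12:00:03"},
           2: {"name": "Piper Wright", "seen": "12:05:00"}}

clean = stable_fields([via_vpn, via_tor])
```

The perturbed timestamp on record 1 and the planted record 3 both disappear, at the cost of also discarding any field that legitimately changes between requests.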

BlackHat9: Remove markup metadata

implementation HARD : defense REASONABLE : hack-level CORPORATE

Two ways:

1. Break the standard UX design.

2. Remove proper HTML/CSS markup.

Organized webpage = Organized data = Easy rip

Eschew all user design and layer components dynamically.

Remove markup. You can't rip what you can't see.

<div class="author">
    <div class="firstname">Preston </div>
    <div class="lastname"> Garvey  </div>
</div>

<div class="author">
    <div class="firstname">Piper </div>
    <div class="lastname"> Wright  </div>
</div>

<!-- Remove all class and id labels, like this -->
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>

WhiteHat9: Remove markup

Rare in the wild, as this makes web development a nightmare.
Often found when devs use a lazy CMS...

Removing metadata slows users down, but syntax rules can be written per item:

html = '''
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>'''

import bs4
soup = bs4.BeautifulSoup(html, 'lxml')
# One name per line inside the div
text = soup.div.text
names = text.strip().split('\n')

keys = "firstname", "lastname"
data = [dict(zip(keys, x.split())) for x in names]

print(data)
# [{'firstname': 'Preston', 'lastname': 'Garvey'}, {'firstname': 'Piper', 'lastname': 'Wright'}]

BlackHat10: HTML obfuscation

implementation EASY : defense REASONABLE : hack-level SCRIPT-KIDDIE

Encode everything with HTML character codes and insert random benign HTML.

Start with this:

This is a string of text

Encode to this:


'View Source' shows this:


WhiteHat10: HTML obfuscation

Use the Selenium Web Driver

  1. Create a headless web browser
  2. Open the page
  3. Take a screenshot of the page
  4. Use OCR to extract the text from the screenshot

Or:

  1. Capture the entire page (curl, etc.)
  2. Decode the HTML characters using BeautifulSoup4
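To make the character-code trick and its reversal concrete, here is a small sketch. It uses the standard library's `html.unescape` as a stand-in for BeautifulSoup4, `obfuscate` is a hypothetical helper, and the "random benign HTML" part of the black-hat step is not shown.

```python
import html


def obfuscate(text):
    """Encode every character as a decimal HTML character reference."""
    return "".join("&#{};".format(ord(ch)) for ch in text)


plain = "This is a string of text"
encoded = obfuscate(plain)        # what 'View Source' would show
decoded = html.unescape(encoded)  # what the browser (or a scraper) recovers
```

The encoded form begins `&#84;&#104;&#105;&#115;` for "This", and a single call undoes it, which is why the defense only rates REASONABLE.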

BlackHat11: Serving HTML as PDF

implementation HARD : defense STRONG : hack-level CORPORATE

Convert all Text to PDF

Use PhantomJS, Wkhtmltopdf or PDFKit (node.js)

Eschew style conventions and use multi-columns!

WhiteHat11: Serving HTML as PDF

Use OCR to extract text and images from the PDF
Tabula to extract tabular data

BlackHat12: Text remapping

implementation WTF : defense RIDICULOUS : hack-level HOLLYWOOD

Alter text from visual display:


Hidden spans

Font remapping

Javascript text manipulation

Alter the text as it is copied. JSfiddle example

function addLink() {
    //Get the selected text and append the extra info
    var selection = window.getSelection(),
        pagelink = '<br /><br /> Read more at: ' + document.location.href,
        copytext = selection + pagelink,
        newdiv = document.createElement('div');

    //hide the newly created container
    newdiv.style.position = 'absolute';
    newdiv.style.left = '-99999px';

    //insert the container, fill it with the extended text, and define the new selection
    document.body.appendChild(newdiv);
    newdiv.innerHTML = copytext;
    selection.selectAllChildren(newdiv);

    //remove the container once the copy has gone through
    window.setTimeout(function () {
        document.body.removeChild(newdiv);
    }, 100);
}

document.addEventListener('copy', addLink);

Hidden spans

simple text below right?


copy and paste transforms

 <p class="codeblock">
   <span style="position: absolute; left: -100px; top: -100px">gCRT3Qg3</span>
   <span style="position: absolute; left: -100px; top: -100px">T7SQNdsF</span>
   <span style="position: absolute; left: -100px; top: -100px">TBsh8T3T</span>
   <span style="position: absolute; left: -100px; top: -100px">WKaKeTMg</span>
   <span style="position: absolute; left: -100px; top: -100px">ayRwzhur</span>
   <span style="position: absolute; left: -100px; top: -100px">tNVKkXZV</span>
 </p>

Any data payload can be inserted here (e.g. copyright claims, point of origin, etc...)

Font remapping

Render document to PDF and remap fonts per document for protected data.

Example: font_remapping.pdf

WTH? How does it work?

A PDF is a collection of symbols drawn on a page. Draw `c` here, draw `a` there, etc. A PDF reader only knows what a letter is because it maps to a specific character code in the font. Simply create a new font that lies about its mapping.

Multiple fonts can be used to improve the "encryption" process,
one font per character gives a one-time pad!
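The "font that lies about its mapping" can be modeled as a substitution table. The sketch below does not build a real font file; `make_font_map`, `encode_for_font`, and `render` are hypothetical names that only illustrate why copied text differs from displayed text.

```python
import random
import string


def make_font_map(rng):
    """Return a hypothetical 'lying font': stored code -> glyph actually drawn."""
    codes = list(string.ascii_lowercase)
    glyphs = codes[:]
    rng.shuffle(glyphs)
    return dict(zip(codes, glyphs))


def encode_for_font(text, font_map):
    """Choose the stored codes so that the remapped font draws `text`."""
    inverse = {glyph: code for code, glyph in font_map.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def render(stored, font_map):
    """What a human sees: each stored code drawn with the glyph the font assigns."""
    return "".join(font_map.get(ch, ch) for ch in stored)


font_map = make_font_map(random.Random(42))
stored = encode_for_font("black hat data", font_map)  # what copy-paste yields
seen = render(stored, font_map)                       # what the page displays
```

With a single font this is a simple substitution cipher, open to frequency analysis; drawing a fresh map for every character is what the slide compares to a one-time pad.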

WhiteHat12: Text remapping

For Javascript remapping use a headless browser. For hidden spans, learn and write custom rules to remove the offending page elements. For font remapping...

Throw money and humans at it: Mechanical Turk
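For the hidden-span case, a "custom rule" might look like the sketch below, which drops any element positioned off-screen through an inline style. It uses only the standard library's `HTMLParser`; the `VisibleText` class is hypothetical, and the way the visible words are interleaved with the spans in `page` is assumed, since the slide's rendered text is not preserved.

```python
import re
from html.parser import HTMLParser

OFFSCREEN = re.compile(r"position:\s*absolute.*(left|top):\s*-\d+", re.I | re.S)
VOID = {"br", "hr", "img", "input", "meta", "link"}  # tags with no closing tag


class VisibleText(HTMLParser):
    """Collect text, skipping elements positioned off-screen via inline style."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one bool per open element: is it (or an ancestor) hidden?
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        style = dict(attrs).get("style") or ""
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(inherited or bool(OFFSCREEN.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


page = (
    '<p class="codeblock">simple '
    '<span style="position: absolute; left: -100px; top: -100px">gCRT3Qg3</span>'
    'text '
    '<span style="position: absolute; left: -100px; top: -100px">T7SQNdsF</span>'
    'below right?</p>'
)
parser = VisibleText()
parser.feed(page)
parser.close()
visible = "".join(parser.chunks)
```

A rule like this is tied to one hiding technique; a site that hides spans through a stylesheet class instead would need a different rule, which is why the slide says to learn them per site.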

Thanks, you!

Got any more Black Hat Hacks? Let us know!

#blackhatdata / @metasemantic / @rdempsey