Black Hat

Data Wrangling

Data Wranglers

Want your data!

How can you stop (or at least slow) them?

White-hat data wrangler

Working hard to make your data accessible to others.

But what if you don't want people to have your data...?

Black-hat data wrangler

Working hard to make your data as inaccessible as possible.

What kind of data?

Corporate finance / Political Donations / Regulations /
Anti-FOIA / Digital Marketer / ...

Why not disconnect from the net?

When do you need a

Black Hat Data Wrangler?

You have a large amount of data

The data must be made public

Should be human readable but computer-unfriendly

Your actions should be hidden to a casual user

Presentation format

Hack quantification

implementation : EASY : MEDIUM : HARD : WTF

defense : WEAK : REASONABLE : STRONG : RIDICULOUS

hack-level : SCRIPT-KIDDIE : CORPORATE : HOLLYWOOD

Table of Contents

Disable right-click [T]

Minification [R]

Authentication [T]

Data limits [R]

Rendering to images [R]

JavaScript page links [T]

Watermarking [R]

Honeypots & Steganography [T]

Remove markup metadata [T]

HTML obfuscation [R]

Serving HTML as PDF [R]

Text remapping [T]

`BlackHat1`: Disable right-click

implementation EASY : defense WEAK : hack-level SCRIPT-KIDDIE

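<script>
// Swallow right-clicks and show a notice instead of the context menu.
var message = "Right Click Disabled";
function disableclick(event) {
  if (event.button == 2) {
    alert(message);
    return false;
  }
}
document.onmousedown = disableclick;
// Modern browsers open the menu on `contextmenu`, so cancel that event too.
document.oncontextmenu = function () { return false; };
</script>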

Also in this category, CSS overlays.

`WhiteHat1`: Disable right-click

Open developers console (F12), search for `disableclick` and remove.

Turn off JavaScript.

Use a headless (or mobile) browser.

`BlackHat2`: Minification

implementation EASY : defense WEAK : hack-level SCRIPT-KIDDIE

Kangax HTML Minifier: removes comments, whitespace, empty elements, and much more. Also minifies javascript and CSS. Ruby wrapper: html_minifier

INPUT

<div class="reveal">
    <div class="slides">
        <section class="vertical-stack">
            <section class="vertical-slide">
                <h1>Black Hat</h1>
                <h1>Data Wrangling</h1>
                <hr>
                <h3><a href="http://thoppe.github.io/">Travis Hoppe</a> /
                <a href=
                "http://robertwdempsey.com/about-robert-dempsey/">Robert
                Dempsey</a></h3><a href=
                "https://twitter.com/metasemantic">@metasemantic</a> /
                <a href="https://twitter.com/rdempsey">@rdempsey</a>
                <br>
            </section>
        </section>
    </div>
</div>

OUTPUT

<div class=reveal><div class=slides><section class=vertical-stack><section class=vertical-slide><h1>Black Hat</h1><h1>Data Wrangling</h1><hr><h3><a href="http://thoppe.github.io/">Travis Hoppe</a> / <a href="http://robertwdempsey.com/about-robert-dempsey/">Robert Dempsey</a></h3><a href=https://twitter.com/metasemantic>@metasemantic</a> / <a href=https://twitter.com/rdempsey>@rdempsey</a><p></p><br></section></section></div></div>

`WhiteHat2`: Minification

De-minify the HTML using freely available tools.

Online tools: Unminify, JS Beautifier
or
Text editor: HTML Tidy (Sublime Text)
or
Automate it: JS Beautifier

$ pip install jsbeautifier
$ js-beautify file.js

`BlackHat3`: Authentication

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE

not RESTful?

Implement visitor control via server-side sessions (e.g. PHP's $_SESSION). Give every new visitor to the site a unique ID that you control and use it to limit access. Bonus: restrict by user-agent.

REST API?

Require all meaningful data requests to go through OAuth2. It is cumbersome for newcomers and gives you direct control over data distribution.

`WhiteHat3`: Authentication

Create session IDs with headless browsers
and
simulate user-agents

Black Hat Warning: Poorly designed session states
(that don't clear and hold large internal variables) can DoS your server!

`BlackHat4`: Data & time limits

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE

Detection: high download rates or unusual traffic within a given timespan;
all traffic from a single client or IP address.

Rate limit individual IP addresses or a specific id.
Delay content delivery.
Return HTTP 301, 40x or 50x errors (full list)
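
The rate-limiting idea above can be sketched as a sliding-window counter per client. This is a minimal illustration, not a production limiter: the class name `SlidingWindowLimiter`, its parameters, and the choice of an in-memory dictionary are all assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window` seconds for each client key."""

    def __init__(self, max_requests, window):
        self.max_requests = max_requests
        self.window = window
        # client key (IP address, session id, ...) -> timestamps of recent requests
        self.hits = defaultdict(deque)

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[key]
        # Forget requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False  # the caller would delay the response or return an error code
        recent.append(now)
        return True
```

A request handler would call `limiter.allow(client_ip)` and, on `False`, delay the response or return one of the error codes listed above.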

`WhiteHat4`: Data & time limits

Cycle your IP address using VPN/proxy services or TOR (see TOR spiders).
and
Slow down your scraper: Scrapy autothrottle, custom timing code
and
Change your user agent: Scrapy random user agent, custom Python code
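
"Custom timing code" and "custom Python code" for user agents might look roughly like the sketch below. The function name `request_plan`, the default delays, and the two user-agent strings are hypothetical placeholders; real strings would be copied from actual browsers.

```python
import itertools
import random

# Placeholder user-agent strings, assumed for illustration only.
USER_AGENTS = [
    "ExampleBrowser/1.0 (hypothetical desktop)",
    "ExampleBrowser/2.0 (hypothetical mobile)",
]


def request_plan(urls, base_delay=2.0, jitter=1.0, rng=random):
    """Yield (url, delay_seconds, user_agent) for each URL.

    The delay is base_delay plus a random jitter, so requests do not
    arrive on a fixed, easily detected schedule.
    """
    agents = itertools.cycle(USER_AGENTS)
    for url in urls:
        delay = base_delay + rng.uniform(0, jitter)
        yield url, delay, next(agents)
```

The scraper would `time.sleep(delay)` before each fetch and send `user_agent` in the `User-Agent` header.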

`BlackHat5`: Rendering to images

implementation MEDIUM : defense STRONG : hack-level CORPORATE

Text to Image

PHP Text to Image / ImageMagick
or
Draw text onto an HTML5 canvas using JavaScript / use the HTML5 canvasElement.toDataURL method

`WhiteHat5`: Rendering to images

Server or desktop-based OCR software
or
Adobe Acrobat: Image -> PDF -> OCR (manual)
or
Python: OCRopus
or
Tesseract Open Source OCR Engine

`BlackHat6`: JavaScript page links

implementation MEDIUM : defense REASONABLE : hack-level CORPORATE
Infinite pagination/scroll. Ex. Dribbble

Forces the user to simulate AJAX (stops headless browsers).
Combine with user sessions and data limits!

Psychology in Human-Computer Interaction by David Kieras
shows this frustrates the user with lack of control.

Image from visualhierarchy

Image from Per Vestman@Dribbble

`WhiteHat6`: JavaScript page links

Don't emulate a browser, be the browser! Selenium ex.

from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.google.com')

q = driver.find_element(By.NAME, 'q')
q.send_keys('Black Hat Data Wrangling')
q.submit()

`BlackHat7`: Watermarking

implementation EASY : defense STRONG : hack-level SCRIPT-KIDDIE

Easy mode, simple IP protection

Can watermark non images too!

`WhiteHat7`: Watermarking

Simple removal

Crop the picture in any photo editor
or
Use the restoration function in Inpaint: $20

More complex removal

"Content Aware Fill" in Photoshop

Cropping

Content Aware Fill in Photoshop

`BlackHat8`: Honeypots & Steganography

implementation HARD : defense RIDICULOUS : hack-level HOLLYWOOD
Steganography: embed data to identify and track IP/credentials.

A legal strong-arm strategy, freely give data but track its distribution.

Useful to determine ToS violations.

Poison the well! Leave fake data buried deep within the dataset.

Image steganography

Hide data in the EXIF header (obvious place, easy to remove), ExifTool

$ identify -verbose panda.jpg 

 Image: panda.jpg
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  ...
  Properties:
    date:create: 2016-01-10T11:58:10-05:00
    exif:ApertureValue: 327680/65536
    exif:ColorSpace: 1
    exif:DateTime: 2009:08:01 08:59:44
    exif:DateTimeOriginal: 2009:07:24 04:17:22
   ...

Image steganography

Map post-filter md5sum to user data (not resistant to image changes).
Impossible for user to know what is being stored!

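import numpy as np
from PIL import Image  # scipy's imread/imsave were removed; Pillow is used instead

# Load into a wider signed type so small +/- nudges cannot wrap around.
jpg = np.asarray(Image.open("panda.jpg")).astype(np.int16)
# Pick roughly 0.1% of the pixel values at random...
idx = np.random.uniform(size=jpg.shape) < 0.001
# ...and nudge each of them by an integer in [-2, 2].
jpg[idx] += np.random.randint(-2, 3, size=idx.sum()).astype(np.int16)
# Clamp back into the valid 8-bit range before saving.
jpg = np.clip(jpg, 0, 255).astype(np.uint8)
Image.fromarray(jpg).save("panda_new.jpg")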
# Test on command line
# $ md5sum *.jpg
# bd1a44ba2111eb675e78935d4d5cc186  panda.jpg
# 672c6dbf03828ea50a70bc81e19bfd69  panda_new.jpg

General steganography

Works for any lossy format (mp3, gif, etc...)
For tabular data, hide identification in NULL fields that can be easily removed.
Perturb date-times by seconds in data records and save the offset.
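
The date-time perturbation can be sketched as follows. Instead of saving each offset, this variant derives it from a secret key with an HMAC, so it can be recomputed later; that substitution, the `SECRET` value, and the function names are assumptions made for the example.

```python
import hashlib
import hmac
from datetime import datetime, timedelta

SECRET = b"server-side secret"  # hypothetical key, never sent to the user


def offset_for(user_id, record_id, max_seconds=5):
    """Derive a per-user, per-record offset in [-max_seconds, max_seconds]."""
    message = f"{user_id}:{record_id}".encode()
    digest = hmac.new(SECRET, message, hashlib.sha256).digest()
    return digest[0] % (2 * max_seconds + 1) - max_seconds


def watermark(records, user_id):
    """records: dict of record_id -> datetime. Returns the perturbed copy."""
    return {
        rid: ts + timedelta(seconds=offset_for(user_id, rid))
        for rid, ts in records.items()
    }


def identify(original, leaked, candidates):
    """Return the candidate users whose offsets reproduce the leaked copy."""
    return [user for user in candidates if watermark(original, user) == leaked]
```

With more records, the chance that two users receive identical offsets everywhere shrinks quickly, which is what makes a leaked copy traceable.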

Honeypots

If a bot or persistent downloader is identified, feed them faulty data.
Continually degrade the quality of the images served as a function of the number of downloads.
Remove rows, or return records not found with increasing frequency.

`WhiteHat8`: Honeypots & Steganography

Download data multiple times from different origins.

Run diff commands to suss out data that changes by IP and user.

Sanitize data by rejecting fields and entries that change with alternative DLs.

Modify image to remove steganography (apply same trick twice!)
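
The "download several times and diff" step can be sketched for tabular data as below. It assumes every download contains the same rows in the same order, each row being a dict; the function name `stable_fields` is hypothetical.

```python
def stable_fields(copies):
    """Keep only the fields whose value is identical in every download.

    copies: list of downloads, each a list of dicts (same rows, same order).
    Fields that differ between downloads are treated as possible tracking
    data and dropped.
    """
    cleaned = []
    for versions in zip(*copies):
        first, others = versions[0], versions[1:]
        cleaned.append({
            key: value
            for key, value in first.items()
            if all(key in other and other[key] == value for other in others)
        })
    return cleaned
```

Rows that were silently removed from one copy would break the same-order assumption, so a real comparison would first align rows on a stable key.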

`BlackHat9`: Remove markup metadata

implementation HARD : defense REASONABLE : hack-level CORPORATE

Two ways:

1. Break the standard UX design.

2. Remove proper HTML/CSS markup.

Organized webpage = Organized data = Easy rip

Eschew all user design and layer components dynamically.
Example: http://arngren.net/

Remove markup. You can't rip what you can't see.

<div class="author">
    <div class="firstname">Preston </div>
    <div class="lastname"> Garvey  </div>
</div>

<div class="author">
    <div class="firstname">Piper </div>
    <div class="lastname"> Wright  </div>
</div>

<!-- Remove all class and id labels, like this -->
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>

`WhiteHat9`: Remove markup metadata

Rare in the wild, as this makes web development a nightmare.
Most often found when devs use a lazy CMS...

Removing metadata slows users down, but syntax rules can be written per item:

html = '''
<div style="font-weight: bold;">
Preston Garvey <br>
Piper Wright
</div>'''

import bs4
soup = bs4.BeautifulSoup(html, 'lxml')
# With no class or id to key on, fall back to the text layout: one name per line.
text = soup.div.text
names = text.strip().split('\n')

keys = "firstname", "lastname"
data = [dict(zip(keys, x.split())) for x in names]

print(data)
# [{'firstname': 'Preston', 'lastname': 'Garvey'}, {'firstname': 'Piper', 'lastname': 'Wright'}]

`BlackHat10`: HTML obfuscation

implementation EASY : defense REASONABLE : hack-level SCRIPT-KIDDIE

Encode everything with HTML character codes and insert random benign HTML.

Start with this:

This is a string of text

Encode to this:

&#84;&#104;&#105;&#115;&#32;&#105;&#115;&#32;&#97;<u></u>&#32;&#115;
<i></i>&#116;&#114;&#105;<u></u>&#110;&#103;<i></i>&#32;<u></u>&#111;&#102;&#32;&#116;&#101;&#120;&#116;

'View Source' shows this:

<p>&#84;&#104;&#105;&#115;&#32;&#112;&#97;&#103;&#101;
&#32;&#105;&#115;<i></i>&#32;<u></u>&#109;&#101;&#97;&#110;<b></b>
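
The encoding step can be sketched in a few lines. The function name `obfuscate`, the `tag_rate` parameter, and the particular set of empty tags are assumptions for illustration; any tags that render as nothing would do.

```python
import random

# Empty inline elements that change nothing on screen.
BENIGN_TAGS = ["<u></u>", "<i></i>", "<b></b>"]


def obfuscate(text, tag_rate=0.2, rng=random):
    """Encode every character as a numeric HTML entity and sprinkle in
    empty tags with probability `tag_rate` after each character."""
    out = []
    for ch in text:
        out.append(f"&#{ord(ch)};")
        if rng.random() < tag_rate:
            out.append(rng.choice(BENIGN_TAGS))
    return "".join(out)
```

Because the tags are placed at random, the same text produces different markup on every request, which defeats naive string matching on the page source.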

`WhiteHat10`: HTML obfuscation

Use the Selenium Web Driver

1. Create a headless web browser
2. Open the page
3. Take a screenshot of the page
4. Use OCR to extract the text from the screenshot

or

1. Capture the entire page (curl, etc.)
2. Decode the HTML characters using BeautifulSoup4
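
The decoding route can also be sketched without BeautifulSoup4, using only the Python standard library. This swaps in `html.unescape` plus a regular expression for the parser named above, and it assumes the injected noise consists solely of empty elements such as `<u></u>`.

```python
import html
import re

# Matches an element with no content, e.g. <u></u> or <i></i>.
EMPTY_TAG = re.compile(r"<(\w+)></\1>")


def deobfuscate(markup):
    """Drop empty filler tags, then turn character codes back into text."""
    return html.unescape(EMPTY_TAG.sub("", markup))
```

Stripping the tags before unescaping keeps any literal angle brackets in the decoded text from being mistaken for markup.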

`BlackHat11`: Serving HTML as PDF

implementation HARD : defense STRONG : hack-level CORPORATE

Convert all Text to PDF

Use PhantomJS, Wkhtmltopdf or PDFKit (node.js)

Eschew style conventions and use multi-columns!

`WhiteHat11`: Serving HTML as PDF

Use OCR to extract text and images from the PDF
or
Tabula to extract tabular data

`BlackHat12`: Text remapping

implementation WTF : defense RIDICULOUS : hack-level HOLLYWOOD

Alter text from visual display:

Javascript

Hidden spans

Font remapping

Javascript text manipulation

Alter the text as it is copied. JSfiddle example

function addLink() {
    //Get the selected text and append the extra info
    var selection = window.getSelection(),
        pagelink = '<br /><br /> Read more at: ' + document.location.href,
        copytext = selection + pagelink,
        newdiv = document.createElement('div');

    //hide the newly created container
    newdiv.style.position = 'absolute';
    newdiv.style.left = '-99999px';

    //insert the container, fill it with the extended text, and define the new selection
    document.body.appendChild(newdiv);
    newdiv.innerHTML = copytext;
    selection.selectAllChildren(newdiv);

    window.setTimeout(function () {
        document.body.removeChild(newdiv);
    }, 100);
}
document.addEventListener('copy', addLink);

Hidden spans

Simple text below, right?

TRAVIS

Copy and paste transforms
TRAVIS into TgCRT3Qg3RT7SQNdsFATBsh8T3TVWKaKeTMgIayRwzhurStNVKkXZV

 <p class="codeblock">
   T
   <span style="position: absolute; left: -100px; top: -100px">gCRT3Qg3</span>
   R
   <span style="position: absolute; left: -100px; top: -100px">T7SQNdsF</span>
   A
   <span style="position: absolute; left: -100px; top: -100px">TBsh8T3T</span>
   V
   <span style="position: absolute; left: -100px; top: -100px">WKaKeTMg</span>
   I
   <span style="position: absolute; left: -100px; top: -100px">ayRwzhur</span>
   S
   <span style="position: absolute; left: -100px; top: -100px">tNVKkXZV</span>
 </p>

Any data payload can be inserted here (e.g. copyright claims, point of origin, etc...)
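
Markup like the example above can be generated automatically. In this sketch the payload is random filler; the function name `hide_payload` and the `chunk` parameter are hypothetical, and a real deployment would substitute a meaningful payload for the random characters.

```python
import random
import string

# Inline style that moves the span off-screen, as in the example markup.
HIDDEN_STYLE = "position: absolute; left: -100px; top: -100px"


def hide_payload(visible, chunk=8, rng=random):
    """Follow every visible character with an off-screen span of filler."""
    alphabet = string.ascii_letters + string.digits
    parts = []
    for ch in visible:
        junk = "".join(rng.choice(alphabet) for _ in range(chunk))
        parts.append(f'{ch}<span style="{HIDDEN_STYLE}">{junk}</span>')
    return '<p class="codeblock">' + "".join(parts) + "</p>"
```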

Font remapping

Render document to PDF and remap fonts per document for protected data.

Example: font_remapping.pdf

WTH? How does it work?

A PDF is a collection of symbols drawn on a page. Draw `c` here, draw `a` there, etc. A PDF reader only knows what a letter is because it maps to a specific character code in the font. Simply create a new font that lies about its mapping.

Multiple fonts can be used to improve the "encryption" process,
one font per character gives a one-time pad!
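
The idea can be simulated without building a real font. In this toy model a "font" is just a pair of lookup tables: what is stored in the file differs from what is drawn on screen. All names here are hypothetical, and only lowercase letters are remapped.

```python
import random
import string


def make_lying_font(rng=random):
    """Return (encode, glyphs).

    encode: real character -> character code stored in the document
    glyphs: stored character code -> glyph the font actually draws
    """
    alphabet = string.ascii_lowercase
    shuffled = list(alphabet)
    rng.shuffle(shuffled)
    encode = dict(zip(alphabet, shuffled))
    glyphs = {code: real for real, code in encode.items()}
    return encode, glyphs


def store(text, encode):
    """What copy/paste or a text extractor sees."""
    return "".join(encode.get(ch, ch) for ch in text)


def render(stored, glyphs):
    """What a human reading the page sees."""
    return "".join(glyphs.get(ch, ch) for ch in stored)
```

With a single remapped font this is a plain substitution cipher and falls to frequency analysis; using a fresh mapping per character is what the one-time-pad remark above refers to.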

`WhiteHat12`: Text remapping

For Javascript remapping use a headless browser. For hidden spans, learn and write custom rules to remove the offending page elements. For font remapping...

Throw money and humans at it: Mechanical Turk
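
For the hidden-span case, a custom rule might look like the sketch below, which uses the standard-library HTML parser and skips anything whose inline style matches the off-screen positioning seen earlier. The class name `VisibleText` and the exact style test are assumptions; a site using a stylesheet class instead of inline styles would need a different rule.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class VisibleText(HTMLParser):
    """Collect text, skipping elements positioned off-screen via inline style."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a hidden element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        hidden = "position: absolute" in style and "-100px" in style
        if self.skip_depth or hidden:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth and tag not in VOID_TAGS:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data.strip())


def visible_text(markup):
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)
```

Stripping whitespace from each text node suits the single-word example; ordinary prose would need spaces preserved between nodes.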

Thanks, you!

Got any more Black Hat Hacks? Let us know!

Travis Hoppe / Robert Dempsey

`#blackhatdata` / @metasemantic / @rdempsey