Scaling Up Machine Learning
Parallel and Distributed Approaches
Ron Bekkerman, LinkedIn
Misha Bilenko, Microsoft Research
John Langford, Yahoo! Research -> Microsoft Research
Presented by Tom
http://hunch.net/~large_scale_survey
1
Outline
2
The book
3
Chapter contributors (chapters 2-21)
4
Previous books (1998, 2000, 2000)
5
Data hypergrowth: an example
Bekkerman et al, SIGIR 2001
Bekkerman & Scholz, CIKM 2008
Bekkerman & Gavish, KDD 2011
6
New age of big data
7
Size matters
Those are not different numbers; those are different mindsets
8
One thousand data instances
9
One million data instances
10
A big dataset cannot be too sparse
11
Can big datasets be too dense?
12
One billion data instances
13
One trillion data instances
14
A solution to the data privacy problem
Xiang et al, Chapter 16
D1  D2  D3  D4  D5  D6
15
So what model will we learn?
16
Size of training data
17
Skewness of training data
18
Real-world example
19
Not enough (clean) training data?
20
Semi-supervised clustering
Bekkerman et al, ECML 2006
21
Semi-supervised clustering (details)
22
Crowdsourcing labeled data
23
How to label 1M instances
(build-up sequence, slides 24-39)
Got 1M labeled instances, now what?
40
Parallelization: platform choices

Platform           Communication Scheme   Data size
Peer-to-Peer       TCP/IP                 Petabytes
Virtual Clusters   MapReduce / MPI        Terabytes
HPC Clusters       MPI / MapReduce        Terabytes
Multicore          Multithreading         Gigabytes
GPU                CUDA                   Gigabytes
FPGA               HDL                    Gigabytes
41
Example: k-means clustering
42
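For reference before the parallel variants, a minimal sequential k-means sketch in Python/NumPy (an illustrative baseline, not code from the book):

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k initial centroids
    for _ in range(iters):
        # Assignment (E) step: nearest centroid for every point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update (M) step: each centroid becomes the mean of its points
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers, labels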
Parallelizing k-means
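Both steps decompose over data partitions: each worker can assign only its own points and accumulate per-cluster partial sums and counts, which a driver merges into the new centroids. A minimal sketch of that decomposition (function and variable names are illustrative, not from the book):

import numpy as np

def partial_stats(X_shard, centers):
    # Per-worker pass: assign local points, return per-cluster sums and counts.
    k, d = centers.shape
    dists = ((X_shard[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j], counts[j] = X_shard[mask].sum(axis=0), mask.sum()
    return sums, counts

def merge_stats(all_stats, old_centers):
    # Driver: combine the partial statistics from every worker into new centroids.
    sums = sum(s for s, _ in all_stats)
    counts = sum(c for _, c in all_stats)
    new_centers = old_centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers

The same pattern, carried over different communication layers, underlies the MapReduce, MPI, multicore, and GPU versions discussed below.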
Parallelization: platform choices
46
Peer-to-peer (P2P) systems
47
k-means in P2P
Datta et al, TKDE 2009
48
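Datta et al.'s algorithm lets peers converge on a clustering without any central coordinator by exchanging cluster statistics with their network neighbors; the exact protocol is in the paper. A toy sketch of the gossip idea only (not the published algorithm):

import numpy as np

def local_stats(X_local, centers):
    # Each peer assigns only its own points and forms per-cluster (sum, count) pairs.
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return [(X_local[labels == j].sum(axis=0), float((labels == j).sum()))
            for j in range(len(centers))]

def gossip_average(mine, neighbors):
    # Pairwise averaging with a neighbor; repeated rounds drive every peer's
    # estimate toward the network-wide statistics.
    return [((s1 + s2) / 2.0, (c1 + c2) / 2.0)
            for (s1, c1), (s2, c2) in zip(mine, neighbors)]

def centers_from_stats(stats, old_centers):
    # New centroid = averaged sum / averaged count; keep the old one if a cluster is empty.
    return np.array([s / c if c > 0 else old_centers[j]
                     for j, (s, c) in enumerate(stats)])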
Parallelization: platform choices
49
Virtual clusters
50
MapReduce
Mappers
Reducers
51
k-means on MapReduce
Panda et al, Chapter 2
52
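One k-means iteration maps naturally onto MapReduce: mappers emit each point keyed by its nearest centroid, and reducers average the points per centroid. A minimal pure-Python sketch of that pattern (illustrative only, not the code from Chapter 2):

def kmeans_map(point, centers):
    # Map: emit (nearest-centroid id, (point, 1)).
    nearest = min(range(len(centers)),
                  key=lambda j: sum((x - c) ** 2 for x, c in zip(point, centers[j])))
    yield nearest, (point, 1)

def kmeans_reduce(center_id, values):
    # Reduce: sum all points assigned to this centroid and divide by their count.
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + x for t, x in zip(total, point)]
        count += n
    yield center_id, [t / count for t in total]

A combiner that pre-aggregates (sum, count) pairs on each mapper node keeps the shuffled data small.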
Discussion on MapReduce
53
MapReduce wrappers
Olston et al, SIGMOD 2008
Yu et al, OSDI 2008
54
k-means in Apache Pig: input data

Document   Word   Count
doc1       new    2
doc1       york   2
55
k-means in Apache Pig: E-step

-- Input relations (see previous slide):
--   D: (d, w, id)  -- document, word, count of the word in the document
--   C: (c, w, ic)  -- centroid, word, weight of the word in the centroid
-- E-step: compute the dot product between every document and every centroid,
-- normalize by the centroid length, and keep the top-scoring centroid per document.
D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;
PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;      -- document-centroid dot products
SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;       -- centroid lengths
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;               -- length-normalized similarity
SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);               -- best centroid per document
56
k-means in Apache Pig: M-step

-- M-step: rebuild each centroid from the documents assigned to it in CLUSTERS.
D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;           -- per-(centroid, word) count sums
D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;      -- tuples per centroid
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;        -- new centroid weights
62
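The E-step and M-step above would typically live in two Pig scripts launched repeatedly by a small driver until the centroids stop changing; every iteration therefore pays for fresh MapReduce jobs, which is the overhead discussed on the next slide. A hypothetical driver sketch (the script names and the ITER parameter are illustrative, not from the book):

import subprocess

for it in range(10):   # or loop until the centroid relation C stops changing
    subprocess.run(["pig", "-param", f"ITER={it}", "-f", "kmeans_estep.pig"], check=True)
    subprocess.run(["pig", "-param", f"ITER={it}", "-f", "kmeans_mstep.pig"], check=True)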
MapReduce job setup time
Panda et al, Chapter 2
(diagram: repeated Setup / Process / Tear down cycles, with Data passed between them)
63
k-means in DryadLINQ
// Find the centroid nearest to a point (by Euclidean norm).
Vector NearestCenter(Vector point, IQueryable<Vector> centers)
{
    var nearest = centers.First();
    foreach (var center in centers)
        if ((point - center).Norm() < (point - nearest).Norm())
            nearest = center;
    return nearest;
}

// One k-means iteration: group points by nearest centroid,
// then average each group to obtain the updated centroids.
IQueryable<Vector> KMeansStep(IQueryable<Vector> vectors,
                              IQueryable<Vector> centers)
{
    return vectors.GroupBy(vector => NearestCenter(vector, centers))
                  .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Budiu et al, Chapter 3
64
Takeaways on MapReduce wrappers
65
Parallelization: platform choices
66
HPC clusters
Gropp et al, MIT Press 1994
67
Message Passing Interface (MPI)
68
MapReduce vs. MPI
69
k-means using MPI
Pednault et al, Chapter 4
70
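Chapter 4's toolbox has its own API; purely as an illustration of the core collective-communication pattern, here is one k-means iteration with mpi4py (assumed available), where a single allreduce combines per-rank cluster statistics so that every rank ends up with the same updated centroids:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def mpi_kmeans_step(X_local, centers):
    # Each rank assigns its local points and builds per-cluster sums and counts.
    k, d = centers.shape
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    local = np.zeros((k, d + 1))                 # columns 0..d-1: sums, last column: counts
    for j in range(k):
        mask = labels == j
        local[j, :d], local[j, d] = X_local[mask].sum(axis=0), mask.sum()
    # One collective call replaces the shuffle: sum the statistics across all ranks.
    total = np.zeros_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)
    counts = np.maximum(total[:, d], 1.0)        # guard against empty clusters
    return total[:, :d] / counts[:, None]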
Two features of MPI parallelization
Pednault et al, Chapter 4
71
Takeaways on MPI
Ye et al, CIKM 2009
72
Parallelization: platform choices
73
Multicore
Tatikonda & Parthasarathy, Chapter 20
74
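On a single multicore machine the same step can be parallelized with shared memory across threads or processes. A minimal sketch with Python's multiprocessing (illustrative only, not Chapter 20's code):

import numpy as np
from functools import partial
from multiprocessing import Pool

def chunk_stats(X_chunk, centers):
    # Assignment step on one chunk: per-cluster sums (columns 0..d-1) and counts (last column).
    k, d = centers.shape
    labels = ((X_chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    out = np.zeros((k, d + 1))
    for j in range(k):
        mask = labels == j
        out[j, :d], out[j, d] = X_chunk[mask].sum(axis=0), mask.sum()
    return out

def multicore_kmeans_step(X, centers, n_workers=8):
    # Split the data across worker processes and sum their partial statistics.
    with Pool(n_workers) as pool:
        stats = sum(pool.map(partial(chunk_stats, centers=centers),
                             np.array_split(X, n_workers)))
    counts = np.maximum(stats[:, -1], 1.0)
    return stats[:, :-1] / counts[:, None]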
Parallelization: platform choices
75
Graphics Processing Unit (GPU)
76
Machine learning with GPUs
Coates et al, Chapter 18
Chong et al, Chapter 21
77
k-means clustering on a GPU
Hsu et al, Chapter 5
78
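The expensive part of each iteration is the point-to-centroid distance computation, an n x k dense matrix that maps directly onto GPU hardware; Chapter 5 implements it on the GPU itself. As a rough illustration of the same structure only, a CuPy sketch (CuPy and a CUDA GPU assumed; not the chapter's code), using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2:

import cupy as cp   # illustrative only

def gpu_assign(X, centers):
    # Build the full n x k squared-distance matrix with one matrix multiply,
    # then take the per-row argmin to get each point's nearest centroid.
    Xg, Cg = cp.asarray(X), cp.asarray(centers)
    d2 = (Xg ** 2).sum(axis=1)[:, None] - 2.0 * (Xg @ Cg.T) + (Cg ** 2).sum(axis=1)[None, :]
    return cp.asnumpy(d2.argmin(axis=1))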
Performance results
79
Parallelization: platform choices
80
Field-programmable gate array (FPGA)
Durdanovic et al, Chapter 7
Farabet et al, Chapter 19
81
How to choose a platform
82
Conclusion
83
Thank You!
http://hunch.net/~large_scale_survey
84