Scaling Up Machine Learning
Parallel and Distributed Approaches
Ron Bekkerman, LinkedIn
Misha Bilenko, Microsoft Research
John Langford, Yahoo! Research -> Microsoft Research
Presented by Tom
http://hunch.net/~large_scale_survey
1
Outline
2
The book
3
Chapter contributors (chapters 2-21)
4
Previous books (1998, 2000, 2000)
5
Data hypergrowth: an example
Bekkerman et al, SIGIR 2001
Bekkerman & Scholz, CIKM 2008
Bekkerman & Gavish, KDD 2011
6
New age of big data
7
Size matters
Those are not different numbers; those are different mindsets
8
One thousand data instances
9
One million data instances
10
A big dataset cannot be too sparse
11
Can big datasets be too dense?
12
One billion data instances
13
One trillion data instances
14
A solution to the data privacy problem
Xiang et al, Chapter 16
D1  D2  D3  D4  D5  D6
15
So what model will we learn?
16
Size of training data
17
Skewness of training data
18
Real-world example
19
Not enough (clean) training data?
20
Semi-supervised clustering
Bekkerman et al, ECML 2006
21
Semi-supervised clustering (details)
22
Crowdsourcing labeled data
23
How to label 1M instances
(build-up sequence, slides 24-39)
Got 1M labeled instances, now what?
40
Parallelization: platform choices

Platform           Communication Scheme   Data size
Peer-to-Peer       TCP/IP                 Petabytes
Virtual Clusters   MapReduce / MPI        Terabytes
HPC Clusters       MPI / MapReduce        Terabytes
Multicore          Multithreading         Gigabytes
GPU                CUDA                   Gigabytes
FPGA               HDL                    Gigabytes
41
Example: k-means clustering
42
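For reference before the parallel variants, a minimal sequential k-means sketch in Python/NumPy (an illustrative baseline, not code from the book):

import numpy as np

def kmeans(X, k, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k initial centroids
    for _ in range(iters):
        # Assignment (E) step: nearest centroid for every point
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update (M) step: each centroid becomes the mean of its points
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return centers, labels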
Parallelizing k-means
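Both steps decompose over data partitions: each worker can assign only its own points and accumulate per-cluster partial sums and counts, which a driver merges into the new centroids. A minimal sketch of that decomposition (function and variable names are illustrative, not from the book):

import numpy as np

def partial_stats(X_shard, centers):
    # Per-worker pass: assign local points, return per-cluster sums and counts.
    k, d = centers.shape
    dists = ((X_shard[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j], counts[j] = X_shard[mask].sum(axis=0), mask.sum()
    return sums, counts

def merge_stats(all_stats, old_centers):
    # Driver: combine the partial statistics from every worker into new centroids.
    sums = sum(s for s, _ in all_stats)
    counts = sum(c for _, c in all_stats)
    new_centers = old_centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers

The same pattern, carried over different communication layers, underlies the MapReduce, MPI, multicore, and GPU versions discussed below.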
Parallelization: platform choices
46
Peer-to-peer (P2P) systems
47
k-means in P2P
Datta et al, TKDE 2009
48
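Datta et al.'s algorithm lets peers converge on a clustering without any central coordinator by exchanging cluster statistics with their network neighbors; the exact protocol is in the paper. A toy sketch of the gossip idea only (not the published algorithm):

import numpy as np

def local_stats(X_local, centers):
    # Each peer assigns only its own points and forms per-cluster (sum, count) pairs.
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return [(X_local[labels == j].sum(axis=0), float((labels == j).sum()))
            for j in range(len(centers))]

def gossip_average(mine, neighbors):
    # Pairwise averaging with a neighbor; repeated rounds drive every peer's
    # estimate toward the network-wide statistics.
    return [((s1 + s2) / 2.0, (c1 + c2) / 2.0)
            for (s1, c1), (s2, c2) in zip(mine, neighbors)]

def centers_from_stats(stats, old_centers):
    # New centroid = averaged sum / averaged count; keep the old one if a cluster is empty.
    return np.array([s / c if c > 0 else old_centers[j]
                     for j, (s, c) in enumerate(stats)])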
Parallelization: platform choices
49
Virtual clusters
50
MapReduce
Mappers
Reducers
51
k-means on MapReduce
Panda et al, Chapter 2
52
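One k-means iteration maps naturally onto MapReduce: mappers emit each point keyed by its nearest centroid, and reducers average the points per centroid. A minimal pure-Python sketch of that pattern (illustrative only, not the code from Chapter 2):

def kmeans_map(point, centers):
    # Map: emit (nearest-centroid id, (point, 1)).
    nearest = min(range(len(centers)),
                  key=lambda j: sum((x - c) ** 2 for x, c in zip(point, centers[j])))
    yield nearest, (point, 1)

def kmeans_reduce(center_id, values):
    # Reduce: sum all points assigned to this centroid and divide by their count.
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + x for t, x in zip(total, point)]
        count += n
    yield center_id, [t / count for t in total]

A combiner that pre-aggregates (sum, count) pairs on each mapper node keeps the shuffled data small.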
Discussion on MapReduce
53
MapReduce wrappers
Olston et al, SIGMOD 2008
Yu et al, OSDI 2008
54
k-means in Apache Pig: input data

Document   Word   Count
doc1       new    2
doc1       york   2
55
k-means in Apache Pig: E-step

-- Input relations (see previous slide):
--   D: (d, w, id)  -- document, word, count of the word in the document
--   C: (c, w, ic)  -- centroid, word, weight of the word in the centroid
-- E-step: compute the dot product between every document and every centroid,
-- normalize by the centroid length, and keep the top-scoring centroid per document.
D_C = JOIN C BY w, D BY w;
PROD = FOREACH D_C GENERATE d, c, id * ic AS idic;
PRODg = GROUP PROD BY (d, c);
DOT_PROD = FOREACH PRODg GENERATE d, c, SUM(idic) AS dXc;      -- document-centroid dot products
SQR = FOREACH C GENERATE c, ic * ic AS ic2;
SQRg = GROUP SQR BY c;
LEN_C = FOREACH SQRg GENERATE c, SQRT(SUM(ic2)) AS lenc;       -- centroid lengths
DOT_LEN = JOIN LEN_C BY c, DOT_PROD BY c;
SIM = FOREACH DOT_LEN GENERATE d, c, dXc / lenc;               -- length-normalized similarity
SIMg = GROUP SIM BY d;
CLUSTERS = FOREACH SIMg GENERATE TOP(1, 2, SIM);               -- best centroid per document
56
k-means in Apache Pig: M-step

-- M-step: rebuild each centroid from the documents assigned to it in CLUSTERS.
D_C_W = JOIN CLUSTERS BY d, D BY d;
D_C_Wg = GROUP D_C_W BY (c, w);
SUMS = FOREACH D_C_Wg GENERATE c, w, SUM(id) AS sum;           -- per-(centroid, word) count sums
D_C_Wgg = GROUP D_C_W BY c;
SIZES = FOREACH D_C_Wgg GENERATE c, COUNT(D_C_W) AS size;      -- tuples per centroid
SUMS_SIZES = JOIN SIZES BY c, SUMS BY c;
C = FOREACH SUMS_SIZES GENERATE c, w, sum / size AS ic;        -- new centroid weights
62
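The E-step and M-step above would typically live in two Pig scripts launched repeatedly by a small driver until the centroids stop changing; every iteration therefore pays for fresh MapReduce jobs, which is the overhead discussed on the next slide. A hypothetical driver sketch (the script names and the ITER parameter are illustrative, not from the book):

import subprocess

for it in range(10):   # or loop until the centroid relation C stops changing
    subprocess.run(["pig", "-param", f"ITER={it}", "-f", "kmeans_estep.pig"], check=True)
    subprocess.run(["pig", "-param", f"ITER={it}", "-f", "kmeans_mstep.pig"], check=True)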
MapReduce job setup time
Panda et al, Chapter 2
(diagram: repeated Setup / Process / Tear down cycles, with Data passed between them)
63
k-means in DryadLINQ
// Find the centroid nearest to a point (by Euclidean norm).
Vector NearestCenter(Vector point, IQueryable<Vector> centers)
{
    var nearest = centers.First();
    foreach (var center in centers)
        if ((point - center).Norm() < (point - nearest).Norm())
            nearest = center;
    return nearest;
}

// One k-means iteration: group points by nearest centroid,
// then average each group to obtain the updated centroids.
IQueryable<Vector> KMeansStep(IQueryable<Vector> vectors,
                              IQueryable<Vector> centers)
{
    return vectors.GroupBy(vector => NearestCenter(vector, centers))
                  .Select(g => g.Aggregate((x, y) => x + y) / g.Count());
}

Budiu et al, Chapter 3
64
Takeaways on MapReduce wrappers
65
Parallelization: platform choices
66
HPC clusters
Gropp et al, MIT Press 1994
67
Message Passing Interface (MPI)
68
MapReduce vs. MPI
69
k-means using MPI
Pednault et al, Chapter 4
70
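Chapter 4's toolbox has its own API; purely as an illustration of the core collective-communication pattern, here is one k-means iteration with mpi4py (assumed available), where a single allreduce combines per-rank cluster statistics so that every rank ends up with the same updated centroids:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def mpi_kmeans_step(X_local, centers):
    # Each rank assigns its local points and builds per-cluster sums and counts.
    k, d = centers.shape
    labels = ((X_local[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    local = np.zeros((k, d + 1))                 # columns 0..d-1: sums, last column: counts
    for j in range(k):
        mask = labels == j
        local[j, :d], local[j, d] = X_local[mask].sum(axis=0), mask.sum()
    # One collective call replaces the shuffle: sum the statistics across all ranks.
    total = np.zeros_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)
    counts = np.maximum(total[:, d], 1.0)        # guard against empty clusters
    return total[:, :d] / counts[:, None]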
Two features of MPI parallelization
Pednault et al, Chapter 4
71
Takeaways on MPI
Ye et al, CIKM 2009
72
Parallelization: platform choices
73
Multicore
Tatikonda & Parthasarathy, Chapter 20
74
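On a single multicore machine the same step can be parallelized with shared memory across threads or processes. A minimal sketch with Python's multiprocessing (illustrative only, not Chapter 20's code):

import numpy as np
from functools import partial
from multiprocessing import Pool

def chunk_stats(X_chunk, centers):
    # Assignment step on one chunk: per-cluster sums (columns 0..d-1) and counts (last column).
    k, d = centers.shape
    labels = ((X_chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    out = np.zeros((k, d + 1))
    for j in range(k):
        mask = labels == j
        out[j, :d], out[j, d] = X_chunk[mask].sum(axis=0), mask.sum()
    return out

def multicore_kmeans_step(X, centers, n_workers=8):
    # Split the data across worker processes and sum their partial statistics.
    with Pool(n_workers) as pool:
        stats = sum(pool.map(partial(chunk_stats, centers=centers),
                             np.array_split(X, n_workers)))
    counts = np.maximum(stats[:, -1], 1.0)
    return stats[:, :-1] / counts[:, None]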
Parallelization: platform choices
75
Graphics Processing Unit (GPU)
76
Machine learning with GPUs
Coates et al, Chapter 18
Chong et al, Chapter 21
77
k-means clustering on a GPU
Hsu et al, Chapter 5
78
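The expensive part of each iteration is the point-to-centroid distance computation, an n x k dense matrix that maps directly onto GPU hardware; Chapter 5 implements it on the GPU itself. As a rough illustration of the same structure only, a CuPy sketch (CuPy and a CUDA GPU assumed; not the chapter's code), using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2:

import cupy as cp   # illustrative only

def gpu_assign(X, centers):
    # Build the full n x k squared-distance matrix with one matrix multiply,
    # then take the per-row argmin to get each point's nearest centroid.
    Xg, Cg = cp.asarray(X), cp.asarray(centers)
    d2 = (Xg ** 2).sum(axis=1)[:, None] - 2.0 * (Xg @ Cg.T) + (Cg ** 2).sum(axis=1)[None, :]
    return cp.asnumpy(d2.argmin(axis=1))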
Performance results
79
Parallelization: platform choices
80
Field-programmable gate array (FPGA)
Durdanovic et al, Chapter 7
Farabet et al, Chapter 19
81
How to choose a platform
82
Conclusion
83
Thank You!
http://hunch.net/~large_scale_survey
84