39
DIAGNOSING OPEN- SOURCE COMMUNITY HEALTH WITH SPARK William Benton Red Hat Emerging Technology [email protected]

DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

DIAGNOSING OPEN-SOURCE COMMUNITY HEALTH WITH SPARKWilliam BentonRed Hat Emerging [email protected]

Page 2: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

2

BACKGROUND

Page 3: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

2

BACKGROUND

Page 4: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

WHAT MAKES A GREAT OPEN-SOURCE COMMUNITY?

3

Page 5: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

4

Page 6: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

QUICK FACTS ABOUT FEDORA• Seven architectures• ~17k packages (in Fedora 22)• Tens of thousands of contributors• Millions of users• Many moving parts in Fedora infrastructure

5

Page 7: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

6

Page 8: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

Page 9: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

Page 10: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot

Page 11: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji

Page 12: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji Apache mod_wsgi

bodhi

accounts

ticketing

package database

…many others!

Page 13: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji Apache mod_wsgi

bodhi

accounts

ticketing

package database

…many others!

IRC gateway

CLI tools

web gateway

#fedora-fedmsg

Page 14: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji Apache mod_wsgi

bodhi

accounts

ticketing

package database

…many others!

IRC gateway

CLI tools

web gateway

notifications#fedora-fedmsg

Page 15: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji Apache mod_wsgi

bodhi

accounts

ticketing

package database

…many others!

IRC gateway

CLI tools

web gateway

notifications

datanommer(warehouse)

#fedora-fedmsg

Page 16: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

7

fedmsg bus

packageSCM

fedmsg-hub

ansible

MediaWiki(via mod_php)

COPR and secondary arch builds

meetbot koji Apache mod_wsgi

bodhi

accounts

ticketing

package database

…many others!

IRC gateway

CLI tools

web gateway

notifications

Fedora badges

datanommer(warehouse)

#fedora-fedmsg

Page 17: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

8

Page 18: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

9

Page 19: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

MESSAGE SOURCES• Bulk historical data (PostgreSQL dump)• Specific historical records (REST/JSON)• Streaming messages (∅MQ)

10

Page 20: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

11

CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text);

CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text);

Page 21: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

11

CREATE TABLE i certificate signature topic _msg category source_name source_version msg_id );

CREATE TABLE messages ( id integer NOT NULL, i integer NOT NULL, "timestamp" timestamp NOT NULL, certificate text, signature text, topic text, _msg text NOT NULL, category text, source_name text, source_version text, msg_id text);

Page 22: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

“I’LL JUST USE JSON!”

12

{ "timestamp": "now", "you-have": [{"problems": 2}]}

Page 23: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

JSON AND SCHEMA INFERENCE

13

{ "animals" : { "duck" : { "mammal": false, "bill": true, "eggs": true }, "zebra" : { "mammal": true, "bill": false, "eggs": false }, "platypus" : { "mammal": true, "bill": true, "eggs": true } } }

Page 24: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

JSON AND SCHEMA INFERENCE

13

{ "animals" : [ {"k": "duck", "v": { "mammal": false, "bill": true, "eggs": true } }, {"k": "zebra", "v": { “mammal": true, "bill": false, "eggs": false }, /* ... */ ] }

Page 25: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

DIVERGING SCHEMAS

14

{ "branches" : ["f22", "master"] }

{ "branches" : [ { "name": "f22", "commit" : "05afffa1" } ] }

Page 26: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

PREPROCESSING WITH JSON4S

15

// use json4s and partial functions // see the Silex library for an easy interface! def renameBranches(msg: JValue) = { msg transformField { case JField("branches", v@JArray(JString(_)::_)) =>

("pkg_branches", v) case JField("branches", v@JArray(JObject(_)::_)) =>

("git_branches", v) } }

Page 27: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

16

Page 28: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

MESSAGE COUNT DISTRIBUTION

17

100%

40%

60%

80%

10 103 107

message counts (log scale)

user

s w

ith a

t mos

t

105

Page 29: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

UNIQUE TOPIC DISTRIBUTION

18

100%

40%

60%

80%

5 10 15 20 25unique topics

user

s w

ith a

t mos

t

Page 30: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

TOPIC SETS AS FEATURES• Idea: characterize users by the sets of

unique message topics affecting them• Use ML Pipeline transformers to go from

sets (modeled as arrays) to feature vectors

19

Page 31: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

TOPIC SETS AS FEATURES

20

// expose this Spark-private DataType package org.apache.spark.hacks { type VectorType = org.apache.spark.mllib.linalg.VectorUDT }

Page 32: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

TOPIC SETS AS FEATURES

21

trait SVParams extends Params { val inputCol = new Param[String](this, "inputCol", "...") val outputCol = new Param[String](this, "outputCol", "...") val vecSize = new IntParam(this, "vecSize", "...")

// ... }

Page 33: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

TOPIC SETS AS FEATURES

21

trait SVParams extends Params { // ...

def pvals(pm: ParamMap) = ( // 1.3-specific paramMap.get(inputCol).getOrElse("topicSet"), paramMap.get(outputCol).getOrElse("features"), paramMap.get(vecSize).getOrElse(128) ) }

Page 34: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

22

class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT = new org.apache.spark.hacks.VectorType() def transformSchema(schema: StructType, params: ParamMap) = { val outc = paramMap.get(outputCol).getOrElse("features") StructType(schema.fields ++ Seq(StructField(outc, VT, true))) } def transform(df: DataFrame, params: ParamMap) = { val (inc, outc, vs) = pvals(paramMap) df.withColumn(outc, callUDF({ a: Seq[Int] => Vectors.sparse(vs, a.toArray, Array.fill(a.size)(1.0)) }, VT, df(inc))) } }

Page 35: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

CLASSIFYING FEDORA USERS• Problem: can we identify Fedora packager

sponsors by their message topic sets?• Logistic regression was easy to try• ~1% prediction error rate…but only ~10%

of actual sponsors were predicted as such!

23

Page 36: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

A BETTER APPROACH

24

Building a decision tree identified the following message topics as likely to predict that someone was a Fedora packager sponsor:

• org.fedora.prod.fas.role.update• org.fedora.prod.fas.user.update

Page 37: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

BACKGROUND FEDORA MESSAGING BULK DATA INGEST ANALYZING COMMUNITY NEXT STEPS

25

Page 38: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

WHAT’S NEXT• More about fedmsg: http://fedmsg.com• Our Silex library includes helpers for JSON

data ingest: http://silex.freevariable.com• More details about this project:

http://chapeau.freevariable.com1

261 http://chapeau.freevariable.com/2015/06/summit-fedmsg.html

Page 39: DIAGNOSING OPEN- SOURCE COMMUNITY …22 class SetVectorizer(override val uid: String) extends Transformer with SVParams { // hopefully VectorUDT will be public soon! final val VT =

THANKS!27