

JOINT 3D ESTIMATION OF OBJECTS AND SCENE LAYOUT

ANDREAS GEIGER (KIT) · CHRISTIAN WOJEK (MPI SAARBRÜCKEN) · RAQUEL URTASUN (TTI CHICAGO)

OVERVIEW

[Figure panels: Vehicle Tracklets, Scene Labels, Vanishing Points]

We propose a novel generative model that is able to reason jointly about the 3D scene layout and the location / orientation of objects.

    Input:

• Urban video sequences (∼10 s)
• Monocular, grayscale @ 10 Hz

    Features:

• Vehicle tracklets (tracking-by-detection)
• Semantic scene labels (joint boosting)
• Vanishing points (long lines)

    Inference:

• Metropolis-Hastings sampling
• Dynamic programming

    Output:

• Street layout (road size, orientation)
• Object locations and orientations
• Intersection crossing activities

FUTURE WORK

• Find more discriminative cues
• Extend inference (e.g., pedestrians)
• Holistic scene interpretation incorporating buildings, infrastructure, vegetation and other types of vehicles

TOPOLOGY AND GEOMETRY

[Figure: the 7 scene layout topologies, numbered 1–7, in bird's eye view.]

We model street scenes in bird’s eye perspective using 7 scene layouts θ (left) and the following geometry parameters R (right):

• Center of intersection c
• Street width w
• Scene rotation r
• Crossing street angle α

We model the set of possible vehicle locations with lanes connecting the streets and parking spots at the road side:

We have:
• K streets
• K(K − 1) lanes
• 2K parking spots
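To make the counting concrete, here is a minimal enumeration sketch; the function name and the tuple encoding of lanes are illustrative choices, not part of the model:

```python
def enumerate_lanes(K):
    """Enumerate the discrete vehicle locations for K streets:
    one driving lane per ordered pair of distinct streets, i.e. K(K-1)
    lanes, plus two parking strips (one per road side) per street."""
    driving = [("drive", i, j) for i in range(K) for j in range(K) if i != j]
    parking = [("park", i, side) for i in range(K) for side in ("left", "right")]
    return driving + parking

lanes = enumerate_lanes(4)            # a 4-armed intersection
assert len(lanes) == 4 * 3 + 2 * 4    # K(K-1) + 2K = 20
```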

PROBABILISTIC MODEL

Our goal is to estimate the most likely configuration R = (θ, c, w, r, α) given the image evidence E = {T, V, S}, which comprises vehicle tracklets T = {t_1, .., t_N}, vanishing points V = {v_f, v_c} and semantic labels S. Given R we assume all observations to be independent:

    p(E, R) = \underbrace{p(R)}_{\text{Prior}}
              \underbrace{\left[ \prod_{n=1}^{N} \sum_{l_n} p(t_n, l_n \mid R) \right]}_{\text{Vehicle Tracklets}}
              \underbrace{p(v_f \mid R)\, p(v_c \mid R)}_{\text{Vanishing Points}}
              \underbrace{p(S \mid R)}_{\text{Semantic Labels}}
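Inference over R uses Metropolis-Hastings sampling (see Overview). Below is a minimal random-walk sketch against this factorization; the parameter packing of R, the proposal scale, and the toy target density are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def metropolis_hastings(log_joint, R0, n_samples=5000, step=0.1, seed=0):
    """Random-walk Metropolis-Hastings over the continuous geometry
    parameters, e.g. R = (c_x, c_y, log w, r, alpha); log_joint(R) must
    return log p(E, R), i.e. the sum of the log prior and the three
    log-likelihood terms above."""
    rng = np.random.default_rng(seed)
    R = np.asarray(R0, dtype=float)
    logp = log_joint(R)
    samples = []
    for _ in range(n_samples):
        R_prop = R + step * rng.standard_normal(R.shape)   # symmetric proposal
        logp_prop = log_joint(R_prop)
        if np.log(rng.uniform()) < logp_prop - logp:        # MH accept test
            R, logp = R_prop, logp_prop
        samples.append(R.copy())
    return np.array(samples)

# Toy usage: a Gaussian stand-in for log p(E, R) around one fixed layout.
target = np.array([0.0, 20.0, np.log(8.0), 0.0, 0.0])
toy_log_joint = lambda R: -0.5 * np.sum((R - target) ** 2)
chain = metropolis_hastings(toy_log_joint, R0=np.zeros(5))
```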

    RESULTS

Automatically inferred scene descriptions: Tracklets from all frames superimposed (left). Inference result with θ known (middle) and θ unknown (right). The inferred intersection layout is shown in gray, ground truth labels are given in blue and detected activities are depicted in red.

                     Topology   Location   Orientation   Overlap   Activity
Topology known
  GP MKL                –        6.0 m       9.6 deg     44.9 %    18.4 %
  Ours                  –        5.8 m       5.9 deg     53.0 %    11.5 %
Topology unknown
  GP MKL              27.4 %     6.2 m      21.7 deg     39.3 %    28.1 %
  Ours                70.8 %     6.6 m       7.2 deg     48.1 %    16.6 %
Oracle
  Stereo              92.9 %     4.4 m       6.6 deg     62.7 %     8.0 %

Topology and Geometry Estimation: Errors / accuracies with respect to the GP baseline and stereo.

                                   Error
Felzenszwalb et al. (raw)          32.6°
Felzenszwalb et al. (smoothed)     31.2°
Our method (θ unknown)             15.7°
Our method (θ known)               13.7°

Orientation Estimation: Compared to state-of-the-art object detection, we decrease the angular orientation errors of moving objects by using context from our model. Errors are measured in bird’s eye view.

    PRIOR

    p(R) = p(\theta)\, p(c, w)\, p(r)\, p(\alpha)

    \theta \sim \delta(\theta_{\mathrm{MAP}}), \quad
    (c, \log w)^T \sim \mathcal{N}(\mu_{cw}, \Sigma_{cw}), \quad
    r \sim \mathcal{N}(\mu_r, \sigma_r), \quad
    \alpha \sim p(\alpha)

• Road parameters: R = (θ, c, w, r, α)
• Scene layout: θ ∈ {1, .., 7}
• Intersection center: c ∈ ℝ²
• Street width: w ∈ ℝ⁺
• Scene rotation: r ∈ [−π/4, π/4]
• Crossing street angle: α ∈ [−π/4, π/4]
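A small sketch of ancestral sampling from this factorized prior; the numeric means, covariances, the discretization of p(α) and the fixed θ_MAP below are invented placeholders standing in for the learned prior parameters:

```python
import numpy as np

def sample_prior(theta_map, mu_cw, Sigma_cw, mu_r, sigma_r, alpha_grid, p_alpha, rng):
    """Ancestral sampling from the prior: theta is fixed to its MAP estimate,
    (c, log w) is jointly Gaussian, r is Gaussian, and alpha follows a
    (here discretized) density p(alpha)."""
    theta = theta_map
    c_logw = rng.multivariate_normal(mu_cw, Sigma_cw)
    c, w = c_logw[:2], np.exp(c_logw[2])          # street width stays positive
    r = rng.normal(mu_r, sigma_r)
    alpha = rng.choice(alpha_grid, p=p_alpha)
    return theta, c, w, r, alpha

rng = np.random.default_rng(0)
alpha_grid = np.linspace(-np.pi / 4, np.pi / 4, 9)
R = sample_prior(theta_map=3,
                 mu_cw=np.array([0.0, 15.0, np.log(10.0)]),   # (c_x, c_y, log w)
                 Sigma_cw=np.diag([4.0, 25.0, 0.1]),
                 mu_r=0.0, sigma_r=0.1,
                 alpha_grid=alpha_grid, p_alpha=np.full(9, 1 / 9),
                 rng=rng)
```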

    VEHICLE TRACKLETS

    p(t \mid l, R) = \sum_{s_1, \ldots, s_M} p(s_1)\, p(d_1 \mid s_1, l, R)
                     \prod_{m=2}^{M} p(s_m \mid s_{m-1})\, p(d_m \mid s_m, l, R)

    p(d \mid s, l, R) = p(o \mid s, l, R)\, p(b \mid s, l, R)
    p(o \mid s, l, R) = \mathrm{Mult}(o \mid \pi_o)
    p(b \mid s, l, R) \propto \mathcal{N}(\pi(b) \mid \mu_b, \Sigma_b) + \mathrm{const}

• Tracklet: t = {d_1, .., d_M}
• Lane: l ∈ {1, .., K(K − 1) + 2K}
• Location on lane: s ∈ ℕ
• Detection: d = (o ∈ {1, .., 8}, b ∈ ℝ⁴)
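The sum over the latent lane positions s_1, .., s_M has chain structure, so p(t | l, R) can be evaluated exactly with the forward recursion; this is the dynamic-programming step listed under Inference. A minimal sketch with generic transition and emission tables standing in for p(s_m | s_{m-1}) and p(d_m | s_m, l, R):

```python
import numpy as np

def tracklet_likelihood(p_s1, trans, emit):
    """Forward algorithm for p(t | l, R): marginalize the chain
    p(s_1) p(d_1|s_1) * prod_m p(s_m|s_{m-1}) p(d_m|s_m) over s_1..s_M.
    p_s1:  (S,)   initial distribution over lane positions
    trans: (S, S) trans[i, j] = p(s_m = j | s_{m-1} = i)
    emit:  (M, S) emit[m, s] = p(d_m | s_m = s, l, R)"""
    alpha = p_s1 * emit[0]                 # forward message for the first frame
    for m in range(1, emit.shape[0]):
        alpha = (alpha @ trans) * emit[m]  # propagate along the lane, reweight
    return alpha.sum()

# Toy usage: 5 discrete positions along one lane, a 3-detection tracklet.
S, M = 5, 3
p_s1 = np.full(S, 1 / S)
trans = np.triu(np.ones((S, S)))           # toy dynamics: only move forward
trans /= trans.sum(axis=1, keepdims=True)
emit = np.random.default_rng(0).uniform(size=(M, S))
print(tracklet_likelihood(p_s1, trans, emit))
```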

    VANISHING POINTS

    p(v_f \mid R) \propto \delta_f\, \mathcal{N}(v_f \mid \mu_f, \sigma_f) + \mathrm{const}
    p(v_c \mid R) \propto \delta_c\, \mathcal{N}(v_c \mid \mu_c, \sigma_c) + \mathrm{const}

• Forward vanishing point: v_f ∈ ℝ
• Crossing vanishing point: v_c ∈ [−π/4, π/4]
• Existence variables: δ_f, δ_c ∈ {0, 1}
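A one-function sketch of this robust term: a Gaussian around the vanishing point predicted by R, mixed with a constant floor so that a missing detection (δ = 0) or an outlier does not veto the layout hypothesis. The value of the constant here is an arbitrary placeholder:

```python
import numpy as np

def vp_likelihood(v, mu, sigma, delta, const=1e-3):
    """p(v | R) ∝ delta * N(v | mu, sigma) + const: Gaussian around the
    vanishing point predicted by R, plus a constant outlier floor."""
    gauss = np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return delta * gauss + const

print(vp_likelihood(v=0.12, mu=0.10, sigma=0.05, delta=1))
```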

    SEMANTIC LABELS

    p(S \mid R) \propto \exp\!\left( \gamma \sum_{c} \sum_{(u,v) \in S_c} S^{(c)}_{u,v} \right)

• Class: c ∈ {road, building, sky}
• Pixels of reprojected model: S_c
• Scene label probability: S^{(c)}_{u,v}
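A sketch of evaluating this term: reproject the hypothesized layout R into the image to obtain one pixel mask per class, then accumulate the classifier's per-pixel scores inside each mask. The masks and score maps below are toy placeholders:

```python
import numpy as np

def semantic_log_potential(masks, label_probs, gamma=1.0):
    """log p(S | R), up to a constant: gamma * sum_c sum_{(u,v) in S_c} S_uv^(c).
    masks[c]: boolean image marking the pixels the reprojected layout assigns
    to class c; label_probs[c]: the classifier's per-pixel score map for c."""
    return gamma * sum(label_probs[c][masks[c]].sum() for c in masks)

# Toy usage on a 4x6 image with road and sky regions.
H, W = 4, 6
masks = {"road": np.zeros((H, W), bool), "sky": np.zeros((H, W), bool)}
masks["road"][2:, :] = True        # reprojected road covers the lower half
masks["sky"][0, :] = True          # top row is sky
rng = np.random.default_rng(0)
label_probs = {c: rng.uniform(size=(H, W)) for c in masks}
print(semantic_log_potential(masks, label_probs))
```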