



Supplementary Materials for
A Competence-aware Curriculum for Visual Concepts Learning via Question Answering

Qing Li [0000-0003-1185-5365], Siyuan Huang [0000-0003-1524-7148], Yining Hong [0000-0002-0518-2099], and Song-Chun Zhu [0000-0002-1925-5973]

UCLA Center for Vision, Cognition, Learning, and Autonomy (VCLA)
{liqing, huangsiyuan, yininghong}@ucla.edu, [email protected]

1 Training Details

Scene Parsing: Following [58], we generate object proposals with a pre-trained Mask R-CNN. The Mask R-CNN module is pre-trained on 4k generated CLEVR images with bounding box annotations only. We use ResNet-50 FPN as the backbone and train the model for 30k iterations with a batch size of 8.

Question Parser and Concept Embedding: For the question parser, both the encoder and the decoder are LSTMs with two hidden layers and 256-dim hidden vectors. The dimension of the word embeddings is 300. The question parser is pre-trained with 1k randomly selected question-program pairs. Please refer to Appendix A in [41] for the specification of the domain-specific language (DSL) designed to represent programs on the CLEVR dataset. Following [41], we set the dimension of the concept embeddings to 64. During the joint optimization of the concept embeddings and the question parser, we adopt the Adam optimizer [31] with a fixed learning rate of 0.001 and a batch size of 64.

Multi-dimensional IRT (mIRT): The mIRT model is implemented in Pyro [8], a probabilistic programming framework that uses PyTorch as its backend. We train the mIRT model with the Adam optimizer and a learning rate of 0.1. Training the mIRT model converges quickly, usually in fewer than 1000 iterations, so its running time is negligible compared to that of training the visual concept learner.

Training Steps: The length of each training epoch is determined by the number of questions selected for that epoch. Questions are selected by the proposed training sample selection strategy described in Section 3.5 of the main paper. We train the model starting from the easiest samples; specifically, we select 5k samples with fewer than two concepts as the starting questions. As shown in Figure 1, the number of selected questions grows as model competence increases. In the end, the model selects the few hardest questions and then converges, which also triggers an early stop since no questions are selected for the next epoch. Similarly, Figure 2 shows the accuracy of each concept at various iterations.

Training Speed: We train the model on a single Nvidia TITAN RTX card; training converges in about 10 hours, over 21 epochs (about 11k iterations). All our models are implemented in PyTorch.
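To make the parser configuration above concrete, here is a minimal PyTorch sketch of the seq2seq question parser (2-layer LSTM encoder and decoder, 256-dim hidden states, 300-dim word embeddings). The class name and vocabulary sizes are illustrative placeholders, not the actual implementation from [41], which also uses attention over the encoder states.

import torch
import torch.nn as nn

class QuestionParser(nn.Module):
    # Seq2seq parser mapping question tokens to program tokens.
    # Hyper-parameters follow the text above; the vocabulary sizes
    # (n_words, n_ops) are placeholders for illustration.
    def __init__(self, n_words=100, n_ops=50, emb_dim=300, hid_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.op_emb = nn.Embedding(n_ops, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hid_dim, n_ops)

    def forward(self, question, program):
        # Encode the question; the final (h, c) state seeds the decoder,
        # which is teacher-forced with the ground-truth program tokens.
        _, state = self.encoder(self.word_emb(question))
        dec_out, _ = self.decoder(self.op_emb(program), state)
        return self.out(dec_out)  # logits over the program vocabulary

parser = QuestionParser()
logits = parser(torch.randint(0, 100, (4, 12)), torch.randint(0, 50, (4, 8)))
print(logits.shape)  # torch.Size([4, 8, 50])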
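The mIRT paragraph above can likewise be illustrated with a short Pyro sketch. This is a hedged sketch, not the paper's exact parameterization: we assume a generic conjunctive mIRT in which each question involves a subset of concepts and the success probability is the product of per-concept 2PL-style terms, fitted by stochastic variational inference with Adam at learning rate 0.1. The data shapes (snapshots N, questions J, concepts K) and names are our own illustration.

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

# Toy data: resp[i, j] = 1 if model snapshot i answered question j
# correctly; mask[j, k] = 1 if question j involves concept k.
N, J, K = 10, 200, 15
resp = torch.bernoulli(torch.full((N, J), 0.6))
mask = torch.bernoulli(torch.full((J, K), 0.2))

def mirt(resp):
    # Latent competence per (snapshot, concept) and difficulty per
    # (question, concept), with standard normal priors.
    theta = pyro.sample("theta", dist.Normal(0., 1.).expand([N, K]).to_event(2))
    diff = pyro.sample("diff", dist.Normal(0., 1.).expand([J, K]).to_event(2))
    # Conjunctive mIRT: multiply per-concept success probabilities over
    # the concepts each question involves (assumed form, for illustration).
    p_k = torch.sigmoid(theta.unsqueeze(1) - diff)          # (N, J, K)
    p = torch.where(mask.bool(), p_k, torch.ones_like(p_k)).prod(-1)
    pyro.sample("obs", dist.Bernoulli(p).to_event(2), obs=resp)

svi = SVI(mirt, AutoNormal(mirt), Adam({"lr": 0.1}), loss=Trace_ELBO())
for step in range(1000):  # converges in fewer than 1000 iterations
    svi.step(resp)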


[Figure 1: a plot of the average number of concepts in the selected questions (y-axis, 2 to 16) against the number of training samples, i.e., batch size × iterations (x-axis, 0k to 800k).]

Fig. 1. The average number of concepts of the selected questions smoothly increases during training, which suggests that the training follows an easy-to-hard curriculum.
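The sample selection behind Figure 1 can be summarized in a few lines of Python. This is a hedged sketch of the criterion described in Section 3.5 of the main paper: based on the LB/UB values reported in Figure 3, we assume a question is selected when the log of its predicted success probability under the current competence estimate falls within [LB, UB]; the predictor below is a hypothetical stand-in for the fitted mIRT model.

import math

def predict_success(competence, difficulties):
    # Hypothetical stand-in for the fitted mIRT model: multiply
    # per-concept success probabilities (illustrative form only).
    p = 1.0
    for concept, b in difficulties.items():
        p *= 1.0 / (1.0 + math.exp(-(competence[concept] - b)))
    return p

def select_questions(questions, competence, lb=-5.0, ub=-0.75):
    # Keep questions that are hard enough to be informative (log p <= ub)
    # yet easy enough to be learnable (log p >= lb).
    keep = []
    for difficulties in questions:
        log_p = math.log(predict_success(competence, difficulties))
        if lb <= log_p <= ub:
            keep.append(difficulties)
    return keep

# Toy run: per-concept competence and questions as {concept: difficulty}.
competence = {"red": 1.2, "cube": 0.4, "metal": -0.3}
questions = [{"red": 0.0}, {"red": 0.5, "cube": 0.8}, {"metal": 2.5}]
print(len(select_questions(questions, competence)))  # 2

As competence rises, more questions clear the lower bound while previously selected ones drop below the upper bound, producing the growing-then-shrinking selection curve in Figure 1.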

2 Visualization of Selected Questions

[Figure 2: four accuracy-vs-training-samples plots (y-axis, 0.5 to 1.0; x-axis, 100k to 800k), one per attribute type: color (gray, red, blue, green, brown, purple, cyan, yellow), material (rubber, metal), shape (cube, sphere, cylinder), and size (small, large).]

Fig. 2. The accuracy of each concept at various iterations. The concepts are grouped by the attribute type.

Figure 3 shows the model's responses to the selected questions at various iterations. They illustrate the smooth, parallel growth of question difficulty and model competence during training. Specifically, in the early stages of training, the model selects easy questions in simple scenes, which involve only one or two concepts. As model competence increases, the selection strategy starts to tackle hard questions in complex scenes, involving multiple concepts combined with spatial relationships. Without any extra prior knowledge, this easy-to-hard learning process shows its smoothness and efficiency under the automatic guidance of the proposed curriculum.


[Figure 3 shows example scenes with the questions selected at iterations 5k, 75k, 400k, 600k, and 750k, including:

Q: How many shiny objects are there? A: 2
Q: What number of other objects are the same material as the brown object? A: 1
Q: Is there anything else that has the same shape as the purple thing? A: no
Q: What material is the small object? A: rubber
Q: What size is the cylinder? A: small
Q: Are there fewer metallic spheres than large cyan objects? A: no
Q: Is the number of blue things right of the metal cylinder less than the number of gray things? A: yes
Q: What is the size of the cylinder in front of the red shiny object? A: large
Q: The big object that is the same color as the big metal ball is what shape? A: cylinder
Q: The purple cube that is the same material as the purple cylinder is what size? A: large
Q: How many spheres are both to the right of the big metallic block and behind the tiny cyan block? A: 1
Q: How many green matte objects are left of the blue shiny thing? A: 1
Q: How many red objects are the same shape as the small yellow thing? A: 0
Q: What number of other objects are the same material as the large blue cube? A: 6
Q: There is a blue thing behind the big matte cylinder; what is its material? A: metal]

Fig. 3. Example questions selected at different iterations (LB = -5, UB = -0.75). The proposed model selects increasingly complex questions as training progresses. It starts learning with simple questions involving one or two concepts and moves to complex ones involving combined concepts with spatial relationships.


3 Qualitative Examples of NSCL

Figure 4 visualizes several examples of the symbolic reasoning process by the neural-symbolic concept learner. The questions of the first three examples are correctly answered by our model, and the last example is a typical error case caused by a small object under heavy occlusion.


[Figure 4 shows four symbolic reasoning traces; each parses the question into a program and executes it step by step over the scene:

Q: There is a tiny shiny sphere left of the brown cube; what color is it?
Program: Filter[brown, cube] → Relate[left] → Filter[tiny, shiny, sphere] → Query[color]
A: Brown (0.91)

Q: Does the large purple shiny object have the same shape as the tiny object that is behind the matte thing?
Program: Filter[large, purple, shiny]; Filter[matte] → Relate[behind] → Filter[tiny]; AEQuery[shape]
A: No (0.93)

Q: There is a blue cylinder in front of the cyan shiny object; are there any gray cubes that are to the left of it?
Program: Filter[cyan, shiny] → Relate[front] → Filter[blue, cylinder] → Relate[left] → Filter[gray, cube] → Exist
A: Yes (0.98)

Q: There is a blue thing behind the big matte cylinder; what is its material?
Program: Filter[big, matte, cylinder] → Relate[behind] → Filter[blue] → Query[material]
A: Rubber (0.84); GT: metal]

Fig. 4. Visualization of the symbolic reasoning process by the neural-symbolic concept learner on the CLEVR dataset. The questions of the first three examples are correctly answered by our model, and the last example is a typical error case caused by a small object under heavy occlusion.
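To make the traces in Figure 4 concrete, the following Python sketch executes one such program over a toy symbolic scene. NSCL's actual executor is soft (it keeps attention masks over objects, scored against concept embeddings); for readability this hard, fully symbolic variant uses hand-made scene data, with "matte" treated as a synonym for "rubber".

# Toy scene: two objects and one spatial relation (object 1 is behind
# object 0). All data here is hand-made for illustration.
scene = [
    {"id": 0, "size": "big", "material": "rubber", "shape": "cylinder", "color": "gray"},
    {"id": 1, "size": "small", "material": "metal", "shape": "sphere", "color": "blue"},
]
relations = {"behind": {(1, 0)}}

def filter_objs(objs, **attrs):
    # Keep objects whose attributes match all given concept values.
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def relate(objs, anchor, rel):
    # Keep objects standing in the given relation to the anchor object.
    return [o for o in objs if (o["id"], anchor["id"]) in relations[rel]]

# "There is a blue thing behind the big matte cylinder; what is its material?"
# Program: Filter[big, matte, cylinder] -> Relate[behind] -> Filter[blue] -> Query[material]
anchor = filter_objs(scene, size="big", material="rubber", shape="cylinder")[0]
target = filter_objs(relate(scene, anchor, "behind"), color="blue")[0]
print(target["material"])  # "metal", the ground-truth answer in Fig. 4

In the soft version, each Filter multiplies the current attention mask by per-object concept scores, and Query reads out the attribute of the most-attended object; the error case in Figure 4 arises when occlusion corrupts those object-level scores.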


References

1. Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6077-6086 (2018)

2. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. Conference on Computer Vision and Pattern Recognition (CVPR) pp. 39-48 (2015)

3. Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016)

4. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. ICLR (2015)

5. Baker, F.B.: The basics of item response theory. ERIC (2001)

6. Baker, F.B., Kim, S.H.: Item response theory: Parameter estimation techniques. CRC Press (2004)

7. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: International Conference on Machine Learning (ICML) (2009)

8. Bingham, E., Chen, J.P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., Goodman, N.D.: Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research (2018)

9. Bock, R.D., Aitkin, M.: Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika (1981)

10. Chrupała, G., Kádár, Á., Alishahi, A.: Learning language through pictures. Association for Computational Linguistics (ACL) (2015)

11. Dasgupta, S., Hsu, D., Poulis, S., Zhu, X.: Teaching a black-box learner. In: ICML (2019)

12. Dong, L., Lapata, M.: Language to logical form with neural attention. ACL (2016)

13. Elman, J.L.: Learning and development in neural networks: The importance of starting small. Cognition (1993)

14. Embretson, S.E., Reise, S.P.: Item response theory. Psychology Press (2013)

15. Fan, Y., et al.: Learning to teach. ICLR (2018)

16. Fazly, A., Alishahi, A., Stevenson, S.: A probabilistic computational model of cross-situational word learning. Annual Meeting of the Cognitive Science Society (CogSci) (2010)

17. Gan, C., Li, Y., Li, H., Sun, C., Gong, B.: VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation. In: ICCV. pp. 1811-1820 (2017)

18. Gauthier, J., Levy, R., Tenenbaum, J.B.: Word learning and the acquisition of syntactic-semantic overhypotheses. Annual Meeting of the Cognitive Science Society (CogSci) (2018)

19. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6904-6913 (2017)

20. Graves, A., Bellemare, M.G., Menick, J., Munos, R., Kavukcuoglu, K.: Automated curriculum learning for neural networks. In: International Conference on Machine Learning (ICML) (2017)


21. Guo, S., Huang, W., Zhang, H., Zhuang, C., Dong, D., Scott, M.R., Huang, D.: CurriculumNet: Weakly supervised learning from large-scale web images. arXiv preprint arXiv:1808.01097 (2018)

22. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

23. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

24. Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: End-to-end module networks for visual question answering. International Conference on Computer Vision (ICCV) pp. 804-813 (2017)

25. Hudson, D.A., Manning, C.D.: Compositional attention networks for machine reasoning. In: International Conference on Learning Representations (ICLR) (2018)

26. Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. CVPR (2019)

27. Jiang, L., et al.: Self-paced learning with diversity. In: NIPS (2014)

28. Jiang, L., et al.: Self-paced curriculum learning. In: AAAI (2015)

29. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

30. Johnson, J., Hariharan, B., van der Maaten, L., Hoffman, J., Li, F.F., Zitnick, C., Girshick, R.: Inferring and executing programs for visual reasoning. In: International Conference on Computer Vision (ICCV) (2017)

31. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)

32. Krueger, K.A., Dayan, P.: Flexible shaping: How learning in small steps helps. Cognition (2009)

33. Kumar, M.P., et al.: Self-paced learning for latent variable models. In: NIPS (2010)

34. Lalor, J.P., Wu, H., Yu, H.: Building an evaluation scale using item response theory. Conference on Empirical Methods in Natural Language Processing (EMNLP) (2016)

35. Lalor, J.P., Wu, H., Yu, H.: Learning latent parameters without human response patterns: Item response theory with artificial crowds. Conference on Empirical Methods in Natural Language Processing (EMNLP) (2019)

36. Liang, C., Berant, J., Le, Q., Forbus, K.D., Lao, N.: Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In: ACL (2016)

37. Liang, C., Norouzi, M., Berant, J., Le, Q., Lao, N.: Memory augmented policy optimization for program synthesis and semantic parsing. In: NIPS (2018)

38. Liu, W., Dai, B., Humayun, A., Tay, C., Yu, C., Smith, L.B., Rehg, J.M., Song, L.: Iterative machine teaching. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70. pp. 2149-2158. JMLR.org (2017)

39. Malinowski, M., Fritz, M.: A multi-world approach to question answering about real-world scenes based on uncertain input. In: Advances in Neural Information Processing Systems (NeurIPS) (2014)

40. Mansouri, F., Chen, Y., Vartanian, A., Zhu, X., Singla, A.: Preference-based batch and sequential teaching: Towards a unified view of models. In: NeurIPS (2019)

41. Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. International Conference on Learning Representations (ICLR) (2019)


42. Misra, I., Girshick, R.B., Fergus, R., Hebert, M., Gupta, A., van der Maaten, L.: Learning by asking questions. Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

43. Natesan, P., Nandakumar, R., Minka, T., Rubright, J.D.: Bayesian prior choice in IRT estimation using MCMC and variational Bayes. Frontiers in Psychology (2016)

44. Pentina, A., Sharmanska, V., Lampert, C.H.: Curriculum learning of multiple tasks. Conference on Computer Vision and Pattern Recognition (CVPR) pp. 5492-5500 (2014)

45. Perez, E., Strub, F., de Vries, H., Dumoulin, V., Courville, A.C.: FiLM: Visual reasoning with a general conditioning layer. In: AAAI Conference on Artificial Intelligence (AAAI) (2017)

46. Peterson, G.B.: A day of great illumination: B.F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior (2004)

47. Platanios, E.A., Stretcu, O., Neubig, G., Poczos, B., Mitchell, T.M.: Competence-based curriculum learning for neural machine translation. In: North American Chapter of the Association for Computational Linguistics (NAACL-HLT) (2019)

48. Reckase, M.D.: The difficulty of test items that measure more than one ability. Applied Psychological Measurement (1985)

49. Reckase, M.D.: Multidimensional item response theory models. In: Multidimensional Item Response Theory (2009)

50. Sachan, M., et al.: Easy questions first? A case study on curriculum learning for question answering. In: ACL (2016)

51. Skinner, B.F.: Reinforcement today. American Psychologist (1958)

52. Spitkovsky, V.I., Alshawi, H., Jurafsky, D.: From baby steps to leapfrog: How less is more in unsupervised dependency parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 751-759. Association for Computational Linguistics (2010)

53. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems. pp. 3104-3112 (2014)

54. Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B., Dyer, C.: Learning the curriculum with Bayesian optimization for task-specific word representation learning. ACL (2016)

55. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning (1992)

56. Wu, L., et al.: Learning to teach with dynamic loss functions. In: NeurIPS (2018)

57. Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: CLEVRER: Collision events for video representation and reasoning. ICLR (2020)

58. Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic VQA: Disentangling reasoning from vision and language understanding. In: Advances in Neural Information Processing Systems (2018)

59. Zhu, X.: Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In: Twenty-Ninth AAAI Conference on Artificial Intelligence (2015)

60. Zhu, X., Singla, A., Zilles, S., Rafferty, A.N.: An overview of machine teaching. arXiv preprint arXiv:1801.05927 (2018)