
Towards Learning to Play Piano with Dexterous Hands and Touch

Huazhe Xu 1,2, Yuping Luo 3, Shaoxiong Wang 4, Trevor Darrell 1, Roberto Calandra 2

Abstract— The virtuoso plays the piano with passion, poetry and extraordinary technical ability. As Liszt said, "(a virtuoso) must call up scent and blossom, and breathe the breath of life." The strongest robots that can play a piano are based on a combination of specialized robot hands/pianos and hardcoded planning algorithms. In contrast, in this paper we demonstrate how an agent can learn directly from machine-readable music scores to play the piano with dexterous hands on a simulated piano, using reinforcement learning (RL) from scratch. We demonstrate that the RL agents can not only find the correct key positions but also deal with various rhythmic, volume, and fingering requirements. We achieve this by using a touch-augmented reward and a novel curriculum of tasks. We conclude by carefully studying the aspects that are important to enable such learning algorithms, which can potentially shed light on future research in this direction.

I. INTRODUCTION

Piano playing is a complex sensorimotor task that requires precise control of the hand. Successful performance involves interacting with a combination of keys with the right amount of force and accurate timing. It takes humans years, even decades, to master the control of their hands and body for expressive piano performance; this draws a critical line between everyday tasks, such as object grasping and manipulation, and piano playing. Enabling robot hands to play the piano is a challenging and meaningful task that combines multiple lines of research. Reinforcement Learning (RL) has been shown to outperform humans in games such as Atari [1] and Go [2] and to achieve impressive results on continuous control tasks. However, RL has seldom been applied to instrument playing tasks. This can be attributed to three aspects: 1) the lack of simulation environments for instrument playing; 2) the lack of tactile sensory input, which plays an important role in the modulation of sound in many instruments; 3) the challenging nature of these tasks. In this paper, we explore the possibility of RL agents playing the piano with the help of image-based tactile sensory input.

Before RL was introduced to this task, there were several attempts to solve the automatic piano playing task. Many of them [3], [4], [5], [6], [7] relied heavily on specific designs of robot hands or mechanical setups, with which the task becomes overly simplified. Another branch of works [8], [9], [10], [11] tried to solve the task with commercial robot hands.

1 University of California, Berkeley, CA, USA. huazhe [email protected], [email protected]

2 Facebook AI Research, Menlo Park, CA, USA. [email protected]

3 Princeton University, Princeton, NJ, USA. [email protected]

4 MIT, USA. wang [email protected]

[Fig. 1 diagram blocks: Sheet Music Input, Policy, Action, Simulated Piano, Audio to Notes, Compute Reward, Touch Feedback]

Fig. 1: Playing the piano is intrinsically a multi-modal task involving vision, audio and touch. Using reinforcement learning, we demonstrate that it is possible to learn agents that play the piano by exploiting tactile sensors for control and using the corresponding notes generated as reward.

However, most of them were manually scripted and hence lack the ability to adapt and generalize to new pieces. [12] used a human-like hand to play the piano and achieved impressive results; however, the fingers are passive, which makes it impossible to perform complex scores.

In this work, we study how to leverage reinforcement learning algorithms and tactile sensors to tackle piano playing tasks adaptively, in a simulated environment, as sketched in Fig. 1.

We first formulate the piano playing task as a Markov Decision Process (MDP) so that we can apply reinforcement learning algorithms. We also explore the design of the reward function, which is usually hard to choose but crucial for a successful RL algorithm. More specifically, we train an end-to-end policy to directly output actions for controlling the hand joints and accomplishing what is prescribed in the music score. We further study how to integrate rich tactile sensory input into the proposed method, as well as to what extent tactile data help with the task.

Our contributions in this work can be summarized in three points: 1) Based on the TACTO touch simulator [13], we build a hand-piano simulation which includes a multi-finger Allegro hand equipped with tactile sensors and a small-size piano keyboard. The piano automatically translates the physical movement of the keys into MIDI signals. The environment can be used and extended to test future algorithms for piano playing tasks. 2) We are the first to formulate the piano playing task as an MDP and to use reinforcement learning algorithms on simulated piano playing tasks. This opens a welcoming avenue for benchmarking state-of-the-art RL algorithms on complex and artistic tasks.

3) We propose to use high-resolution vision-based tactile sensors to improve piano playing, especially on fingering indication. This provides useful knowledge on how much tactile data can help on such tasks, so that tactile sensing may be adopted as a common paradigm in related future tasks.

II. RELATED WORK

Piano playing has been studied for decades, both as a real-world task for robot hand control and as a stepping stone toward more general applications. Starting from the early player pianos, called Pianola [3], that required pumping a pedal to drive a recorded score on paper rolls, researchers have developed robotic systems that drive a robot hand and infer the required actions from the input signals [4], [8], [14]. To enhance the robustness of such systems, various efforts have been made, such as adding a rail to constrain the degrees of freedom [5], [6], reducing the number of fingers [15] for easier control, or even adding more fingers [16], one on top of each key. Some works also explore the use of a human-like hand to play the piano; however, the hand is either passive without force control on the fingers [12] or omits the velocity for simplicity [17]. The reverse problem, tracking piano performance of human players, is also well studied in previous work [18]. In contrast to most of the previous work, our method uses a dexterous hand that is not designed for piano playing and enables a learning agent to play the piano with the help of multi-modal sensory signals. When evaluating the performance, we also take the resulting key velocity into consideration for artistic playing, an aspect on which only limited work [19] has focused.

This is also the first time reinforcement learning is used to enable a robot hand to play the piano without manually designed motion planning algorithms. In previous work [20], [21], RL has been used to accomplish manipulation tasks such as opening doors or in-hand object manipulation. Recent work [22] has also demonstrated that simulator-trained RL policies can be transferred to real robot hands for Rubik's cube manipulation. While RL has been successfully applied to many robotic hand control tasks, most of them only require the agent to perform correct actions. In contrast, our work tries to design an effective RL algorithm that achieves not only correct actions but also temporal correctness and the right forces in instrument playing tasks. Such tasks are drastically different from the previous ones and can serve as a proxy for harder dexterous hand control tasks.

Vision-based tactile sensors [23], [24] have been developed to provide high-resolution touch signals at a high frame rate. In previous works, they demonstrated their potential to improve the performance of robot hands in sensorimotor tasks [25], [24]. Recent advances also developed robot systems that can control hands to perform challenging tasks such as manipulating cables [26] or swinging a bottle [27]. In this paper, however, we rely on the tactile sensor for recognizing which finger is being used. We also find that augmenting the whole system with tactile sensors makes the algorithm converge slightly faster thanks to the additional information. Thanks to the open-sourced tactile simulator TACTO [13], we can incorporate virtual sensors with the robot hand for better piano performance.

III. REINFORCEMENT LEARNING OF PIANO PLAYING

In this section, we introduce our system for enabling the agent to play the piano; Fig. 2 shows a schematic diagram of the system. In Section III-A, we first formulate the task as a Markov decision process and detail the components of the MDP. In Section III-B, we introduce the representation of the music score. In Section III-C, we introduce the off-policy RL algorithm we use for this task, the neural network architecture, and the training procedure.

A. Formulation

We formalize piano playing as a Markov decision process (MDP) [28] in which we select the action that minimizes the difference between the attacked keys and the target keys from a digital music score. In this MDP, at every time step $t$, the robot receives an observation $o_t$ from the observation space $\mathcal{O}$. The agent is supposed to choose an action $a_t$ from the action space $\mathcal{A}$ that directs the hand to a new pose relative to its current pose. The unknown transition dynamics of the environment (the piano) can be expressed with the probability density of the next observation, $p : \mathcal{O} \times \mathcal{A} \times \mathcal{O} \to [0, \infty)$. The environment emits a bounded reward $r : \mathcal{O} \times \mathcal{A} \to [r_{\min}, r_{\max}]$. The reward is computed based on human priors about how to evaluate the played keys when the target music score is available.

Under this formulation, we describe the observation space, action space, and reward structure below:

1) Observation Space: The robot's observation space includes the recent history of image-based tactile sensory output from each finger, the robot hand position, the robot finger joint angles and velocities, and the piano key angles and velocities: $o_t = \{o^{\text{tacto}}_{t-1}, o^{\text{tacto}}_t, o^{\text{hand position}}_t, o^{\text{hand joint angles}}_t, o^{\text{hand joint vel}}_t, o^{\text{key joint angles}}_t, o^{\text{key vel}}_t\}$. Although more information might give better performance, in practice we exclude all the key information except for the ablation study.

Tacto Tactile Image. The robot can observe the history of tactile sensory output over the last 2 timesteps, which are 60×80 depth-like images: $o^{\text{tacto}}_t \in \mathbb{R}^{60 \times 80}$.

Hand Position. The robot can observe the Cartesian position of the wrist of the hand. We fix the hand height to a constant value, so this becomes a two-dimensional observation: $o^{\text{hand position}}_t \in \mathbb{R}^{2}$.

Hand Joint Angles and Velocities. The robot observes the current hand joint angles and, optionally, velocities. In our setup, there are in total 12 joints in use for the Allegro hand: $o^{\text{hand joint angles}}_t \in \mathbb{R}^{12}$, $o^{\text{hand joint vel}}_t \in \mathbb{R}^{12}$.

Key Joint Angles and Velocities. The robot can optionally observe the current piano key joint angles and velocities. We note that these are privileged knowledge on a real-world piano; we add them mainly for ablation purposes: $o^{\text{key joint angles}}_t \in \mathbb{R}^{12}$, $o^{\text{key vel}}_t \in \mathbb{R}^{12}$.
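For concreteness, a minimal sketch of how such an observation could be assembled is shown below; the dictionary layout, key names, and helper signature are illustrative assumptions rather than the environment's actual interface.

```python
import numpy as np

def build_observation(tacto_prev, tacto_cur, hand_pos, joint_angles,
                      joint_vels, key_angles=None, key_vels=None):
    """Illustrative assembly of o_t with the shapes listed above."""
    obs = {
        "tacto": np.stack([tacto_prev, tacto_cur]),    # (2, 60, 80) depth-like images
        "hand_position": np.asarray(hand_pos),         # (2,) wrist x/y at a fixed height
        "hand_joint_angles": np.asarray(joint_angles), # (12,)
        "hand_joint_vel": np.asarray(joint_vels),      # (12,)
    }
    if key_angles is not None:                         # privileged, ablation only
        obs["key_joint_angles"] = np.asarray(key_angles)
        obs["key_vel"] = np.asarray(key_vels)
    return obs
```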

[Fig. 2 diagram blocks: Music Score, MIDI Converter, Notes, Reward, Proprioceptive sensing, Tactile sensing, Hand Sensory Output, ConvNets, Fully Connected Nets, Action, Simulated Piano]

Fig. 2: The overview of the proposed method. The observation space is composed of three components: 1) vectorized MIDI sheet music which is converted through a MIDI converter; 2) tactile sensory data from the DIGIT sensor processed by a convolutional neural network; 3) robot hand joint and wrist information extracted from the simulator. We use a fully connected neural net as the policy network that interacts with the simulated piano. Then we compute the reward function based on the MIDI music and the output audio from the piano.

2) Action Space: Given the observations elaborated in Section III-A.1, the robot learns a parameterized policy $\pi$ to generate actions comprising 1) movements of the finger joints and 2) movement of the hand position: $a_t = \{a^{\text{finger 1}}_t, a^{\text{finger 2}}_t, a^{\text{finger 3}}_t, a^{\text{wrist}}_t\}$.

Finger Movements. The robot can control each joint of the fingers with either a specified velocity or torque: $a^{\text{finger}_i}_t \in [-1, 1]$, where $-1$ indicates lifting the finger with maximum speed/torque and $1$ means pressing the finger down with maximum speed/torque.

Hand Movement. The robot can move the hand wrist position. Hence, the hand movement action is $a^{\text{wrist}}_t \in [-1, 1]$, where $-1$ indicates moving toward the left and $1$ indicates moving toward the right. To alleviate exploration issues, we can also discretize the action space.
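As a concrete illustration, a normalized action could be mapped to simulator commands roughly as in the sketch below; the joint indices, scaling constants, and the use of PyBullet velocity control are assumptions for the example, not the exact setup of the paper.

```python
import pybullet as p

MAX_JOINT_VEL = 2.0   # rad/s, assumed scaling of the normalized finger command
MAX_WRIST_VEL = 0.05  # m/s, assumed scaling of the normalized wrist command

def apply_action(hand_id, finger_joint_ids, action):
    """Map a normalized action in [-1, 1] to joint velocity targets."""
    finger_cmds, wrist_cmd = action[:-1], action[-1]
    for joint, a in zip(finger_joint_ids, finger_cmds):
        p.setJointMotorControl2(hand_id, joint,
                                controlMode=p.VELOCITY_CONTROL,
                                targetVelocity=float(a) * MAX_JOINT_VEL)
    # Lateral wrist velocity, applied separately to the hand base.
    return wrist_cmd * MAX_WRIST_VEL
```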

3) Reward Structure: One important building block for using RL agents to play the piano is an informative reward function. An ideal reward function should satisfy a few requirements: (1) it penalizes the agent when a key is not attacked although the score indicates it should be; (2) it penalizes the agent when the correct key is hit but with a wrong velocity.

We first introduce the most general reward as

$r(s, a) = \text{Const} - \sum_{i=1}^{|K|} d(k_i, \hat{k}_i)$ ,    (1)

where $K$ is the set of all the keys that can possibly be attacked, $|K|$ indicates the cardinality of the set $K$, and $d(\cdot,\cdot)$ is a distance between the played key $k_i$ and the corresponding target key $\hat{k}_i$ based on the difference between their normalized velocities,

$d(k, \hat{k}) = |v_k - v_{\hat{k}}|$ .    (2)

We add a small constant so that, even when the agent plays one incorrect key, the reward is still positive. This is important for encouraging the agent to play a key rather than doing nothing. We also note that the distance function can be chosen based on empirical performance.
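As a minimal illustration of Eq. (1)-(2), and assuming the played and target velocities are available as arrays indexed by key (a simplification of the environment's actual bookkeeping):

```python
import numpy as np

def simple_reward(played_vel, target_vel, const=1.0):
    """Eq. (1)-(2): per-key absolute difference between the normalized
    played and target key velocities, offset by a small constant.
    `played_vel` and `target_vel` are length-|K| arrays over all keys."""
    return const - np.abs(np.asarray(played_vel) - np.asarray(target_vel)).sum()
```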

However, for harder tasks, we might need a more informative dense reward function to guide the agent. We use $S$, $X$, $T$ to represent the played keys, imagined keys, and target keys (in the music score) at every time step, instead of using $K$ for all the possible keys. Let $S$ be the set of attacked keys and $T$ the set of goal keys. We want to design a distance function $l(S, T)$ between $S$ and $T$ such that for any $T$, the minimizer of $l(S, T)$ w.r.t. $S$ is unique and equal to $S = T$, with $l(T, T) = 0$. Hence, we obtain the dense reward function

$r(s, a) = -W_{d,\sigma}(S, T)$ ,    (3)

where the distance function $W_{d,\sigma}$ can be written as

$W_{d,\sigma}(S, T) = \min_{|X|=|T|} \Big[ \sum_i d(X_i, T_i) + \sigma |X \triangle S| \Big]$ .    (4)

Our proposal here is inspired by the earth mover's distance [29]. Suppose we want to modify $S$ into $T$, with three possible operations: (a) remove an item from $S$ with cost $\sigma > 0$; (b) add an item to $S$ with cost $\sigma$; (c) modify an item in $S$ into an arbitrary item with cost $d(\cdot,\cdot)$. The distance between $S$ and $T$ can then be defined as the minimal cost among all modification schemes from $S$ to $T$. We note that all three operations have a corresponding operation for the agent: (a) release a piano key, (b) attack a piano key, and (c) move from one attacked key to another. The solution to the optimal modification scheme is then given by Equation (4). More precisely, we break the optimal modification scheme into two steps: first we add or remove items, then we modify items. The optimized $X$ in Equation (4) is essentially the modified $S$ after the first step. The cost from $S$ to $X$ is $\sigma |X \triangle S|$, and the cost from $X$ to $T$ is $\sum_i d(X_i, T_i)$. The optimization w.r.t. $X$ is done using SciPy [30].
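One way to carry out this optimization exactly is to cast Eq. (4) as a minimum-cost bipartite matching between played and target keys and solve it with SciPy's linear_sum_assignment. The sketch below is an illustrative reconstruction under that assumption, not necessarily the authors' implementation; the key representation and the distance d are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_distance(S, T, sigma, d):
    """Illustrative solver for W_{d,sigma}(S, T) in Eq. (4).

    S, T : lists of played / target keys (e.g. (key_index, velocity) pairs).
    d    : pairwise distance between a played key and a target key.
    Each target is either served by modifying a played key (cost d) or by a
    freshly added attack (cost sigma, assuming the new key matches exactly);
    unused played keys are released at cost sigma.
    """
    nS, nT = len(S), len(T)
    # Augmented cost matrix: real keys/targets plus dummy rows/columns that
    # model the "attack a new key" and "release a key" operations.
    C = np.zeros((nS + nT, nT + nS))
    for i, s in enumerate(S):
        for j, t in enumerate(T):
            C[i, j] = d(s, t)          # modify played key i into target j
    C[:nS, nT:] = sigma                # release played key i
    C[nS:, :nT] = sigma                # attack a new key for target j
    rows, cols = linear_sum_assignment(C)   # optimal matching via SciPy
    return C[rows, cols].sum()

# Dense reward of Eq. (3), with an assumed velocity-plus-key distance:
d_vel = lambda s, t: abs(s[1] - t[1]) if s[0] == t[0] else 1.0
reward = -set_distance(S=[(60, 0.8)], T=[(60, 0.7), (64, 0.5)], sigma=0.5, d=d_vel)
```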

B. Music Score Representation

Music can be represented by a waveform or a spectrogram. However, it is hard to let the robot take such continuous input and to design reward functions with these representations. Hence, we choose the Musical Instrument Digital Interface (MIDI) as the audio representation. MIDI has timing information for note-on and note-off events and velocity information for each activated note. We freeze the tempo to 60 beats per minute. To augment the MIDI representation, we also include fingering information that indicates which finger to use for a note. Fingering indications are usually provided even for humans, for whom it is hard to explore and find the optimal solution.

In practice, we use a matrix $S \in \mathbb{R}^{T \times K \times 3}$ to represent the score. Here $T$ represents the length of each episode, $K$ represents the number of keys, and 3 means there are note, velocity, and fingering channels. For the note dimension, we use a binary variable in $\{0, 1\}$ to indicate which note to play. In the velocity dimension, we use a float number between 0 and 1.

This representation can be converted to a MIDI file easily. The score is composed of several notes $N_i \in \mathbb{R}^{L_i \times K \times 3}$, where we require $\sum_i L_i = T$.
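The sketch below illustrates one way such a score matrix could be turned into a MIDI file, assuming each score step lasts one quarter note at 60 BPM; the keyboard size, base pitch, and the use of the mido library are assumptions made for illustration.

```python
import numpy as np
import mido

BASE_NOTE = 60  # MIDI pitch of the leftmost simulated key (assumed)
TICKS = 480     # ticks per beat; one score step = one quarter note at 60 BPM

def score_to_midi(S, path="score.mid"):
    """Convert a T x K x 3 score matrix (note, velocity, finger) to MIDI."""
    mid = mido.MidiFile(ticks_per_beat=TICKS)
    track = mido.MidiTrack()
    mid.tracks.append(track)
    track.append(mido.MetaMessage("set_tempo", tempo=mido.bpm2tempo(60)))
    pending = 0  # delta time (ticks) to attach to the next message
    for t in range(S.shape[0]):
        active = [k for k in range(S.shape[1]) if S[t, k, 0] > 0.5]
        for k in active:  # note-ons at the start of the step
            track.append(mido.Message("note_on", note=BASE_NOTE + k,
                                      velocity=int(S[t, k, 1] * 127), time=pending))
            pending = 0
        pending += TICKS  # the step lasts one quarter note
        for k in active:  # note-offs at the end of the step
            track.append(mido.Message("note_off", note=BASE_NOTE + k,
                                      velocity=0, time=pending))
            pending = 0
    mid.save(path)

# Example: a single medium-velocity note on the leftmost key for one beat.
S = np.zeros((3, 24, 3), dtype=np.float32)
S[0, 0, :] = [1.0, 0.6, 1.0]
score_to_midi(S, "example.mid")
```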

C. Soft Actor-Critic

We use the off-policy reinforcement learning algorithm Soft Actor-Critic (SAC) [31], based on the actor-critic framework, for training the policy. There are three types of parameters to learn in SAC: (i) the policy parameters $\phi$; (ii) a temperature $\alpha$; (iii) two sets of soft Q-function parameters $\theta_1, \theta_2$, used to mitigate overestimation [32]. We can represent the policy optimization objective as

$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\big[ \mathbb{E}_{a_t \sim \pi_\phi} [\, \alpha \log \pi_\phi(a_t \mid s_t) - Q_\theta(s_t, a_t) \,] \big]$ ,    (5)

where $\alpha$ is a learnable temperature coefficient and $\mathcal{D}$ is the replay buffer. The temperature can be learned to maintain the entropy level of the policy:

$J(\alpha) = \mathbb{E}_{a_t \sim \pi_\phi} \big[ -\alpha \log \pi_\phi(a_t \mid s_t) - \alpha \mathcal{H} \big]$ ,    (6)

where $\mathcal{H}$ is a desired target entropy. The soft Q-function parameters $\theta_i$, $i \in \{1, 2\}$, can be trained by minimizing the soft Bellman residual

$J_Q(\theta_i) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \Big[ \tfrac{1}{2} \big( Q_{\theta_i}(s_t, a_t) - \hat{Q}(s_t, a_t) \big)^2 \Big]$ ,    (7)

where

$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi_\phi(\cdot \mid s_{t+1})} \Big[ \min_{j \in \{1,2\}} Q_{\bar{\theta}_j}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\phi(a_{t+1} \mid s_{t+1}) \Big]$ ,

and where the $Q_{\bar{\theta}_j}$ are the target networks [33]. We note that we slightly abuse notation here: $s_t$ denotes the feature vector extracted from the observation $o_t$.
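A minimal PyTorch sketch of the three objectives in Eqs. (5)-(7) is given below. It assumes a squashed-Gaussian policy object exposing a sample() method that returns actions and log-probabilities, and a target entropy set by the usual heuristic; it is illustrative, not the exact training code.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q1, q2, q1_targ, q2_targ, log_alpha,
               gamma=0.99, target_entropy=-13.0):
    """Sketch of the SAC objectives in Eqs. (5)-(7); names are illustrative."""
    s, a, r, s_next, done = batch
    alpha = log_alpha.exp()

    # Critic targets (Eq. 7): clipped double-Q with the entropy bonus.
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)   # assumed (action, log-prob) output
        q_next = torch.min(q1_targ(s_next, a_next), q2_targ(s_next, a_next))
        q_hat = r + gamma * (1 - done) * (q_next - alpha * logp_next)
    q_loss = F.mse_loss(q1(s, a), q_hat) + F.mse_loss(q2(s, a), q_hat)

    # Policy objective (Eq. 5).
    a_new, logp_new = policy.sample(s)
    q_new = torch.min(q1(s, a_new), q2(s, a_new))
    pi_loss = (alpha.detach() * logp_new - q_new).mean()

    # Temperature objective (Eq. 6), following common implementations.
    alpha_loss = -(log_alpha * (logp_new + target_entropy).detach()).mean()
    return q_loss, pi_loss, alpha_loss
```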

Fig. 3: Samples from the piano-robot hand simulator. (Left) Hand playing white keys of the piano. (Right) Hand playing black keys of the piano.

The policy network is a neural network that computes $u = f_\theta(o)$, where $o$ contains a set of touch images as well as physical states: $o = (I^{\text{tacto}}, s)$. In our experiments, we compare different combinations of inputs.

1) Architecture: The policy network first has convolutional layers for the image inputs and multi-layer perceptron (MLP) layers for the vectorized state inputs. We use ReLU as the activation function. The features from both parts are then concatenated and fed into another MLP that outputs an action for controlling the agent. We also tried various design choices such as batch normalization layers or substituting the convolutional neural network with a bilinear image downsampler. We find that these design choices only have a small influence on the final performance. We attribute this to the hard exploration challenge underlying this task.
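A sketch of such an architecture is shown below; the number of tactile input channels, layer widths, and output dimension are illustrative guesses, since the paper does not specify them.

```python
import torch
import torch.nn as nn

class PianoPolicy(nn.Module):
    """ConvNet + MLP policy sketch; layer sizes are illustrative assumptions."""
    def __init__(self, tactile_channels=2, state_dim=26, action_dim=13):
        super().__init__()
        self.conv = nn.Sequential(                       # 60x80 depth-like tactile images
            nn.Conv2d(tactile_channels, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():                            # infer the flattened feature size
            n_feat = self.conv(torch.zeros(1, tactile_channels, 60, 80)).shape[1]
        self.state_mlp = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(n_feat + 128, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),       # actions bounded in [-1, 1]
        )

    def forward(self, tactile, state):
        feat = torch.cat([self.conv(tactile), self.state_mlp(state)], dim=-1)
        return self.head(feat)
```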

2) Training: During training, we use the Adam optimizer with a learning rate of 1e-3. We initially set the tunable temperature variable $\alpha = 0.02$. We allow 15000 exploration steps before training starts, for better exploration.

IV. EXPERIMENTAL RESULTS

In this section, we evaluate the RL-based agent's performance on various meaningful piano playing tasks. We also ablate the model to recognize core factors for training such agents. In particular, we try to answer the following questions:
• Can we use RL to learn to play the piano and produce correct notes, rhythm, chords and fingering?
• Can we handle the long-horizon nature of the task?
• What are the important considerations for training such agents, i.e., design choices and useful states?

A. Simulation

We build an environment using the Bullet physics engine [34] in conjunction with the TACTO tactile sensor simulator [13]. Our simulated environment consists of a piano keyboard and an Allegro [35] robot hand equipped with DIGIT tactile sensors [24] mounted on each fingertip. Fig. 3 shows visualized samples from the environment. The piano has 14 white keys and 10 black keys with 128 discretized levels of key velocity to reflect the sound of an attacked key. We simplify the dynamics of the piano keyboard with a non-linear spring force.
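The sketch below illustrates how such a simplified non-linear spring could be applied to a key joint in PyBullet; the particular stiffening form and the constants are assumptions, not the exact model used in the environment.

```python
import pybullet as p

def apply_key_spring(piano_id, key_joint, k=2.0, c=0.05, rest=0.0):
    """Illustrative non-linear key spring: stiffening cubic restoring torque
    plus viscous damping, applied every simulation step."""
    angle, vel, _, _ = p.getJointState(piano_id, key_joint)
    tau = -k * (angle - rest) - 5.0 * k * (angle - rest) ** 3 - c * vel
    p.setJointMotorControl2(piano_id, key_joint,
                            controlMode=p.TORQUE_CONTROL, force=tau)

# Once per key, disable the default joint motor so the torque takes effect:
# p.setJointMotorControl2(piano_id, key_joint, p.VELOCITY_CONTROL, force=0)
# Then, each simulation step: for j in key_joints: apply_key_spring(piano_id, j)
```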

The Allegro hand is initialized above the keyboard with randomness. Each joint of the hand can be controlled; however, for most of the experiments we only freeze certain joints that are not crucial for the task.

[Fig. 4 plots: return vs. training steps (up to 2e5) for (a) One-note Task, (b) Three-note Task, (c) Six-note Task; curves: State_Only, Tacto, Random, Scripted]

Fig. 4: Comparison on different levels of music tasks. All the experiments are run with 5 seeds. The shaded area is one standard deviation. We find that RL-based algorithms can match the performance of a manually designed controller. Tactile inputs improve the performance slightly.

[Fig. 5 plots: return vs. training steps (up to 2e5) for (a) rhythm task (quarter note + eighth notes) and (b) chord task; curves: State_Only, Tacto, Random, Scripted]

Fig. 5: Evaluation on tasks with different rhythms and chords. All the experiments are run with 5 seeds. The shaded area is one standard deviation. Our algorithm can successfully accomplish harder music tasks.

The hand can explore horizontally above the keyboard. We use the reward function defined in Equation (1) and/or Equation (3) for the task.

To evaluate the performance, we use both the accumulated reward and qualitative samples.

B. Learning to play the piano

We list representative tasks that mimic how realistic amateur piano players learn to play.
One-Note: Play one specified quarter note with the corresponding velocity.
Three-Note: Play three specified quarter notes with the corresponding velocities.
Six-Note: Play six specified quarter notes with the corresponding velocities.
Rhythmic Tasks: Play three specified notes that are randomly chosen from quarter notes, eighth notes and triplets.
Chord Task: Play three chords with the corresponding velocities. Two notes are played simultaneously in a chord.

We now answer the question: can we use RL to learn to play the piano? In Fig. 4, we find that the RL agent can achieve reasonable return after training for 2e5 iterations on the one-note, three-note, and six-note tasks. The average return per note becomes lower as the task difficulty increases. We attribute this to the longer horizon in harder tasks and the corresponding exploration challenge. Comparing the RL agent with tactile images to the agent without them, we find that the tactile sensor may improve the learning efficiency slightly. However, the final performance is similar if the agent is trained sufficiently long. We also compare the RL-based agents with simple controllers and scripted agents. We find that RL-based agents can outperform a simple random note-hitting controller by a large margin while achieving performance similar to the scripted agents. We note that the scripted agents are hardcoded with human priors about this task.

In Fig. 5, we experiment with different tempo types that include eighth notes (half a beat). We find that the proposed method has the ability to handle not only quarter notes but also mixtures of different note durations. In Fig. 5, we also experiment with the chord task of playing multiple notes simultaneously. We find that the agent can successfully play the first two chords. For the third chord, there is occasionally one wrong note. We attribute this to the lack of exploration toward the end of a task. This might be fixed by a better exploration strategy and more exploration steps. After confirming that RL can successfully achieve piano playing tasks, we further study the effect of different sensory observations on the proposed tasks. In Fig. 4, we have three more ablative experiments with 1) tactile sensory input, marked as TACTO, 2) ground-truth hand position and velocity, marked as hand info, and 3) ground-truth hand info plus piano keyboard position and velocity, marked as hand+piano info. We see that after adding the ground-truth information, the reward becomes higher. This indicates that the policy learns to leverage the provided ground-truth information for better control.

[Fig. 6 plot: average return vs. number of notes (0, 1, 3, 6, 10, 20); curves: Compositional, RL]

Fig. 6: Comparison between compositional execution of policies and an end-to-end trained policy for the task. We observe that for longer executions, our compositional approach outperforms the simple end-to-end learning approach.

Another observation is that the tactile sensory input (TACTO) outperforms the baseline. This indicates that the TACTO sensor can be helpful for piano playing tasks when no ground-truth information is provided.

C. Piano playing agent with fingering indicator

In this section, we investigate whether the agent can play the notes correctly with a tactile-based fingering indicator. We apply the same algorithm with the reward function described in Section III. We note that a well-trained agent would have similar returns, as fingering indication is not part of the evaluated reward; however, qualitatively the results can be different. In Fig. 7, we demonstrate that the proposed reward results in correct and more efficient fingering. The agent without the fingering indication reward cannot find the correct fingering. This is very similar to piano learning in the real world, where amateur players have a hard time finding the most efficient fingering without fingering indications.

D. Investigation of factors for RL training

1) Reward Design: We compare the rewards described in Section III-A.3. Specifically, we have three reward functions: 1) linear difference reward, 2) squared difference reward, and 3) Wasserstein reward.

In Table II, we compare the results based on the different reward functions. We note that since different reward functions cannot be compared directly, we evaluate the trained agents with the linear reward. We find that the Wasserstein reward outperforms the other two rewards even when evaluated on the linear reward. This is because the Wasserstein reward provides dense feedback and hence can guide the agent to a better policy. We also find that the squared difference reward has slightly lower performance than the linear difference reward. This might be due to an improper scale of the reward function.

2) Exploration Strategy: We ablate exploration strategies that embed different amounts of prior knowledge. Specifically, we have three exploration strategies: 1) uniform actions, where all actions are sampled uniformly; 2) uniform actions with repeats, where each action has a certain probability of repeating the previous action; 3) action repeats with a wrist stabilizer, where on top of 2) we add a constraint that the agent's wrist cannot move too often.

TABLE I: Evaluation of different exploration strategies. All the experiments are run with 5 seeds. One standard deviation is in parentheses. Adding more priors to the exploration strategy significantly improves the performance.

                        Avg. Return (Std)
Prior Explore           119.37 (61.2)
Uniform Move Hand       67.82 (61.3)
No Action Repeat        74.52 (39.4)

TABLE II: Evaluation of different reward functions. All experiments are run with 5 seeds. One standard deviation is shown in parentheses. The results are evaluated with the Linear reward. Denser rewards improve the performance slightly.

                Avg. Return (Std)
Linear          35.21 (19.32)
Squared         17.55 (16.12)
Wasserstein     43.03 (20.08)

We note that these exploration strategies embed increasingly strong human priors. However, these priors are intuitive, derived from human experience, and easy to implement. From Table I, we observe that adding human priors to the exploration strategy improves the performance by a large margin. We use these exploration strategies to collect data before training starts. We believe stronger priors might lead to even better performance; however, we only implement the two most natural ones.
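As an illustration, the prior-based exploration could look roughly like the sketch below; the repeat probability, the wrist-hold interval, and the assumption that the last action dimension is the wrist command are guesses made for the sake of the example.

```python
import numpy as np

def prior_exploration_action(prev_action, action_dim, rng,
                             p_repeat=0.8, wrist_hold=10, step=0):
    """Uniform exploration with action repeats and a wrist stabilizer."""
    if prev_action is not None and rng.random() < p_repeat:
        action = prev_action.copy()                 # repeat the previous action
    else:
        action = rng.uniform(-1.0, 1.0, size=action_dim)
    if step % wrist_hold != 0 and prev_action is not None:
        action[-1] = prev_action[-1]                # keep the wrist command fixed
    return action

# Example: collect one exploratory action with a fresh random generator.
rng = np.random.default_rng(0)
a0 = prior_exploration_action(None, action_dim=13, rng=rng)
```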

3) Controlling Method: We compare the control modes for the tasks, namely velocity control and torque control. In Fig. 8, we find that there is no significant difference between the two control modes. Although it is usually harder to train RL with torque control, in the piano tasks they perform similarly.

4) Action Space Design: In this section, we compare the effect of different action spaces on the performance. Specifically, we compare controlling all 9 joints versus controlling the 3 lower joints of the hand. In Fig. 9, we find that using multiple joints leads to higher variance and lower rewards during the early stage of training. However, after some iterations, multi-joint training can match the performance of using only one joint per finger. This allows more flexibility for more dexterous tasks. However, we find that on harder tasks it is not possible to achieve reasonable results with multiple joints.

E. Compositional Execution for long-horizon tasks

Many realistic piano pieces are long. Hence, RL algorithms might suffer from the long-horizon nature of the task and fail to learn to play the required notes. In this section, we answer the question of how RL agents can deal with the long-horizon nature of music tasks. A natural approach is to learn an easier task within a shorter horizon L and repeatedly execute the policy with a sliding window that only shows the part of the composition within L.

[Fig. 7 images: rollouts w/ Finger Indicator vs. w/o Finger Indicator (less efficient fingering)]

Fig. 7: Qualitative results for training with or without tactile fingering indicators. The robot hand trained without fingering indicators easily converges to a local optimum using the wrong fingering. The robot hand trained with tactile fingering indicators plays the notes with the most efficient finger usage. This is similar to human piano learners, who have a hard time finding the best fingering for a specific piece.

[Fig. 8 plot: return vs. training steps (up to 2e5); curves: Velocity_Control, Torque_Control]

Fig. 8: Comparison of using torque control versus velocity control. All the experiments are run with 5 seeds and the shaded area is one standard deviation. The results show that, in simulation, the RL algorithm can learn to solve the task in either of the two modes without significant difference in performance.

In Fig. 6, we find that when the horizons of the tasks are short, there is no difference between RL and compositional execution, while compositional execution is superior when the horizon is significantly longer. This is due to the hard exploration problem in long-horizon tasks. This also sheds some light on future research in this direction: we can train an RL agent well within a phrase or a few notes and then execute the whole piece with one policy.
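A minimal sketch of such compositional (sliding-window) execution is given below; the policy and environment interfaces and the window length are illustrative assumptions.

```python
import numpy as np

def play_piece(score, policy, env, window=3):
    """Apply a policy trained on short (window-length) excerpts over a
    sliding window of the full score; interfaces are assumed for the sketch."""
    total_return = 0.0
    for start in range(0, score.shape[0], window):
        excerpt = score[start:start + window]      # next few notes only
        obs = env.reset_to(excerpt)                # condition the env on the excerpt
        done = False
        while not done:
            action = policy.act(obs, excerpt)
            obs, reward, done, _ = env.step(action)
            total_return += reward
    return total_return
```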

V. DISCUSSION

This work opens ample research opportunities for learning-based piano playing. Our experiments show that deep reinforcement learning can enable a robot hand to play notes from scores. However, the performance deteriorates as the task horizon becomes longer. When end-to-end trainable agents fail to work, we combine compositional execution with the learning-based agents. It is therefore interesting to explore how to develop new algorithms that deal with complex long-horizon tasks. While our work opens a welcoming avenue for using reinforcement learning on piano playing tasks, there are still many steps before a real-world robot learns to play the piano. We now briefly discuss the limitations of this work and possible future directions.

The first future direction is to transfer the simulated hand and DIGIT sensor to a real hardware device.

[Fig. 9 plot: return vs. training steps (up to 2e5); curves: 10 Joints, 4 Joints]

Fig. 9: Comparison when controlling a higher number of joints (i.e., DOFs). All the experiments are run with 5 seeds, and the shaded area indicates one standard deviation. From these results we observe that, for the one-note task, RL can accomplish the task also when controlling a higher number of joints, without significant loss of performance.

The domain gap between simulation and real-world hardware is usually resolved by large-scale domain randomization or by model-based approaches. Future work can learn dynamics models directly from real-world data for model-based planning and/or reinforcement learning. Another possible way is to curate simulators with perturbed hyper-parameters for domain randomization. Moreover, in this work we use reward functions that require knowledge of the instantaneous velocity of the attacked keys. Although it can be obtained from commercial electric pianos, there are no APIs designed for robotics tasks. Besides directly requesting such an API, one possible way is to use the tactile sensory output as a surrogate for the key velocity. We also note that even the most advanced mechanical hand is not as dexterous as a human hand; this leaves a natural barrier to enabling a robot hand to play music pieces written for pianists. We have seen efforts toward designing better hands; however, making such active hands with a level of dexterity similar to humans' remains challenging. In this work, we tried to minimize human inductive bias in model-free training. While it shows the high potential of an intelligent agent, in the real world this approach would require the equivalent of approximately 5 days to learn to play three notes, which is inefficient. We believe either learning a dynamics model or injecting more inductive bias would help the agent learn faster.

For example, imitation learning is a popular approach when sufficient expert data are available. It becomes an interesting open challenge to use existing knowledge from human pianists and human data. We believe these are important and promising directions for future researchers interested in robot piano playing.

VI. CONCLUSION

In this paper, we proposed the first reinforcement-learning-based approach for learning to play the piano with robot hands equipped with tactile sensors. Since the task is new to reinforcement learning, we formalize it as an MDP and detail the specific design of all the components of the MDP. With off-the-shelf reinforcement learning algorithms, we can train a robot hand to play the piano with correct notes, velocity and fingering. We also carefully study core factors in the whole system and provide useful information for future research in this direction.

REFERENCES

[1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[2] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "A general reinforcement learning algorithm that masters chess, shogi, and go through self-play," Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
[3] P. Company, Metro-art music for the pianola piano 88 note / [Pianola Company Pty Ltd.], 3rd ed. Pianola Coy Sydney, [192-?].
[4] S. Sugano and I. Kato, "Wabot-2: Autonomous robot with dexterous finger-arm–finger-arm coordination control in keyboard performance," in IEEE International Conference on Robotics and Automation, vol. 4, 1987, pp. 90–97.
[5] J. Lin, H. Huang, Y. Li, J. Tai, and L. Liu, "Electronic piano playing robot," in International Symposium on Computer, Communication, Control and Automation (3CA), vol. 2, 2010, pp. 353–356.
[6] D. Zhang, Jianhe Lei, Beizhi Li, D. Lau, and C. Cameron, "Design and analysis of a piano playing robot," in International Conference on Information and Automation, 2009, pp. 757–761.
[7] A. Topper, T. Maloney, S. Barton, and X. Kong, "Piano-playing robotic arm," 2019.
[8] Y. Li and L. Chuang, "Controller design for music playing robot — applied to the anthropomorphic piano robot," in IEEE 10th International Conference on Power Electronics and Drive Systems (PEDS), 2013, pp. 968–973.
[9] Y.-F. Li and C.-Y. Lai, "Intelligent algorithm for music playing robot — applied to the anthropomorphic piano robot control," in IEEE 23rd International Symposium on Industrial Electronics (ISIE), 2014, pp. 1538–1543.
[10] Y.-F. Li and T.-S. Li, "Adaptive anti-windup controller design for the piano playing robot control," in IEEE International Conference on Real-time Computing and Robotics (RCAR), 2017, pp. 52–56.
[11] B. Scholz, "Playing piano with a shadow dexterous hand," Ph.D. dissertation, Universität Hamburg, 2019.
[12] J. A. E. Hughes, P. Maiolino, and F. Iida, "An anthropomorphic soft skeleton hand exploiting conditional models for piano playing," Science Robotics, vol. 3, no. 25, 2018.
[13] S. Wang, M. Lambeta, P.-W. Chou, and R. Calandra, "TACTO: A fast, flexible and open-source simulator for high-resolution vision-based tactile sensors," arXiv preprint arXiv:2012.08456, 2020.
[14] Y. Li and C. Lai, "Development on an intelligent music score input system — applied for the piano robot," in Asia-Pacific Conference on Intelligent Robot Systems (ACIRS), 2016, pp. 67–71.
[15] A. M. Batula and Y. E. Kim, "Development of a mini-humanoid pianist," in IEEE-RAS International Conference on Humanoid Robots, 2010, pp. 192–197.
[16] V. Jaju, A. Sukhpal, P. Shinde, A. Shroff, and A. B. Patankar, "Piano playing robot," in International Conference on Internet of Things and Applications (IOTA), 2016, pp. 223–226.
[17] A. J. Topper and T. Maloney, "Piano-playing robotic arm," 100 Institute Road, Worcester MA 01609-2280 USA, Tech. Rep., April 2019.
[18] C. C. Wee and M. Mariappan, "Finger tracking for piano playing through contactless sensor system: Signal processing and data training using artificial neural network," in IEEE 2nd International Conference on Automatic Control and Intelligent Systems (I2CACIS), 2017, pp. 41–45.
[19] Y. Li and C. Huang, "Force control for the fingers of the piano playing robot — a gain switched approach," in IEEE 11th International Conference on Power Electronics and Drive Systems, 2015, pp. 265–270.
[20] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, "Learning complex dexterous manipulation with deep reinforcement learning and demonstrations," arXiv preprint arXiv:1709.10087, 2017.
[21] O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray et al., "Learning dexterous in-hand manipulation," The International Journal of Robotics Research, vol. 39, no. 1, pp. 3–20, 2020.
[22] I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., "Solving Rubik's cube with a robot hand," arXiv preprint arXiv:1910.07113, 2019.
[23] W. Yuan, S. Dong, and E. H. Adelson, "GelSight: High-resolution robot tactile sensors for estimating geometry and force," Sensors, vol. 17, no. 12, p. 2762, 2017.
[24] M. Lambeta, P.-W. Chou, S. Tian, B. Yang, B. Maloon, V. R. Most, D. Stroud, R. Santos, A. Byagowi, G. Kammerer, D. Jayaraman, and R. Calandra, "DIGIT: A novel design for a low-cost compact high-resolution tactile sensor with application to in-hand manipulation," IEEE Robotics and Automation Letters (RA-L), vol. 5, no. 3, pp. 3838–3845, 2020.
[25] R. Calandra, A. Owens, D. Jayaraman, J. Lin, W. Yuan, J. Malik, E. H. Adelson, and S. Levine, "More than a feeling: Learning to grasp and regrasp using vision and touch," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3300–3307, 2018.
[26] Y. She, S. Wang, S. Dong, N. Sunil, A. Rodriguez, and E. Adelson, "Cable manipulation with a tactile-reactive gripper," arXiv preprint arXiv:1910.02860, 2019.
[27] C. Wang, S. Wang, B. Romero, F. Veiga, and E. Adelson, "SwingBot: Learning physical features from in-hand tactile exploration for dynamic swing-up manipulation," arXiv preprint arXiv:2101.11812, 2021.
[28] R. Bellman, "A Markovian decision process," Journal of Mathematics and Mechanics, vol. 6, 1957.
[29] S. Vallender, "Calculation of the Wasserstein distance between probability distributions on the line," Theory of Probability & Its Applications, vol. 18, no. 4, pp. 784–786, 1974.
[30] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors, "SciPy 1.0: Fundamental algorithms for scientific computing in Python," Nature Methods, vol. 17, pp. 261–272, 2020.
[31] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., "Soft actor-critic algorithms and applications," arXiv preprint arXiv:1812.05905, 2018.
[32] S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[33] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[34] E. Coumans and Y. Bai, "PyBullet, a Python module for physics simulation for games, robotics and machine learning," http://pybullet.org, 2016–2019.
[35] "Allegro hand," https://github.com/simlabrobotics/allegro_hand_ros.