CURRENNT WaveNet Implementation

Xin WANG, National Institute of Informatics, Japan
2017-11-03
Contact: [email protected]. Questions, suggestions, and discussion are welcome.
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Before starting
• Please check the slides on the basics of CURRENNT
• Please understand how to debug with CURRENNT
• Please set up CURRENNT_SCRIPT and run the example CONFIGPOOL/config_wavenet.pm
• After running the example, please cd into the directory of the WaveNet example:
~$ cd CURRENNT_DIS/EXAMPLE/MODEL_WAVENET
~$ pwd
…/CURRENNT_DIS/EXAMPLE/MODEL_WAVENET
~$ ls
README          createMDNConfig.py  mdn.config    output_trained_network_mdn1.000000
config.cfg      data_mgcf0.mv.bin   network.jsn   trained_network.jsn
config_syn.cfg  log_train           network.jsn2
General
• This is the WaveNet structure used for the experiments
• The structure is slightly different from the literature; I will explain the differences later
[Figure: overall structure. The time-shifted waveform passes through a 1-D CNN and a stack of WaveNet blocks (dilated 1-D CNN, tanh/sigmoid gating, 1-D CNN, skip-add), each conditioned on textual/acoustic features from a sub-network; the summed skip outputs go through 1-D CNNs and a softmax that predicts the waveform.]
Elements of WaveNet
• First, consider the case without the sub-network for textual/acoustic features (this is network.jsn2 in MODEL_WAVENET)
[Figure: the same structure without the sub-network; the textual/acoustic features are fed to each block directly.]
Elements of WaveNet: "input" layers
• Let's look at the following layers in network.jsn2:
"layers": [
    { "size": 1,   "name": "input",   "type": "input" },
        // the input layer size is 1, but the waveform is not loaded here
    { "size": 256, "name": "feedbackBottom", "bias": 1.0, "type": "feedback",
      "previousDimEnd": 0, "previousDimStart": 0 },
        // the waveform is loaded here as input
    { "size": 64,  "name": "causalEmbedding", "bias": 1.0,
      "type": "feedforward_identity" },
        // feeds the waveform one-hot vector into the feedforward layer
[Figure: input (index) → feedback (waveform) → 1-D CNN, as implemented in network.jsn2.]
Elements of WaveNet: input layer in network.jsn2
• This layer loads a sequence of indices
• Waveform data are at the sample level, but the acoustic/textual features are at the frame level
• The input index simply tells the network which frame each waveform sampling point belongs to
{ "size": 1, // The input layer size is 1 "name": "input", // But waveform is not loaded here "type": "input" },
[Figure: the sample-level input index links each waveform sample to a frame of the frame-level textual/acoustic features fed to the WaveNet.]
Elements of WaveNet: input layer in network.jsn2
• In MODEL_WAVENET, the waveform is 16 kHz mu-law while the frame shift is 5 ms
• Therefore, every frame corresponds to 16 * 5 = 80 waveform sampling points
[Figure: sample-level input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 (80 numbers per frame), aligned with the frame-level acoustic/textual features Frame 0, Frame 1, Frame 2.]
Elements of WaveNet: input layer in network.jsn2
• Let's take a look at the input index
• ***.labindx in RAWDATA is the input data that will be packaged into data.nc*
• The waveform ***.raw is packaged as the output data
• Please check CURRENNT_DIS/EXAMPLE/DATA_WAVENET
# Please use Python and load the pyTools
>> from ioTools import readwrite
>> cd CURRENNT_DIS/EXAMPLE/RAWDATA
>> data = readwrite.read_raw_mat('BC2011_nancy_NYT096-008-00.labindx', 1)
>> data[0:80]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0 … 0.], dtype=float32)
# data[0:80] is zero
>> data[80:160]
array([ 1., … 1.], dtype=float32)
# data[80:160] = 1
>> data[::80]
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., … 943., 944., 945.], dtype=float32)
# data is just a sequence of indices, where data[i] = i // 80
# data.shape = (75680,), the same length as the waveform BC2011_nancy_NYT096-008-00.raw
Elements of WaveNet: input layer in network.jsn2
• Take a look at the CURRENNT code:
• If you use a different sampling rate and frame shift, please generate the correct input index (a minimal sketch follows the debugging session below)
~$ gdb --args currennt_debug --options_file config.cfg --cuda off
(gdb) b InputLayer.cu:157
(gdb) r
157    thrust::copy(fraction.inputs().begin(), fraction.inputs().end(),
(gdb) p fraction.inputs()[0]       // fraction loads input from data.nc
$1 = (float &) @0x7fffd4001660: 0  // fraction.inputs() stores the input index
(gdb) p fraction.inputs()[80]
$2 = (float &) @0x7fffd40017a0: 1  // the index increases every 80 sampling points
(gdb) p fraction.inputs()[160]
$3 = (float &) @0x7fffd40018e0: 2  // they will be copied to this->outputs()
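For a different sampling rate or frame shift, the index can be generated offline. Below is a minimal sketch, assuming the pyTools read/write helpers and that sample i simply belongs to frame i // frame_length; the file name output.labindx is only illustrative:

import numpy as np
from ioTools import readwrite

sampling_rate = 16000                                   # Hz
frame_shift_ms = 5                                      # ms
frame_length = sampling_rate * frame_shift_ms // 1000   # 80 samples per frame

num_samples = 75680                                     # length of the target waveform
index = (np.arange(num_samples) // frame_length).astype(np.float32)
readwrite.write_raw_mat(index, 'output.labindx')        # hypothetical file name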
Elements of WaveNet: feedback layer in network.jsn2
• Because the waveform is not stored as input data, it must be retrieved from the output buffer. This is implemented by the feedback layer
• Usually, a feedback layer concatenates the output of the previous layer with the feedback data. Here, it is unnecessary to load the output of the previous layer (the sequence of indices)
• Thus previousDimEnd = 0 and previousDimStart = 0 tell this layer not to load the output of the previous layer
    {                           // the example waveform is 8-bit mu-law, thus each
      "size": 256,              // waveform point is a one-hot vector of 256 dims
      "name": "feedbackBottom",
      "bias": 1.0,
      "type": "feedback",
      "previousDimEnd": 0,      // don't load input from the previous (input) layer
      "previousDimStart": 0 },
[Figure: the feedback layer retrieves the waveform from the output buffer.]
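For reference, here is a minimal sketch of standard 8-bit mu-law companding (mu = 255), which maps each sample to one of the 256 classes represented by these one-hot vectors; the exact preprocessing of the example data may differ:

import numpy as np

def mulaw_encode(x, mu=255):
    # x: float waveform in [-1, 1]; returns integer classes in [0, mu]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mulaw_decode(c, mu=255):
    # inverse mapping from classes back to waveform values in [-1, 1]
    y = 2.0 * c.astype(np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu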
Elements of WaveNet: feedback layer in network.jsn2
• Let's debug
• Check the functor internal::vectorFillForward in currennt_lib/src/layers/FeedBackLayer.cu
(gdb) b FeedBackLayer.cu:606
606    thrust::fill(this->outputs().begin(), this->outputs().end(), 0.0);
// The code after this line is the place where the feedback data are loaded.
// This implementation is based on the functor internal::vectorFillForward.
// You can debug through this part step by step. For example:
(gdb) b 626
(gdb) c
626    fn.input2 = helpers::getRawPointer(m_targetLayer->feedbackOutputs(true));
(gdb) p m_targetLayer->name()  // m_targetLayer is the layer from which the waveform is fed back
$4 = "postoutput"              // in this case, it is the last postoutput layer
105    if (lookBack != NULL)
106        lookBackTime = lookBack[dimIdx / dimInput2Valid] * parallel;
107    else
108        lookBackTime = 1;  // by default, the feedback data are shifted by 1 time
109                           // unit, so no explicit time-shift operation is needed
110
111    dimIdx = dimIdx % dimInput2Valid;
112
113    if (timeStep < lookBackTime)
114        output[outputIdx] = 0.0;
115    else {
116        output[outputIdx] = input2[(timeStep - lookBackTime) * dimInput2Total +
117                                   dimIdx + dimInput2Start];
Elements of WaveNet: feedback layer in network.jsn2
• This feedback layer shifts the waveform by one step:
[Figure: the original waveform in m_targetLayer->feedbackOutputs(true) is shifted right by one step, with a zero prepended, to form the feedback waveform in this->outputs().]
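In effect, the functor computes output[t] = input2[t - lookBackTime], with zeros for t < lookBackTime. A minimal numpy sketch of the default case (lookBackTime = 1):

import numpy as np

waveform = np.array([3, 1, 4, 1, 5])  # e.g., mu-law class indices
feedback = np.zeros_like(waveform)
feedback[1:] = waveform[:-1]          # feedback[t] = waveform[t - 1], feedback[0] = 0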
Elements of WaveNet: 1-D CNN layers in network.jsn2
• The shifted waveform one-hot vectors then go through the 1-D CNN and become embedded vectors
• This 1-D CNN is implemented by a simple feedforward layer
[Figure: the feedback waveform one-hot vectors are mapped by the 1-D CNN (feedforward layer) to dense embedded vectors.]
Elements of WaveNet: summary of the "input layers"
• The input layer loads the input index
• The feedback layer retrieves and shifts the waveform
• The 1-D CNN converts the waveform one-hot vectors into embedded vectors
[Figure: input (index) → feedback (time-shifted waveform) → 1-D CNN, as implemented in network.jsn2.]
Elements of WaveNet: WaveNet block
• Why not the alternative block structure (shown on the right of the original slide)?
• Both work (as the experiments showed)
• The alternative structure forces the 1-D CNN to use the flow from the highway pass
[Figure: the block structure used here (left) versus the alternative (right), both built from a dilated 1-D CNN, tanh/sigmoid gating, a 1-D CNN, and skip-add connections.]
• I thought the structure used here could be easier to analyze
Elements of WaveNet: WaveNet block
• network.jsn2 contains 10 WaveNet blocks. Their CNN dilation sizes are {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}
• These 10 blocks form one group
• The layers of one WaveNet block are named diluteB*L***:
~$ grep name network.jsn2
"name": "input",
"name": "feedbackBottom",
"name": "causalEmbedding",
"name": "causalSkip",
"name": "diluteB1L1cnn",
"name": "diluteB1L1wavc",
"name": "diluteB1L1out",
"name": "diluteB1L1skipadd",
"name": "diluteB1L1temp",
"name": "diluteB1L1skipouttrans",
"name": "diluteB1L1skip",
"name": "diluteB1L1temp2",
"name": "diluteB1L2cnn",
"name": "diluteB1L2wavc",
"name": "diluteB1L2out",
"name": "diluteB1L2skipadd",
"name": "diluteB1L2temp",
"name": "diluteB1L2skipouttrans",
"name": "diluteB1L2skip",
"name": "diluteB1L2temp2",
"name": "diluteB1L3cnn",
…
• input, feedbackBottom, causalEmbedding: the input layers
• diluteB1L1*: 1st block in the 1st group (causalSkip is a special layer for B1L1)
• diluteB1L2*: 2nd block in the 1st group
• B: group index; L: block index within the group
Elements of WaveNet: WaveNet block
• Block B1L1:
{ "size": 64, "name": "causalSkip", "bias": 1.0, "type": "skipini" }, { "size": 128, "name": "diluteB1L1cnn", "bias": 1.0, "type": "cnn", "window_width": "128*1", "window_tap_interval": "128*1", "causal": 1, "outputTanh": 0 }, { "size": 64, "name": "diluteB1L1wavc", "bias": 1.0, "type": "wavnetc", "contextDim": 61, "contextMV": "./data_mgcf0.mv.bin" }, { "size": 64, "name": "diluteB1L1out", "bias": 1.0,
"type": "feedforward_identity" }, { "size": 64, "name": "diluteB1L1skipadd", "bias": 1.000000, "type": "skipadd",
"preSkipLayer": "causalSkip,diluteB1L1out" },
[Figure: block B1L1 annotated: causalSkip is the starting point of the skip-add connection; diluteB1L1cnn is the dilated CNN; diluteB1L1wavc is the WaveNet core; diluteB1L1out is the feedforward (1-D CNN); diluteB1L1skipadd is the skip-add.]
Elements of WaveNet: WaveNet block
• Block B1L1
• CURRENNT is not very flexible: if a layer's output is to be used by multiple layers, "skipini" (and "skipadd") must be used to provide multiple output pipes
{ "size": 64, "name": "diluteB1L1temp", "bias": 1.0, "type": "skipini" }, { "size": 256,
"name": "diluteB1L1skipouttrans", "bias": 1.0,
"type": "feedforward_identity" }, { "size": 256, "name": "diluteB1L1skip", "bias": 1.0, "type": "skipini" }, { "size": 64, "name": "diluteB1L1temp2", "bias": 1.0, "type": "skipadd",
"preSkipLayer": "diluteB1L1temp" },
[Figure: the positions of diluteB1L1skip and diluteB1L1temp2 in the block diagram.]
Elements of WaveNet: WaveNet block
• In the implementation, one WaveNet block consists of 9 layers:

"name": "causalSkip",
"name": "diluteB1L1cnn",
"name": "diluteB1L1wavc",
"name": "diluteB1L1out",
"name": "diluteB1L1skipadd",
"name": "diluteB1L1temp",
"name": "diluteB1L1skipouttrans",
"name": "diluteB1L1skip",
"name": "diluteB1L1temp2",
[Figure: WaveNet block B1L1 mapped onto these layers: skipini (causalSkip), dilated 1-D CNN (diluteB1L1cnn), wavnetc (diluteB1L1wavc), 1-D CNN (diluteB1L1out), skipadd (diluteB1L1skipadd), skipini (diluteB1L1temp), 1-D CNN (diluteB1L1skipouttrans), skipini (diluteB1L1skip), skipadd (diluteB1L1temp2).]
Elements of WaveNet: WaveNet block
• Why so complicated? Because CURRENNT only supports a linear chain of network layers
• Bipartite or multi-partite structures must be implemented using skip connections (skipini/skipadd)
[Figure: a branching structure L0 → {L1, L2} is realized as the linear chain L0 (skipini) → L1 → L2 (skipadd); within the block, the skipini/skipadd layers route data to the next WaveNet block and to the post-processing block.]
• This principle is explained in the slides on CURRENNT_HIGHWAY
[Figure: blocks B1L1 and B1L2 chained; the skipadd output (diluteB1L1temp2) of one block feeds the dilated CNN of the next.]
• Here, between blocks, there is no need to add another skipini layer
Elements of WaveNet: WaveNet core block
• The condition features (textual or acoustic) are loaded by the 'wavnetc' layer
• This layer normalizes the condition features if contextMV is provided
{"size": 64,"name": "diluteB1L1wavc","bias": 1.0,"type": "wavnetc","contextDim": 61, // dimension of the condition features"contextMV": "./data_mgcf0.mv.bin”},
// In the example, I use MGC (60 dims) and F0 (1 dim) as the condition.
// data_mgcf0.mv.bin is a binary vector [mean_MGC, mean_F0, std_MGC, std_F0];
// thus, the length of the vector is contextDim * 2.
// The mv.bin can be read and written using the pyTools:
>> from ioTools import readwrite
>> datamv = readwrite.read_raw_mat('./data_mgcf0.mv.bin', 1)
>> datamv.shape
(122,)
[Figure: the wavnetc layer injects the textual/acoustic features into the gated dilated CNN block.]
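A minimal sketch of producing such an mv.bin, assuming the frame-level condition features have been stacked into one float32 matrix; the file name all_frames.bin is only illustrative:

import numpy as np
from ioTools import readwrite

dim = 61                                              # contextDim: 60 MGC + 1 F0
feat = readwrite.read_raw_mat('all_frames.bin', dim)  # (num_frames, 61), hypothetical
mv = np.concatenate([feat.mean(axis=0), feat.std(axis=0)]).astype(np.float32)
readwrite.write_raw_mat(mv, './data_mgcf0.mv.bin')    # length = contextDim * 2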
Elements of WaveNet: WaveNet core block
• Remember that the condition acoustic features are frame-level features while the network works at the sample level
• Here, the input index is used for this alignment
[Figure: the input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 maps each frame-level feature vector to its sampling points, so wavnetc obtains features at the sample level.]
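Conceptually, this alignment is a single indexing operation. A minimal sketch, assuming 80 samples per frame:

import numpy as np

frame_feats = np.random.randn(3, 61).astype(np.float32)  # (num_frames, contextDim)
index = np.repeat(np.arange(3), 80)                      # sample-level input index
sample_feats = frame_feats[index]                        # (240, 61), sample level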
Elements of WaveNet: WaveNet core block
• Look at the code in currennt_lib/src/layers/wavNetCore.cu:
381  template <typename TDevice>
382  void WavNetCore<TDevice>::loadSequences(const data_sets::DataSetFraction &fraction,
383                                          const int nnState)
384  {
…
392      // load the input index to m_contextRawBuf
393      thrust::copy(fraction.inputs().begin(), fraction.inputs().end(),
                      m_contextRawBuf.begin());
…
     // The textual/acoustic features are loaded into m_contextRawBuf too;
     // however, they are stored from m_contextRawBuf[PTR_SHIFT] onwards.
     // PTR_SHIFT is decided by the maximum utterance length and the parallel size.
405      thrust::copy(fraction.exInputData().begin(), fraction.exInputData().end(),
406                   (m_contextRawBuf.begin() +
407                    this->maxSeqLength() * this->parallelSequences()));
Elements of WaveNet: WaveNet core block
• Look at the code in currennt_lib/src/layers/wavNetCore.cu:
341  template <typename TDevice>
342  void WavNetCore<TDevice>::__loadContextBuff()
343  {
…
347      int dataPos = this->maxSeqLength() * this->parallelSequences();
…
353      // load the acoustic/textual features into the sample-level buffer m_contextBuf
354      {{
355          internal::loadLinguisticFeature fn1;
356          fn1.featureDim = m_contextDim;                // dimension of the features
357          fn1.paralNum = this->parallelSequences();     // parallel size
358          fn1.maxFeatureLength = m_contextCurMaxLength; // maximum length (sample level)
359          fn1.sourceData = helpers::getRawPointer(m_contextRawBuf) + dataPos;
                                                           // start of the frame-level features
360          fn1.frameIndex = helpers::getRawPointer(m_contextRawBuf);
                                                           // start of the frame index
…
362          // load the mean and std for feature normalization
363          fn1.contextMV = ((m_contextMV.size() == m_contextDim * 2) ?
364                           helpers::getRawPointer(m_contextMV) : NULL);
…
367          thrust::for_each(…                            // execute the loading process
374                           fn1);
375      }}
Elements of WaveNet: WaveNet core block
• Note: to reduce the loading and memory-allocation overhead, only B1L1 loads the textual/acoustic features and duplicates them to the sample level
• The other blocks just read the buffer m_contextBuf of B1L1. They don't need "contextMV", but "contextDim" must be provided
[Figure: B1L1 loads and duplicates the condition features; the later blocks share its buffer.]
• Please check currennt_lib/src/layers/wavNetCore.cu for the detailed implementation
Elements of WaveNet: WaveNet core block
• Note: the wavnetc layer transforms the dimension of the condition features before they are added to the input waveform features
• This transformation is conducted inside wavnetc
[Figure: inside wavnetc, an internal feedforward (ff) transform adjusts the condition-feature dimension before the addition.]
Please check currennt_lib/src/layers/wavNetCore.cu:

void WavNetCore<TDevice>::computeForwardPass(const int nnState)
{
    …
    // Step1. transform the linguistic context
    …
}
Elements of WaveNet: WaveNet core block
• Also note: the output size of the dilated 1-D CNN must be equal to the size of wavnetc * 2
[Figure: the 128-dim dilated CNN output feeds the tanh and sigmoid gates around the 64-dim wavnetc layer.]
{ "size": 128, "name": "diluteB1L1cnn", "bias": 1.0, "type": "cnn", "window_width": "128*1", "window_tap_interval": "128*1", "causal": 1, "outputTanh": 0 }, { "size": 64, "name": "diluteB1L1wavc", "bias": 1.0, "type": "wavnetc", "contextDim": 61, "contextMV": "" },
Elements of WaveNet: WaveNet core block
• Note: the textual/acoustic features must be provided as external data
• The path to the directory, the feature dimensions, and other configurations must be given in config.cfg and config_syn.cfg
• Please check config.cfg and config_syn.cfg:

# Conditional acoustic features (at the frame level).
# Multiple input features will be concatenated into the acoustic feature vector.
# Here I use the MGC and quantized F0 as the conditional features.
# Directory of each kind of feature, separated by ','
ExtInputDirs = ../RAWDATA,../RAWDATA
# File extension of each kind of feature, separated by ','
# ExtInputExts = .mgc,.lf0_dis_class
# Dimension of each kind of feature, separated by '_'
ExtInputDims = 60_1
Elements of WaveNet: post-processing blocks
• The post-processing block merges the features generated by the WaveNet blocks
[Figure: the summed skip outputs of all WaveNet blocks pass through 1-D CNNs and a softmax to produce the waveform distribution.]
Elements of WaveNet: post-processing blocks
• In network.jsn2:
{ "size": 256, "name": "postprocessingAdd","bias": 1.000000, "type": "skipadd", "preSkipLayer": "diluteB1L1skip,diluteB1L2skip,diluteB1L3skip,diluteB1L4skip,diluteB1L5skip,diluteB1L6skip,diluteB1L7skip,diluteB1L8skip,diluteB1L9skip,diluteB1L10skip"},{"size": 256,"name": "postprocessingL1","bias": 1.0,"type": "feedforward_tanh"},{"size": 256,"name": "output","bias": 1.0,"type": "feedforward_identity"},{"size": 1,"name": "postoutput","type": "mdn"}
1. "preSkipLayer" is the list of skipini/skipadd/skipcat layers whose features will be summed up
2. Of course, these layers must have the same size
Elements of WaveNet: post-processing blocks
[Figure: the full layer chain of blocks B1L1 and B1L2, followed by postprocessingAdd (skipadd over all diluteB*L*skip layers), postprocessingL1 (feedforward_tanh), and output (feedforward_identity).]
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Multiple networks with different time resolutions
• We may want to process the textual/acoustic features using a bi-directional RNN, but this RNN should work at the frame level
• This is implemented using the time "resolution" option
• Please check network.jsn; the WaveNet part is the same as in network.jsn2
[Figure: a frame-level sub-network processes the textual/acoustic features and conditions the sample-level WaveNet blocks.]
Multiple networks with different time resolutions
• network.jsn uses 5 layers to handle the textual/acoustic data:
{ "size": 61, "name": "exInputL1", "type": "externalloader",
... }, { "size": 61, "name": "exInputSkip",
... }, { "size": 64, "name": "exInputL2", "type": "blstm",
... },
{ "size": 60, "name": "exInputL3", "type": "cnn",
... }, { "size": 61, "name": "exInputAdd", "type": "skipcat",
... },
[Figure: exInputL1 (externalloader) → exInputSkip (skipini) → exInputL2 (blstm) → exInputL3 (cnn) → exInputAdd (skipcat), fed by the input index and the textual/acoustic features.]
Multiple networks with different time resolutions
• exInputL1 loads the sample-level input index and the textual/acoustic features
• "externalDataMV" can be used by exInputL1 for feature normalization
• The path, dimensions, and other options of the textual/acoustic features are given in config.cfg and config_syn.cfg
{ "size": 61, "name": "exInputL1",
"type": "externalloader", "bias": 1.0,
"externalDataMV": "./data_mgcf0.mv.bin", "resolution": 80 },
Multiple networks with different time resolutions
• But what is the "resolution"?
• Remember that the input index is at the sample level while the sub-network processes frame-level features. This sub-network works at a slower tempo
• "resolution" indicates the relative tempo of the sub-network
• In this case, each frame is 5 ms while a waveform sample is (1/16) ms; thus, the resolution is 5 / (1/16) = 5 * 16 = 80
• Note: "resolutions" should also be given in config.cfg and config_syn.cfg (this is for a future implementation where multiple time resolutions can be defined)
{ "size": 61, "name": "exInputL1",
"type": "externalloader", "bias": 1.0,
"externalDataMV": "./data_mgcf0.mv.bin", "resolution": 80 },
Multiple networks with different time resolutions
• "resolution" has two effects:
  - It tells the network to allocate memory in terms of the number of frames, not waveform sampling points
  - It tells exInputL1 to load the textual/acoustic features at the frame level (not at the sample level as wavNetCore does)
[Figure: with resolution = 80, the externalloader maps the sample-level input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 to the frame indices 0, 1, 2 and loads the features at the frame level.]
Multiple networks with different time resolutions
• Of course, a more efficient way would be to directly copy the external frame-level features into the buffer of the externalloader
• "resolution" is implemented as a more flexible tool to load the data
• "resolution" must be provided for all the layers in the sub-network (for memory allocation)
• Note: to change the time resolution, we also need to change patTypes(), m_curSeqLength, etc. See void Layer<TDevice>::loadSequences in currennt_lib/src/layers/Layer.cu for more details
Multiple networks with different time resolutions
• Based on "resolution", we can use any type of network to process the textual/acoustic features at the frame level
• In network.jsn, I used a skipcat layer to concatenate the output of the CNN and the F0. The motivation is to use the original F0 as an input to the WaveNet
Multiple networks with different time resolutions
• Finally, note the "layerFlag" below:
• This layerFlag tells the first WaveNet block to use the output of this sub-network instead of loading external data directly
• "contextMV" is unnecessary for the WaveNet blocks, since they take the output of the sub-network as input
{"size": 61,"name": "exInputAdd","bias": 1.0,"type": "skipcat","resolution": 80,"preSkipLayer": "exInputL3,exInputSkip","preSkipLayerDim": "0_60_60_61","layerFlag": "wavenetConditionInputLayer"},
[Figure: the sub-network output replaces the externally loaded condition features of the first WaveNet block.]
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Time & memory consumption: during training
• Each layer allocates memory buffers (outputs, gradients, …)
• The buffer size depends on the maximum waveform length
• Use truncate_seq in config.cfg if your GPU memory is small
[Figure: each layer of a WaveNet block allocates a buffer of size (maximum waveform length T) x (layer dimension).]
Time & memory consumption: during testing
• "truncate_seq" must not be used, because the sampling points of one waveform must be generated with the correct causal dependency
• Method 1: allocate the memory for the whole waveform length
  - Disadvantage: prohibitive GPU memory requirement (> 10 GB for generating 1 s of waveform if the network is not small)
  - Advantage: intermediate results can be saved for each time step
• Method 2: allocate the memory only for the required dependency
[Figure: generation over time steps 0 … 11 under the two allocation methods.]
Time & memory consumption: during testing
• Check the dependency of each layer:
  - Feedforward/skipini/skipcat …: the current time step
  - Dilated CNN: the current step t and the previous step t - R, where R is the dilation size
• Thus:
  - Feedforward …: allocate memory for one time step
  - Dilated CNN: allocate memory for multiple time steps
• The implementation is somewhat complicated. Please debug waveform generation starting from currennt_lib/src/NeuralNetwork.cpp
• Each layer type has a method called reduceOutputBuffer to reduce the memory allocation
• Each layer type changes the pointer used to access the data buffer
679  // for wavenet, reduce the memory in generation
680  if (flagSaveMemWavNet){
681      // only save the memory for layers between the feedback and output layer
682      if (counter < outputLayerIdx && counter > m_firstFeedBackLayer){
683          if (counter != Configuration::instance().outputFromWhichLayer())
684              layer->reduceOutputBuffer();
685      }
686  }
Time & memory consumption: layers with no causal dependency
• For example, a feedforward layer:
• Memory buffer without reduction: a buffer of length T and dimension D. Since the memory is column-major, for time t the pointer should point to t * D
• Memory buffer after reduction: a single D-dim buffer that every t reuses. The original pointer position is pos = t * D; to correct it for the reduced buffer, let pos = pos - shift, where shift = pos
• Why not directly set pos = 0? For flexibility: shift may be 0 if the previous layer is not reduced in memory
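A minimal sketch of this pointer arithmetic:

D = 64                     # layer size
t = 1000                   # current time step

pos = t * D                # position in the full (unreduced) buffer
shift = pos                # shift = pos for a reduced layer, 0 for an unreduced one
pos_reduced = pos - shift  # = 0 here: every time step reuses the same D-dim buffer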
Time & memory consumption: layers with no causal dependency
• For example, a feedforward layer (currennt_lib/src/layers/FeedforwardLayer.cu):

// Note: in the testing phase, the network uses computeForwardPass(timeStep, nnState)
// and only generates the output feature for the current "timeStep".
// This is not the computeForwardPass(nnState) used for training.
554  template <typename TDevice, typename TActFn>
555  void FeedForwardLayer<TDevice, TActFn>::computeForwardPass(const int timeStep,
556                                                             const int nnState)
557  {
…
562      int effTimeStart = timeStep * this->parallelSequences();     // start of current time step
563      int effTimeEnd   = (timeStep+1) * this->parallelSequences(); // end of current time step
564
565      // pointer to the input buffer (the output buffer of the previous layer)
566      int shiftIn = this->precedingLayer().outputBufPtrBias(
567                        timeStep * this->parallelSequences(), nnState);
568      // pointer to the output of this layer
569      int shiftOut = this->outputBufPtrBias(timeStep * this->parallelSequences(), nnState);
…
577      helpers::Matrix<TDevice> plOutputsMatrix(&this->precedingLayer().outputs(),
578                                               this->precedingLayer().size(),
579                                               this->parallelSequences(),
580                                               (effTimeStart * this->precedingLayer().size()
581                                                - shiftIn));
// effTimeStart * this->precedingLayer().size() is the input-buffer position without memory
// reduction, and shiftIn is the shift size. Because this layer doesn't know whether the
// previous layer is reduced in memory or not, it must use outputBufPtrBias() to get the shift.
Time & memory consumption: dilated CNN layers
• Naive implementation 1 (the very naive one):
  - At each time step t, the CNN layer only requires the output of the previous layer at times t and t - R, where R is the dilation size
  - The memory required for each layer is just 2 time steps
  - For example, for t = 4, Layer 1 must allocate memory to store the results at t = 2 and t = 4:
[Figure: three stacked layers (Layer 0, Layer 1, Layer 2) with dilation 2; the numbers denote time indices.]
1. Layer 0 calculates for t = 1 and t = 2
2. Layer 1 calculates for t = 2 and stores it
3. Layer 0 calculates for t = 3 and t = 4
4. Layer 1 calculates for t = 4 and stores it
5. Layer 2 calculates for t = 4
Time & memory consumption: dilated CNN layers
• Naive implementation 1 (the very naive one):
  - But this involves heavy, duplicated computation
  - The duplicated computation is indicated by the colors of the arrows in the figure
  - What can be done? Save the intermediate results!
[Figure: the same three-layer example; colored arrows mark the computations repeated across time steps.]
• The Fast WaveNet paper says that the public TensorFlow-based WaveNet code uses this naive implementation. Well, do TensorFlow users only know how to draw the network blocks? Perhaps so, if they want to get things done only by drawing network blocks
Time & memory consumption: dilated CNN layers
• Naive implementation 2 (save all the intermediate results):
  - By default, CURRENNT uses this implementation
  - Larger memory consumption (as mentioned in the training discussion above)
[Figure: every layer stores its outputs for all past time steps.]
• Therefore, the duplicated computation criticized in the Fast WaveNet paper is not really a problem for the naive CURRENNT implementation; it is the combination of duplicated computation and memory consumption that matters
[Figure: the Fast WaveNet scheme; each layer keeps a small queue of intermediate results instead of the full history.]
Time & memory consumption: dilated CNN layers
• Fast WaveNet implementation (save only the necessary intermediate results):
  - The memory space for the intermediate results is, in the Fast WaveNet paper, a queue
  - This idea avoids the duplicated computation
  - It is also memory-efficient: the required memory size only depends on the dilation size
Time & memory consumption: dilated CNN layers
• CURRENNT implementation (also saves only the necessary intermediate results):
  - Memory space = (dilation + 1) * feature_dimension
  - Difference from the Fast WaveNet implementation: no need for a fancy queue as the data buffer
  - Computation in a dilated CNN layer:

    initialize the buffer with 0
    for t = 0:T
        calculate memory address ptr1 = [t % (dilation_size + 1)]
        calculate memory address ptr2 = [(t + 1) % (dilation_size + 1)]
        transform the current output of the previous layer (using the CNN filters)
            and store it at ptr1
        calculate the output using the data at ptr1 and ptr2

  - Note that [(t + 1) % (dilation_size + 1)] = [(t - dilation_size) % (dilation_size + 1)]
  - Thus, ptr2 is just the address of t - dilation_size in a circular buffer (although the buffer is just a plain buffer)
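A minimal Python sketch of this addressing scheme; transform and combine stand in for the CNN filter transformation and the summation step:

import numpy as np

dilation, dim = 2, 4
buf = np.zeros((dilation + 1, dim))       # plain buffer, used circularly

def step(t, x_t, transform, combine):
    ptr1 = t % (dilation + 1)             # slot for the current (transformed) input
    ptr2 = (t + 1) % (dilation + 1)       # == (t - dilation) % (dilation + 1)
    buf[ptr1] = transform(x_t)            # store the transformed current input
    return combine(buf[ptr1], buf[ptr2])  # combine time t and time t - dilation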
[Figure: positions ptr1 and ptr2 in the buffer for the transformed input i_t and the output o_t.]
time | Naive implementation 2         | CURRENNT implementation
-----+--------------------------------+------------------------------------------------
  0  | store outputs for t = 0        | ptr1 = 0 % (2+1) = 0; ptr2 = (0+1) % (2+1) = 1;
     |                                | store at ptr1; sum data from ptr1 and ptr2
  1  | store outputs for t = 0, 1     | ptr1 = 1 % (2+1) = 1; ptr2 = (1+1) % (2+1) = 2;
     |                                | store at ptr1; sum data from ptr1 and ptr2
  2  | store outputs for t = 0, 1, 2  | ptr1 = 2 % (2+1) = 2; ptr2 = (2+1) % (2+1) = 0;
     |                                | store at ptr1; sum data from ptr1 and ptr2
Time & memory consumption: dilated CNN layers
• CURRENNT implementation, example for dilation = 2:
[Figure: buffer contents for t = 0, 1, 2 with dilation = 2. The transformed inputs i0, i1, i2 fill the three slots one by one, and each output o_t is computed from the slots at ptr1 and ptr2.]
Time & memory consumption: dilated CNN layers
• CURRENNT implementation, example for dilation = 2 (continued)
• The implementation is simple, fast, and memory-friendly
time | Naive implementation 2           | CURRENNT implementation
-----+----------------------------------+------------------------------------------------
  3  | store outputs for t = 0, ..., 3  | ptr1 = 3 % (2+1) = 0; ptr2 = (3+1) % (2+1) = 1;
     |                                  | store at ptr1 (overwriting the buffer);
     |                                  | sum data from ptr1 and ptr2
  4  | store outputs for t = 0, ..., 4  | ptr1 = 4 % (2+1) = 1; ptr2 = (4+1) % (2+1) = 2;
     |                                  | store at ptr1 (overwriting the buffer);
     |                                  | sum data from ptr1 and ptr2
[Figure: at t = 3 the slot holding i0 is overwritten by i3, and at t = 4 the slot holding i1 is overwritten by i4; the buffer always holds the most recent (dilation + 1) transformed inputs.]
Time & memory consumption: dilated CNN layers
• More details about the computation:
  - Suppose this CNN layer has 2 output channels (so the output feature dimension is 2 and the number of CNN filters is 2)
  - Suppose the dimension of the previous layer's output is 3
  - See CNNLayer<TDevice>::computeForwardPass(const int timeStep, const int nnState) for the two steps shown below
  - Please check the slides on CURRENNT_CNN for the CNN implementation
[Figure: step 1 (matrix transformation): the current input i4 is transformed by the causal CNN filters and stored in the buffer; step 2 (data summation): the transformed parts for the current and the delayed input are summed, e.g., [a, b] + [c, d] = [a + c, b + d], giving the output o4.]
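A minimal numpy sketch of the two steps for the slide's example (2 output channels, previous-layer dimension 3); the exact layout of the filter taps is an assumption here:

import numpy as np

W_cur  = np.random.randn(2, 3)  # filter tap applied to the current input
W_past = np.random.randn(2, 3)  # filter tap applied to the input at t - dilation

i_t = np.random.randn(3)

# Step 1: matrix transformation; the result is stored in the buffer at ptr1
cur_part = W_cur @ i_t          # [a, b] in the slide's notation

# The transform of the input at t - dilation was stored when that time step
# was processed and is read back from the buffer at ptr2 (stand-in values here):
past_part = np.random.randn(2)  # [c, d]

# Step 2: data summation
o_t = cur_part + past_part      # [a + c, b + d]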
Finally
• We have trained a WaveNet-based vocoder with 40 blocks. Please check https://github.com/TonyWangX/CURRENNT_Recipes/tree/master/temp/compareTTS/WAVENET
• The waveform generated given the natural acoustic features is very close to the natural speech
• The waveform generated given the generated acoustic features achieves a score similar to copy-synthesis speech, although the speaker similarity degrades
• So, please try the WaveNet on your own data
• Email me if you have any questions