CURRENNT WaveNet Implementation

Xin WANG, National Institute of Informatics, Japan
2017-11-03
Contact: [email protected]. Questions, suggestions, and discussion are welcome.
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Before starting
• Please check the slides on the basics of CURRENNT
• Please understand how to debug with CURRENNT
• Please set up CURRENNT_SCRIPT and run the example CONFIGPOOL/config_wavenet.pm
• After running the example, please cd into the directory of the WaveNet example:
~$ cd CURRENNT_DIS/EXAMPLE/MODEL_WAVENET
~$ pwd
…/CURRENNT_DIS/EXAMPLE/MODEL_WAVENET
~$ ls
README          createMDNConfig.py  mdn.config    output_trained_network_mdn1.000000
config.cfg      data_mgcf0.mv.bin   network.jsn   trained_network.jsn
config_syn.cfg  log_train           network.jsn2
General
• This is the WaveNet structure used for the experiments
• The structure is slightly different from the literature; I will explain the differences later
[Figure: overall structure. The time-shifted waveform passes through a 1-D CNN and a stack of WaveNet blocks (dilated 1-D CNN, tanh/sigmoid gating, 1-D CNN, skip-add), each conditioned on textual/acoustic features from a sub-network; the summed skip outputs go through 1-D CNNs and a softmax that predicts the waveform.]
Elements of WaveNet
• First, consider the case without the sub-network for textual/acoustic features (this is network.jsn2 in MODEL_WAVENET)
[Figure: the same structure without the sub-network; the textual/acoustic features are fed to each block directly.]
Elements of WaveNet: "input" layers
• Let's look at the following layers in network.jsn2:
"layers": [
    { "size": 1,   "name": "input",   "type": "input" },
        // the input layer size is 1, but the waveform is not loaded here
    { "size": 256, "name": "feedbackBottom", "bias": 1.0, "type": "feedback",
      "previousDimEnd": 0, "previousDimStart": 0 },
        // the waveform is loaded here as input
    { "size": 64,  "name": "causalEmbedding", "bias": 1.0,
      "type": "feedforward_identity" },
        // feeds the waveform one-hot vector into the feedforward layer
[Figure: input (index) → feedback (waveform) → 1-D CNN, as implemented in network.jsn2.]
Elements of WaveNet: input layer in network.jsn2
• This layer loads a sequence of indices
• Waveform data are at the sample level, but the acoustic/textual features are at the frame level
• The input index simply tells the network which frame each waveform sampling point belongs to
{ "size": 1, // The input layer size is 1 "name": "input", // But waveform is not loaded here "type": "input" },
[Figure: the sample-level input index links each waveform sample to a frame of the frame-level textual/acoustic features fed to the WaveNet.]
Elements of WaveNet: input layer in network.jsn2
• In MODEL_WAVENET, the waveform is 16 kHz mu-law while the frame shift is 5 ms
• Therefore, every frame corresponds to 16 * 5 = 80 waveform sampling points
[Figure: sample-level input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 (80 numbers per frame), aligned with the frame-level acoustic/textual features Frame 0, Frame 1, Frame 2.]
Elements of WaveNet: input layer in network.jsn2
• Let's take a look at the input index
• ***.labindx in RAWDATA is the input data that will be packaged into data.nc*
• The waveform ***.raw is packaged as the output data
• Please check CURRENNT_DIS/EXAMPLE/DATA_WAVENET
# Please use Python and load the pyTools
>> from ioTools import readwrite
>> cd CURRENNT_DIS/EXAMPLE/RAWDATA
>> data = readwrite.read_raw_mat('BC2011_nancy_NYT096-008-00.labindx', 1)
>> data[0:80]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0 … 0.], dtype=float32)
# data[0:80] is zero
>> data[80:160]
array([ 1., … 1.], dtype=float32)
# data[80:160] = 1
>> data[::80]
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., … 943., 944., 945.], dtype=float32)
# data is just a sequence of indices, where data[i] = i // 80
# data.shape = (75680,), the same length as the waveform BC2011_nancy_NYT096-008-00.raw
Elements of WaveNet: input layer in network.jsn2
• Take a look at the CURRENNT code:
• If you use a different sampling rate and frame shift, please generate the correct input index (a minimal sketch follows the debugging session below)
~$ gdb --args currennt_debug --options_file config.cfg --cuda off
(gdb) b InputLayer.cu:157
(gdb) r
157    thrust::copy(fraction.inputs().begin(), fraction.inputs().end(),
(gdb) p fraction.inputs()[0]       // fraction loads input from data.nc
$1 = (float &) @0x7fffd4001660: 0  // fraction.inputs() stores the input index
(gdb) p fraction.inputs()[80]
$2 = (float &) @0x7fffd40017a0: 1  // the index increases every 80 sampling points
(gdb) p fraction.inputs()[160]
$3 = (float &) @0x7fffd40018e0: 2  // they will be copied to this->outputs()
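For a different sampling rate or frame shift, the index can be generated offline. Below is a minimal sketch, assuming the pyTools read/write helpers and that sample i simply belongs to frame i // frame_length; the file name output.labindx is only illustrative:

import numpy as np
from ioTools import readwrite

sampling_rate = 16000                                   # Hz
frame_shift_ms = 5                                      # ms
frame_length = sampling_rate * frame_shift_ms // 1000   # 80 samples per frame

num_samples = 75680                                     # length of the target waveform
index = (np.arange(num_samples) // frame_length).astype(np.float32)
readwrite.write_raw_mat(index, 'output.labindx')        # hypothetical file name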
Elements of WaveNet: feedback layer in network.jsn2
• Because the waveform is not stored as input data, it must be retrieved from the output buffer. This is implemented by the feedback layer
• Usually, a feedback layer concatenates the output of the previous layer with the feedback data. Here, it is unnecessary to load the output of the previous layer (the sequence of indices)
• Thus previousDimEnd = 0 and previousDimStart = 0 tell this layer not to load the output of the previous layer
    {                           // the example waveform is 8-bit mu-law, thus each
      "size": 256,              // waveform point is a one-hot vector of 256 dims
      "name": "feedbackBottom",
      "bias": 1.0,
      "type": "feedback",
      "previousDimEnd": 0,      // don't load input from the previous (input) layer
      "previousDimStart": 0 },
[Figure: the feedback layer retrieves the waveform from the output buffer.]
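For reference, here is a minimal sketch of standard 8-bit mu-law companding (mu = 255), which maps each sample to one of the 256 classes represented by these one-hot vectors; the exact preprocessing of the example data may differ:

import numpy as np

def mulaw_encode(x, mu=255):
    # x: float waveform in [-1, 1]; returns integer classes in [0, mu]
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mulaw_decode(c, mu=255):
    # inverse mapping from classes back to waveform values in [-1, 1]
    y = 2.0 * c.astype(np.float64) / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu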
Elements of WaveNet: feedback layer in network.jsn2
• Let's debug
• Check the functor internal::vectorFillForward in currennt_lib/src/layers/FeedBackLayer.cu
(gdb) b FeedBackLayer.cu:606
606    thrust::fill(this->outputs().begin(), this->outputs().end(), 0.0);
// The code after this line is the place where the feedback data are loaded.
// This implementation is based on the functor internal::vectorFillForward.
// You can debug through this part step by step. For example:
(gdb) b 626
(gdb) c
626    fn.input2 = helpers::getRawPointer(m_targetLayer->feedbackOutputs(true));
(gdb) p m_targetLayer->name()  // m_targetLayer is the layer from which the waveform is fed back
$4 = "postoutput"              // in this case, it is the last postoutput layer
105    if (lookBack != NULL)
106        lookBackTime = lookBack[dimIdx / dimInput2Valid] * parallel;
107    else
108        lookBackTime = 1;  // by default, the feedback data are shifted by 1 time
109                           // unit, so no explicit time-shift operation is needed
110
111    dimIdx = dimIdx % dimInput2Valid;
112
113    if (timeStep < lookBackTime)
114        output[outputIdx] = 0.0;
115    else {
116        output[outputIdx] = input2[(timeStep - lookBackTime) * dimInput2Total +
117                                   dimIdx + dimInput2Start];
Elements of WaveNet: feedback layer in network.jsn2
• This feedback layer shifts the waveform by one step:
[Figure: the original waveform in m_targetLayer->feedbackOutputs(true) is shifted right by one step, with a zero prepended, to form the feedback waveform in this->outputs().]
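In effect, the functor computes output[t] = input2[t - lookBackTime], with zeros for t < lookBackTime. A minimal numpy sketch of the default case (lookBackTime = 1):

import numpy as np

waveform = np.array([3, 1, 4, 1, 5])  # e.g., mu-law class indices
feedback = np.zeros_like(waveform)
feedback[1:] = waveform[:-1]          # feedback[t] = waveform[t - 1], feedback[0] = 0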
Elements of WaveNet: 1-D CNN layers in network.jsn2
• The shifted waveform one-hot vectors then go through the 1-D CNN and become embedded vectors
• This 1-D CNN is implemented by a simple feedforward layer
[Figure: the feedback waveform one-hot vectors are mapped by the 1-D CNN (feedforward layer) to dense embedded vectors.]
Elements of WaveNet: summary of the "input layers"
• The input layer loads the input index
• The feedback layer retrieves and shifts the waveform
• The 1-D CNN converts the waveform one-hot vectors into embedded vectors
[Figure: input (index) → feedback (time-shifted waveform) → 1-D CNN, as implemented in network.jsn2.]
Elements of WaveNet: WaveNet block
• Why not the alternative block structure (shown on the right of the original slide)?
• Both work (as the experiments showed)
• The alternative structure forces the 1-D CNN to use the flow from the highway pass
[Figure: the block structure used here (left) versus the alternative (right), both built from a dilated 1-D CNN, tanh/sigmoid gating, a 1-D CNN, and skip-add connections.]
• I thought the structure used here could be easier to analyze
Elements of WaveNet: WaveNet block
• network.jsn2 contains 10 WaveNet blocks. Their CNN dilation sizes are {1, 2, 4, 8, 16, 32, 64, 128, 256, 512}
• These 10 blocks form one group
• The layers of one WaveNet block are named diluteB*L***:
~$ grep name network.jsn2
"name": "input",
"name": "feedbackBottom",
"name": "causalEmbedding",
"name": "causalSkip",
"name": "diluteB1L1cnn",
"name": "diluteB1L1wavc",
"name": "diluteB1L1out",
"name": "diluteB1L1skipadd",
"name": "diluteB1L1temp",
"name": "diluteB1L1skipouttrans",
"name": "diluteB1L1skip",
"name": "diluteB1L1temp2",
"name": "diluteB1L2cnn",
"name": "diluteB1L2wavc",
"name": "diluteB1L2out",
"name": "diluteB1L2skipadd",
"name": "diluteB1L2temp",
"name": "diluteB1L2skipouttrans",
"name": "diluteB1L2skip",
"name": "diluteB1L2temp2",
"name": "diluteB1L3cnn",
…
• input, feedbackBottom, causalEmbedding: the input layers
• diluteB1L1*: 1st block in the 1st group (causalSkip is a special layer for B1L1)
• diluteB1L2*: 2nd block in the 1st group
• B: group index; L: block index within the group
Elements of WaveNet: WaveNet block
• Block B1L1:
{ "size": 64, "name": "causalSkip", "bias": 1.0, "type": "skipini" }, { "size": 128, "name": "diluteB1L1cnn", "bias": 1.0, "type": "cnn", "window_width": "128*1", "window_tap_interval": "128*1", "causal": 1, "outputTanh": 0 }, { "size": 64, "name": "diluteB1L1wavc", "bias": 1.0, "type": "wavnetc", "contextDim": 61, "contextMV": "./data_mgcf0.mv.bin" }, { "size": 64, "name": "diluteB1L1out", "bias": 1.0,
"type": "feedforward_identity" }, { "size": 64, "name": "diluteB1L1skipadd", "bias": 1.000000, "type": "skipadd",
"preSkipLayer": "causalSkip,diluteB1L1out" },
[Figure: block B1L1 annotated: causalSkip is the starting point of the skip-add connection; diluteB1L1cnn is the dilated CNN; diluteB1L1wavc is the WaveNet core; diluteB1L1out is the feedforward (1-D CNN); diluteB1L1skipadd is the skip-add.]
Elements of WaveNet: WaveNet block
• Block B1L1
• CURRENNT is not very flexible: if a layer's output is to be used by multiple layers, "skipini" (and "skipadd") must be used to provide multiple output pipes
{ "size": 64, "name": "diluteB1L1temp", "bias": 1.0, "type": "skipini" }, { "size": 256,
"name": "diluteB1L1skipouttrans", "bias": 1.0,
"type": "feedforward_identity" }, { "size": 256, "name": "diluteB1L1skip", "bias": 1.0, "type": "skipini" }, { "size": 64, "name": "diluteB1L1temp2", "bias": 1.0, "type": "skipadd",
"preSkipLayer": "diluteB1L1temp" },
[Figure: the positions of diluteB1L1skip and diluteB1L1temp2 in the block diagram.]
Elements of WaveNet: WaveNet block
• In the implementation, one WaveNet block consists of 9 layers:

"name": "causalSkip",
"name": "diluteB1L1cnn",
"name": "diluteB1L1wavc",
"name": "diluteB1L1out",
"name": "diluteB1L1skipadd",
"name": "diluteB1L1temp",
"name": "diluteB1L1skipouttrans",
"name": "diluteB1L1skip",
"name": "diluteB1L1temp2",
[Figure: WaveNet block B1L1 mapped onto these layers: skipini (causalSkip), dilated 1-D CNN (diluteB1L1cnn), wavnetc (diluteB1L1wavc), 1-D CNN (diluteB1L1out), skipadd (diluteB1L1skipadd), skipini (diluteB1L1temp), 1-D CNN (diluteB1L1skipouttrans), skipini (diluteB1L1skip), skipadd (diluteB1L1temp2).]
Elements of WaveNet: WaveNet block
• Why so complicated? Because CURRENNT only supports a linear chain of network layers
• Bipartite or multi-partite structures must be implemented using skip connections (skipini/skipadd)
[Figure: a branching structure L0 → {L1, L2} is realized as the linear chain L0 (skipini) → L1 → L2 (skipadd); within the block, the skipini/skipadd layers route data to the next WaveNet block and to the post-processing block.]
• This principle is explained in the slides on CURRENNT_HIGHWAY
[Figure: blocks B1L1 and B1L2 chained; the skipadd output (diluteB1L1temp2) of one block feeds the dilated CNN of the next.]
• Here, between blocks, there is no need to add another skipini layer
Elements of WaveNet: WaveNet core block
• The condition features (textual or acoustic) are loaded by the 'wavnetc' layer
• This layer normalizes the condition features if contextMV is provided
{"size": 64,"name": "diluteB1L1wavc","bias": 1.0,"type": "wavnetc","contextDim": 61, // dimension of the condition features"contextMV": "./data_mgcf0.mv.bin”},
// In the example, I use MGC (60 dims) and F0 (1 dim) as the condition.
// data_mgcf0.mv.bin is a binary vector [mean_MGC, mean_F0, std_MGC, std_F0];
// thus, the length of the vector is contextDim * 2.
// The mv.bin can be read and written using the pyTools:
>> from ioTools import readwrite
>> datamv = readwrite.read_raw_mat('./data_mgcf0.mv.bin', 1)
>> datamv.shape
(122,)
[Figure: the wavnetc layer injects the textual/acoustic features into the gated dilated CNN block.]
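A minimal sketch of producing such an mv.bin, assuming the frame-level condition features have been stacked into one float32 matrix; the file name all_frames.bin is only illustrative:

import numpy as np
from ioTools import readwrite

dim = 61                                              # contextDim: 60 MGC + 1 F0
feat = readwrite.read_raw_mat('all_frames.bin', dim)  # (num_frames, 61), hypothetical
mv = np.concatenate([feat.mean(axis=0), feat.std(axis=0)]).astype(np.float32)
readwrite.write_raw_mat(mv, './data_mgcf0.mv.bin')    # length = contextDim * 2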
Elements of WaveNet: WaveNet core block
• Remember that the condition acoustic features are frame-level features while the network works at the sample level
• Here, the input index is used for this alignment
[Figure: the input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 maps each frame-level feature vector to its sampling points, so wavnetc obtains features at the sample level.]
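Conceptually, this alignment is a single indexing operation. A minimal sketch, assuming 80 samples per frame:

import numpy as np

frame_feats = np.random.randn(3, 61).astype(np.float32)  # (num_frames, contextDim)
index = np.repeat(np.arange(3), 80)                      # sample-level input index
sample_feats = frame_feats[index]                        # (240, 61), sample level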
Elements of WaveNet: WaveNet core block
• Look at the code in currennt_lib/src/layers/wavNetCore.cu:
381  template <typename TDevice>
382  void WavNetCore<TDevice>::loadSequences(const data_sets::DataSetFraction &fraction,
383                                          const int nnState)
384  {
…
392      // load the input index to m_contextRawBuf
393      thrust::copy(fraction.inputs().begin(), fraction.inputs().end(),
                      m_contextRawBuf.begin());
…
     // The textual/acoustic features are loaded into m_contextRawBuf too;
     // however, they are stored from m_contextRawBuf[PTR_SHIFT] onwards.
     // PTR_SHIFT is decided by the maximum utterance length and the parallel size.
405      thrust::copy(fraction.exInputData().begin(), fraction.exInputData().end(),
406                   (m_contextRawBuf.begin() +
407                    this->maxSeqLength() * this->parallelSequences()));
Elements of WaveNet: WaveNet core block
• Look at the code in currennt_lib/src/layers/wavNetCore.cu:
341  template <typename TDevice>
342  void WavNetCore<TDevice>::__loadContextBuff()
343  {
…
347      int dataPos = this->maxSeqLength() * this->parallelSequences();
…
353      // load the acoustic/textual features into the sample-level buffer m_contextBuf
354      {{
355          internal::loadLinguisticFeature fn1;
356          fn1.featureDim = m_contextDim;                // dimension of the features
357          fn1.paralNum = this->parallelSequences();     // parallel size
358          fn1.maxFeatureLength = m_contextCurMaxLength; // maximum length (sample level)
359          fn1.sourceData = helpers::getRawPointer(m_contextRawBuf) + dataPos;
                                                           // start of the frame-level features
360          fn1.frameIndex = helpers::getRawPointer(m_contextRawBuf);
                                                           // start of the frame index
…
362          // load the mean and std for feature normalization
363          fn1.contextMV = ((m_contextMV.size() == m_contextDim * 2) ?
364                           helpers::getRawPointer(m_contextMV) : NULL);
…
367          thrust::for_each(…                            // execute the loading process
374                           fn1);
375      }}
Elements of WaveNet: WaveNet core block
• Note: to reduce the loading and memory-allocation overhead, only B1L1 loads the textual/acoustic features and duplicates them to the sample level
• The other blocks just read the buffer m_contextBuf of B1L1. They don't need "contextMV", but "contextDim" must be provided
[Figure: B1L1 loads and duplicates the condition features; the later blocks share its buffer.]
• Please check currennt_lib/src/layers/wavNetCore.cu for the detailed implementation
Elements of WaveNet: WaveNet core block
• Note: the wavnetc layer transforms the dimension of the condition features before they are added to the input waveform features
• This transformation is conducted inside wavnetc
[Figure: inside wavnetc, an internal feedforward (ff) transform adjusts the condition-feature dimension before the addition.]
Please check currennt_lib/src/layers/wavNetCore.cu:

void WavNetCore<TDevice>::computeForwardPass(const int nnState)
{
    …
    // Step1. transform the linguistic context
    …
}
Elements of WaveNet: WaveNet core block
• Also note: the output size of the dilated 1-D CNN must be equal to the size of wavnetc * 2
[Figure: the 128-dim dilated CNN output feeds the tanh and sigmoid gates around the 64-dim wavnetc layer.]
{ "size": 128, "name": "diluteB1L1cnn", "bias": 1.0, "type": "cnn", "window_width": "128*1", "window_tap_interval": "128*1", "causal": 1, "outputTanh": 0 }, { "size": 64, "name": "diluteB1L1wavc", "bias": 1.0, "type": "wavnetc", "contextDim": 61, "contextMV": "" },
Elements of WaveNet: WaveNet core block
• Note: the textual/acoustic features must be provided as external data
• The path to the directory, the feature dimensions, and other configurations must be given in config.cfg and config_syn.cfg
• Please check config.cfg and config_syn.cfg:

# Conditional acoustic features (at the frame level).
# Multiple input features will be concatenated into the acoustic feature vector.
# Here I use the MGC and quantized F0 as the conditional features.
# Directory of each kind of feature, separated by ','
ExtInputDirs = ../RAWDATA,../RAWDATA
# File extension of each kind of feature, separated by ','
# ExtInputExts = .mgc,.lf0_dis_class
# Dimension of each kind of feature, separated by '_'
ExtInputDims = 60_1
Elements of WaveNet: post-processing blocks
• The post-processing block merges the features generated by the WaveNet blocks
[Figure: the summed skip outputs of all WaveNet blocks pass through 1-D CNNs and a softmax to produce the waveform distribution.]
Elements of WaveNet: post-processing blocks
• In network.jsn2:
{ "size": 256, "name": "postprocessingAdd","bias": 1.000000, "type": "skipadd", "preSkipLayer": "diluteB1L1skip,diluteB1L2skip,diluteB1L3skip,diluteB1L4skip,diluteB1L5skip,diluteB1L6skip,diluteB1L7skip,diluteB1L8skip,diluteB1L9skip,diluteB1L10skip"},{"size": 256,"name": "postprocessingL1","bias": 1.0,"type": "feedforward_tanh"},{"size": 256,"name": "output","bias": 1.0,"type": "feedforward_identity"},{"size": 1,"name": "postoutput","type": "mdn"}
1. "preSkipLayer" is the list of skipini/skipadd/skipcat layers whose features will be summed up
2. Of course, these layers must have the same size
Elements of WaveNet: post-processing blocks
[Figure: the full layer chain of blocks B1L1 and B1L2, followed by postprocessingAdd (skipadd over all diluteB*L*skip layers), postprocessingL1 (feedforward_tanh), and output (feedforward_identity).]
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Multiple networks with different time resolutions
• We may want to process the textual/acoustic features using a bi-directional RNN, but this RNN should work at the frame level
• This is implemented using the time "resolution" option
• Please check network.jsn; the WaveNet part is the same as in network.jsn2
[Figure: a frame-level sub-network processes the textual/acoustic features and conditions the sample-level WaveNet blocks.]
Multiple networks with different time resolutions
• network.jsn uses 5 layers to handle the textual/acoustic data:
{ "size": 61, "name": "exInputL1", "type": "externalloader",
... }, { "size": 61, "name": "exInputSkip",
... }, { "size": 64, "name": "exInputL2", "type": "blstm",
... },
{ "size": 60, "name": "exInputL3", "type": "cnn",
... }, { "size": 61, "name": "exInputAdd", "type": "skipcat",
... },
[Figure: exInputL1 (externalloader) → exInputSkip (skipini) → exInputL2 (blstm) → exInputL3 (cnn) → exInputAdd (skipcat), fed by the input index and the textual/acoustic features.]
Multiple networks with different time resolutions
• exInputL1 loads the sample-level input index and the textual/acoustic features
• "externalDataMV" can be used by exInputL1 for feature normalization
• The path, dimensions, and other options of the textual/acoustic features are given in config.cfg and config_syn.cfg
{ "size": 61, "name": "exInputL1",
"type": "externalloader", "bias": 1.0,
"externalDataMV": "./data_mgcf0.mv.bin", "resolution": 80 },
Multiple networks with different time resolutions
• But what is the "resolution"?
• Remember that the input index is at the sample level while the sub-network processes frame-level features. This sub-network works at a slower tempo
• "resolution" indicates the relative tempo of the sub-network
• In this case, each frame is 5 ms while a waveform sample is (1/16) ms; thus, the resolution is 5 / (1/16) = 5 * 16 = 80
• Note: "resolutions" should also be given in config.cfg and config_syn.cfg (this is for a future implementation where multiple time resolutions can be defined)
{ "size": 61, "name": "exInputL1",
"type": "externalloader", "bias": 1.0,
"externalDataMV": "./data_mgcf0.mv.bin", "resolution": 80 },
Multiple networks with different time resolutions
• "resolution" has two effects:
  - It tells the network to allocate memory in terms of the number of frames, not waveform sampling points
  - It tells exInputL1 to load the textual/acoustic features at the frame level (not at the sample level as wavNetCore does)
[Figure: with resolution = 80, the externalloader maps the sample-level input index 0 0 … 0 | 1 1 … 1 | 2 2 … 2 to the frame indices 0, 1, 2 and loads the features at the frame level.]
Multiple networks with different time resolutions
• Of course, a more efficient way would be to directly copy the external frame-level features into the buffer of the externalloader
• "resolution" is implemented as a more flexible tool to load the data
• "resolution" must be provided for all the layers in the sub-network (for memory allocation)
• Note: to change the time resolution, we also need to change patTypes(), m_curSeqLength, etc. See void Layer<TDevice>::loadSequences in currennt_lib/src/layers/Layer.cu for more details
Multiple networks with different time resolutions
• Based on "resolution", we can use any type of network to process the textual/acoustic features at the frame level
• In network.jsn, I used a skipcat layer to concatenate the output of the CNN and the F0. The motivation is to use the original F0 as an input to the WaveNet
Multiple networks with different time resolutions
• Finally, note the "layerFlag" below:
• This layerFlag tells the first WaveNet block to use the output of this sub-network instead of loading external data directly
• "contextMV" is unnecessary for the WaveNet blocks, since they take the output of the sub-network as input
{"size": 61,"name": "exInputAdd","bias": 1.0,"type": "skipcat","resolution": 80,"preSkipLayer": "exInputL3,exInputSkip","preSkipLayerDim": "0_60_60_61","layerFlag": "wavenetConditionInputLayer"},
[Figure: the sub-network output replaces the externally loaded condition features of the first WaveNet block.]
Contents
• Elements of the WaveNet in CURRENNT
• Multi-networks with different time resolution
• Time vs. memory consumption during generation
Time & memory consumption: during training
• Each layer allocates memory buffers (outputs, gradients, …)
• The buffer size depends on the maximum waveform length
• Use truncate_seq in config.cfg if your GPU memory is small
[Figure: each layer of a WaveNet block allocates a buffer of size (maximum waveform length T) x (layer dimension).]
Time & memory consumption: during testing
• "truncate_seq" must not be used, because the sampling points of one waveform must be generated with the correct causal dependency
• Method 1: allocate the memory for the whole waveform length
  - Disadvantage: prohibitive GPU memory requirement (> 10 GB for generating 1 s of waveform if the network is not small)
  - Advantage: intermediate results can be saved for each time step
• Method 2: allocate the memory only for the required dependency
[Figure: generation over time steps 0 … 11 under the two allocation methods.]
Time & memory consumption: during testing
• Check the dependency of each layer:
  - Feedforward/skipini/skipcat …: the current time step
  - Dilated CNN: the current step t and the previous step t - R, where R is the dilation size
• Thus:
  - Feedforward …: allocate memory for one time step
  - Dilated CNN: allocate memory for multiple time steps
• The implementation is somewhat complicated. Please debug waveform generation starting from currennt_lib/src/NeuralNetwork.cpp
• Each layer type has a method called reduceOutputBuffer to reduce the memory allocation
• Each layer type changes the pointer used to access the data buffer
679  // for wavenet, reduce the memory in generation
680  if (flagSaveMemWavNet){
681      // only save the memory for layers between the feedback and output layer
682      if (counter < outputLayerIdx && counter > m_firstFeedBackLayer){
683          if (counter != Configuration::instance().outputFromWhichLayer())
684              layer->reduceOutputBuffer();
685      }
686  }
Time & memory consumption: layers with no causal dependency
• For example, a feedforward layer:
• Memory buffer without reduction: a buffer of length T and dimension D. Since the memory is column-major, for time t the pointer should point to t * D
• Memory buffer after reduction: a single D-dim buffer that every t reuses. The original pointer position is pos = t * D; to correct it for the reduced buffer, let pos = pos - shift, where shift = pos
• Why not directly set pos = 0? For flexibility: shift may be 0 if the previous layer is not reduced in memory
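A minimal sketch of this pointer arithmetic:

D = 64                     # layer size
t = 1000                   # current time step

pos = t * D                # position in the full (unreduced) buffer
shift = pos                # shift = pos for a reduced layer, 0 for an unreduced one
pos_reduced = pos - shift  # = 0 here: every time step reuses the same D-dim buffer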
Time & memory consumption: layers with no causal dependency
• For example, a feedforward layer (currennt_lib/src/layers/FeedforwardLayer.cu):

// Note: in the testing phase, the network uses computeForwardPass(timeStep, nnState)
// and only generates the output feature for the current "timeStep".
// This is not the computeForwardPass(nnState) used for training.
554  template <typename TDevice, typename TActFn>
555  void FeedForwardLayer<TDevice, TActFn>::computeForwardPass(const int timeStep,
556                                                             const int nnState)
557  {
…
562      int effTimeStart = timeStep * this->parallelSequences();     // start of current time step
563      int effTimeEnd   = (timeStep+1) * this->parallelSequences(); // end of current time step
564
565      // pointer to the input buffer (the output buffer of the previous layer)
566      int shiftIn = this->precedingLayer().outputBufPtrBias(
567                        timeStep * this->parallelSequences(), nnState);
568      // pointer to the output of this layer
569      int shiftOut = this->outputBufPtrBias(timeStep * this->parallelSequences(), nnState);
…
577      helpers::Matrix<TDevice> plOutputsMatrix(&this->precedingLayer().outputs(),
578                                               this->precedingLayer().size(),
579                                               this->parallelSequences(),
580                                               (effTimeStart * this->precedingLayer().size()
581                                                - shiftIn));
// effTimeStart * this->precedingLayer().size() is the input-buffer position without memory
// reduction, and shiftIn is the shift size. Because this layer doesn't know whether the
// previous layer is reduced in memory or not, it must use outputBufPtrBias() to get the shift.
Time & memory consumption: dilated CNN layers
• Naive implementation 1 (the very naive one):
  - At each time step t, the CNN layer only requires the output of the previous layer at times t and t - R, where R is the dilation size
  - The memory required for each layer is just 2 time steps
  - For example, for t = 4, Layer 1 must allocate memory to store the results at t = 2 and t = 4:
[Figure: three stacked layers (Layer 0, Layer 1, Layer 2) with dilation 2; the numbers denote time indices.]
1. Layer 0 calculates for t = 1 and t = 2
2. Layer 1 calculates for t = 2 and stores it
3. Layer 0 calculates for t = 3 and t = 4
4. Layer 1 calculates for t = 4 and stores it
5. Layer 2 calculates for t = 4
Time & memory consumption: dilated CNN layers
• Naive implementation 1 (the very naive one):
  - But this involves heavy, duplicated computation
  - The duplicated computation is indicated by the colors of the arrows in the figure
  - What can be done? Save the intermediate results!
[Figure: the same three-layer example; colored arrows mark the computations repeated across time steps.]
• The Fast WaveNet paper says that the public TensorFlow-based WaveNet code uses this naive implementation. Well, do TensorFlow users only know how to draw the network blocks? Perhaps so, if they want to get things done only by drawing network blocks
Time & memory consumption: dilated CNN layers
• Naive implementation 2 (save all the intermediate results):
  - By default, CURRENNT uses this implementation
  - Larger memory consumption (as mentioned in the training discussion above)
[Figure: every layer stores its outputs for all past time steps.]
• Therefore, the duplicated computation criticized in the Fast WaveNet paper is not really a problem for the naive CURRENNT implementation; it is the combination of duplicated computation and memory consumption that matters
[Figure: the Fast WaveNet scheme; each layer keeps a small queue of intermediate results instead of the full history.]
Time & memory consumption: dilated CNN layers
• Fast WaveNet implementation (save only the necessary intermediate results):
  - The memory space for the intermediate results is, in the Fast WaveNet paper, a queue
  - This idea avoids the duplicated computation
  - It is also memory-efficient: the required memory size only depends on the dilation size
Time & memory consumption: dilated CNN layers
• CURRENNT implementation (also saves only the necessary intermediate results):
  - Memory space = (dilation + 1) * feature_dimension
  - Difference from the Fast WaveNet implementation: no need for a fancy queue as the data buffer
  - Computation in a dilated CNN layer:

    initialize the buffer with 0
    for t = 0:T
        calculate memory address ptr1 = [t % (dilation_size + 1)]
        calculate memory address ptr2 = [(t + 1) % (dilation_size + 1)]
        transform the current output of the previous layer (using the CNN filters)
            and store it at ptr1
        calculate the output using the data at ptr1 and ptr2

  - Note that [(t + 1) % (dilation_size + 1)] = [(t - dilation_size) % (dilation_size + 1)]
  - Thus, ptr2 is just the address of t - dilation_size in a circular buffer (although the buffer is just a plain buffer)
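A minimal Python sketch of this addressing scheme; transform and combine stand in for the CNN filter transformation and the summation step:

import numpy as np

dilation, dim = 2, 4
buf = np.zeros((dilation + 1, dim))       # plain buffer, used circularly

def step(t, x_t, transform, combine):
    ptr1 = t % (dilation + 1)             # slot for the current (transformed) input
    ptr2 = (t + 1) % (dilation + 1)       # == (t - dilation) % (dilation + 1)
    buf[ptr1] = transform(x_t)            # store the transformed current input
    return combine(buf[ptr1], buf[ptr2])  # combine time t and time t - dilation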
[Figure: positions ptr1 and ptr2 in the buffer for the transformed input i_t and the output o_t.]
time | Naive implementation 2         | CURRENNT implementation
-----+--------------------------------+------------------------------------------------
  0  | store outputs for t = 0        | ptr1 = 0 % (2+1) = 0; ptr2 = (0+1) % (2+1) = 1;
     |                                | store at ptr1; sum data from ptr1 and ptr2
  1  | store outputs for t = 0, 1     | ptr1 = 1 % (2+1) = 1; ptr2 = (1+1) % (2+1) = 2;
     |                                | store at ptr1; sum data from ptr1 and ptr2
  2  | store outputs for t = 0, 1, 2  | ptr1 = 2 % (2+1) = 2; ptr2 = (2+1) % (2+1) = 0;
     |                                | store at ptr1; sum data from ptr1 and ptr2
Time & memory consumption: dilated CNN layers
• CURRENNT implementation, example for dilation = 2:
[Figure: buffer contents for t = 0, 1, 2 with dilation = 2. The transformed inputs i0, i1, i2 fill the three slots one by one, and each output o_t is computed from the slots at ptr1 and ptr2.]
Time & memory consumption: dilated CNN layers
• CURRENNT implementation, example for dilation = 2 (continued)
• The implementation is simple, fast, and memory-friendly
time | Naive implementation 2           | CURRENNT implementation
-----+----------------------------------+------------------------------------------------
  3  | store outputs for t = 0, ..., 3  | ptr1 = 3 % (2+1) = 0; ptr2 = (3+1) % (2+1) = 1;
     |                                  | store at ptr1 (overwriting the buffer);
     |                                  | sum data from ptr1 and ptr2
  4  | store outputs for t = 0, ..., 4  | ptr1 = 4 % (2+1) = 1; ptr2 = (4+1) % (2+1) = 2;
     |                                  | store at ptr1 (overwriting the buffer);
     |                                  | sum data from ptr1 and ptr2
[Figure: at t = 3 the slot holding i0 is overwritten by i3, and at t = 4 the slot holding i1 is overwritten by i4; the buffer always holds the most recent (dilation + 1) transformed inputs.]
Time & memory consumption: dilated CNN layers
• More details about the computation:
  - Suppose this CNN layer has 2 output channels (so the output feature dimension is 2 and the number of CNN filters is 2)
  - Suppose the dimension of the previous layer's output is 3
  - See CNNLayer<TDevice>::computeForwardPass(const int timeStep, const int nnState) for the two steps shown below
  - Please check the slides on CURRENNT_CNN for the CNN implementation
[Figure: step 1 (matrix transformation): the current input i4 is transformed by the causal CNN filters and stored in the buffer; step 2 (data summation): the transformed parts for the current and the delayed input are summed, e.g., [a, b] + [c, d] = [a + c, b + d], giving the output o4.]
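A minimal numpy sketch of the two steps for the slide's example (2 output channels, previous-layer dimension 3); the exact layout of the filter taps is an assumption here:

import numpy as np

W_cur  = np.random.randn(2, 3)  # filter tap applied to the current input
W_past = np.random.randn(2, 3)  # filter tap applied to the input at t - dilation

i_t = np.random.randn(3)

# Step 1: matrix transformation; the result is stored in the buffer at ptr1
cur_part = W_cur @ i_t          # [a, b] in the slide's notation

# The transform of the input at t - dilation was stored when that time step
# was processed and is read back from the buffer at ptr2 (stand-in values here):
past_part = np.random.randn(2)  # [c, d]

# Step 2: data summation
o_t = cur_part + past_part      # [a + c, b + d]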
Finally
• We have trained a WaveNet-based vocoder with 40 blocks. Please check https://github.com/TonyWangX/CURRENNT_Recipes/tree/master/temp/compareTTS/WAVENET
• The waveform generated given the natural acoustic features is very close to the natural speech
• The waveform generated given the generated acoustic features achieves a score similar to copy-synthesis speech, although the speaker similarity degrades
• So, please try the WaveNet on your own data
• Email me if you have any questions