SLT-Neural Sign Language Translation

1. Differences between SLT and CSLR

- Sign languages have their own linguistic rules [55] and do not map to spoken languages word by word.
- Sign language glosses give the meaning and order of the signs in the video, but the spoken language equivalent (which is what is actually desired) has both a different length and a different ordering.

2.1 Dataset

The collection and annotation of continuous sign language data is laborious work.

  • existing data is weakly annotated and lacks human pose information
  • one response: learning from weak data
  • another response: human pose estimation [8]

2.2 Sign Language Recognition

  • handcrafted intermediate representations
  • the temporal changes in these features have been modelled using classical graph-based approaches:
    - Hidden Markov Models (HMMs)
    - Conditional Random Fields (CRFs)
    - template-based methods
  • CNN for feature representation
  • RNN for temporal modelling: [36][17]

2.3 Seq2seq Learning and Neural Machine Translation

  • Connectionist Temporal Classification (CTC) Loss [25], which considers all possible alignments between two sequences while calculating the error.

CTC is not suitable for machine translation:
CTC assumes that the source and target sequences share the same order. Furthermore, CTC assumes conditional independence within the target sequence, which prevents the network from learning an implicit language model.
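The same-order assumption can be seen in CTC's collapsing rule, which maps a frame-level path to a label sequence by merging repeated symbols and then dropping blanks. The sketch below is only an illustration: the blank symbol `_` and the labels `A`, `B`, `C` are made up for the example and are not taken from the paper.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this example


def ctc_collapse(path):
    """Map a frame-level CTC path to its label sequence.

    Consecutive repeats are merged first, then blanks are removed.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]


# Two different frame-level alignments of the same label sequence.
path_a = ["_", "A", "A", "_", "B", "B", "_", "C"]
path_b = ["A", "_", "_", "B", "_", "C", "C", "C"]

print(ctc_collapse(path_a))  # ['A', 'B', 'C']
print(ctc_collapse(path_b))  # ['A', 'B', 'C']
```

Every path collapses to labels in the order in which they occur along the input, so a target whose word order differs from the sign order cannot be reached by any alignment.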

  • Encoder-Decoder + attention: this part is already familiar from machine translation.

3. Neural Sign Language Translation

  • input: sign video x = (x1, x2, ..., xT) with T frames
  • target: a spoken language sentence y = (y1, y2, ..., yU) with U words
  • T >> U

Notes:
- the alignment between sign and spoken language is unknown and non-monotonic
- the source side is video
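With x and y defined as above, the task can be written as learning the conditional probability of the sentence given the video. The factorization below is the standard autoregressive one used in sequence-to-sequence models; it is written in the notation of these notes and is not copied from the paper.

\[p(y \mid x) = \prod_{u=1}^{U} p(y_u \mid y_1, \ldots, y_{u-1}, x)\]

Each word is predicted from the whole video and the words generated so far, so no explicit frame-to-word alignment is required.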

3.1 Model Architecture

The model is a CNN combined with an attention-based encoder-decoder.

3.2 Spatial and Word Embeddings

  • An embedding transforms a sparse one-hot vector into a dense form. These embeddings can be learned from scratch or pretrained on a larger dataset.
  • Unlike text, signs are visual, so we need to learn spatial embeddings to represent sign videos.

source side spatial embedding: 2D CNN \[f_t=SpatialEmbedding(x_t)\]

target side word embedding: fully connected layer \[g_u=WordEmbedding(y_u)\]
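A minimal sketch of the target-side word embedding, assuming a bias-free fully connected layer applied to a one-hot vector. The vocabulary, the embedding size and the random weights are invented for illustration; in practice the weights are learned or pretrained.

```python
import random

random.seed(0)

VOCAB = ["<unk>", "morgen", "regen", "sonne"]  # hypothetical toy vocabulary
EMBED_DIM = 3                                   # hypothetical embedding size

# One weight row per vocabulary entry (random stand-ins for learned weights).
W = [[random.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB]


def one_hot(index, size):
    """Sparse representation: 1.0 at `index`, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def word_embedding(one_hot_vec, weights):
    """Bias-free fully connected layer: g_u = y_u^T W."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[i] * weights[i][j] for i in range(len(weights)))
        for j in range(dim)
    ]


y_u = one_hot(VOCAB.index("regen"), len(VOCAB))
g_u = word_embedding(y_u, W)

# Multiplying by a one-hot vector simply selects one row of W.
assert g_u == W[VOCAB.index("regen")]
```

This is why the layer is usually implemented as a table lookup rather than an explicit matrix product.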

3.3 Tokenization Layer

How to tokenize the input is an open design question, which makes this module worth further research. Options for splitting a sign language video:
- frame level
- gloss level, exploiting an RNN-HMM forced alignment approach [36]
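The two granularities can be sketched as follows, assuming a per-frame gloss alignment is already available (for example from a forced alignment). The function names and the toy alignment are hypothetical, and the alignment step itself is not shown.

```python
from itertools import groupby


def frame_tokens(num_frames):
    """Frame-level tokenization: every frame is its own token (start, end)."""
    return [(t, t + 1) for t in range(num_frames)]


def gloss_tokens(frame_labels):
    """Gloss-level tokenization from a per-frame alignment.

    Consecutive frames with the same gloss label form one token,
    returned as (gloss, start, end) with `end` exclusive.
    Note: two adjacent occurrences of the same gloss would be merged.
    """
    tokens = []
    start = 0
    for gloss, run in groupby(frame_labels):
        length = len(list(run))
        tokens.append((gloss, start, start + length))
        start += length
    return tokens


# Toy alignment: 9 frames covering 3 glosses.
alignment = ["MORGEN"] * 3 + ["REGEN"] * 4 + ["NORD"] * 2

print(len(frame_tokens(len(alignment))))  # 9
print(gloss_tokens(alignment))
# [('MORGEN', 0, 3), ('REGEN', 3, 7), ('NORD', 7, 9)]
```

The gloss-level sequence is much shorter than the frame-level one, which is the main practical difference for the encoder.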

3.4 Attention-based Encoder-Decoder Networks

This module is the same as in machine translation.

Open question: is the feature representation \(f_t\) of a single frame or gloss a one-dimensional vector?
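On the usual reading of encoder-decoder models, each source token is represented by one fixed-length vector, so the encoder produces T vectors of dimension D. The sketch below shows a single attention step over such a sequence. Dot-product scoring is chosen only for brevity and is an assumption; these notes do not record which scoring function the paper uses, and all numbers are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(decoder_state, encoder_outputs):
    """One attention step: return (weights, context vector)."""
    scores = [
        sum(o * s for o, s in zip(output, decoder_state))
        for output in encoder_outputs
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [
        sum(w * output[j] for w, output in zip(weights, encoder_outputs))
        for j in range(dim)
    ]
    return weights, context


# T = 4 source tokens, each a D = 2 dimensional vector (toy values).
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
decoder_state = [2.0, 0.0]

weights, context = dot_attention(decoder_state, encoder_outputs)
print(weights)  # one weight per source token, summing to 1
print(context)  # weighted average of the encoder outputs
```

The weights are recomputed at every decoding step, which is what lets the decoder attend to the source in a non-monotonic order.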

5. Dataset

The difference between PHOENIX14T and PHOENIX14:

Due to different sentence segmentation between spoken language and sign language, it was not sufficient to simply add a spoken language tier to PHOENIX14. Instead, the segmentation boundaries also had to be redefined. Wherever the addition of a translation layer necessitated new sentence boundaries, we used the forced alignment approach of [35] to compute the new boundaries.

6. Quantitative Experiments

  • Gloss2Text (G2T): assumes a perfect SLR system as the intermediate tokenization.
  • Sign2Text (S2T): end-to-end pipeline translating directly from frame-level sign language video into spoken language.
  • Sign2Gloss2Text (S2G2T): uses an SLR system as the tokenization layer to add intermediate supervision.

The difference between S2G2T and G2T is that S2G2T is an end-to-end system.

For our S2G2T experiments we use the CNN-RNN-HMM network proposed by Koller et al. [36] as our Tokenization Layer, which is the state-of-the-art CSLR.
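The data flow of the three setups can be sketched as plain function composition. Everything below is a toy stand-in: `toy_recognizer` and `toy_translator` are hypothetical placeholders, not the CNN-RNN-HMM or the translation network.

```python
def gloss_to_text(glosses, translate):
    """G2T: translate ground-truth glosses (a perfect recognizer is assumed)."""
    return translate(glosses)


def sign_to_text(frames, translate):
    """S2T: translate frame-level input directly, with no gloss supervision."""
    return translate(frames)


def sign_to_gloss_to_text(frames, recognize, translate):
    """S2G2T: a recognizer tokenizes the video into glosses, then translate."""
    return translate(recognize(frames))


def toy_recognizer(frames):
    # Pretend each frame is already tagged with its gloss; collapse repeats.
    glosses = []
    for frame in frames:
        if not glosses or glosses[-1] != frame:
            glosses.append(frame)
    return glosses


def toy_translator(tokens):
    # Placeholder "translation": lowercase and join the tokens.
    return " ".join(token.lower() for token in tokens)


frames = ["MORGEN", "MORGEN", "REGEN", "REGEN", "REGEN"]
print(sign_to_gloss_to_text(frames, toy_recognizer, toy_translator))  # morgen regen
print(gloss_to_text(["MORGEN", "REGEN"], toy_translator))             # morgen regen
```

The only structural difference between G2T and S2G2T in this sketch is where the glosses come from: annotations in the first case, a recognizer's output in the second.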

The performance of S2T is much lower than that of G2T or S2G2T. The authors give some possible reasons:
- Because the number of frames in a sign video is much higher than the number of its gloss-level representations, the S2T networks suffer from long-term dependencies and vanishing gradients.
- The dataset might be too small to allow the S2T network to generalize, considering the number of parameters (CNN + Encoder-Decoder + Attention).
- Expecting the networks to recognize visual sign languages and translate them to spoken languages with a single source of supervision might be too much to ask.

Compared to the S2T network, S2G2T performed better by a large margin, indicating the importance of intermediary expert gloss-level supervision in simplifying the training process.

7. Qualitative Experiments

- The S2T network's focus is concentrated primarily at the start of the video, but attention does jump to the end during the final words of the translation.
- In contrast, the S2G2T attention figure shows a much cleaner dependency of inputs to outputs. This is partly due to the intermediate tokenization removing the asynchronicity between different sign channels.

Papers to read from the references:

  • [55] W. C. Stokoe. Sign Language Structure. Annual Review of Anthropology, 9(1), 1980.
  • [8] Z. Cao, T. Simon, S.E. Wei, and Y. Sheikh. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [36] O. Koller, S. Zargaran, and H. Ney. Re-Sign: Re-Aligned End-to-End Sequence Modelling with Deep Recurrent CNN-HMMs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] R. Cui, H. Liu, and C. Zhang. Recurrent Convolutional Neural Networks for Continuous Sign Language Recognition by Staged Optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ACM International Conference on Machine Learning (ICML), 2006.
  • [35] O. Koller, H. Ney, and R. Bowden. Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data Is Continuous and Weakly Labelled. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [6] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden. SubUNets: End-to-end Hand Shape and Continuous Sign Language Recognition. In IEEE International Conference on Computer Vision (ICCV), 2017.