Structure-Invariant Testing for Machine Translation (SIT) paper reading summary

Not long ago, I read Structure-Invariant Testing for Machine Translation, a paper that studies how to detect robustness problems in machine translation software. Below I share my understanding of it from several angles.

The main content of Structure-Invariant Testing for Machine Translation

SIT is an approach for detecting robustness problems in machine translation software systems.
SIT is built on the metamorphic relation "structural invariance" from metamorphic testing.
The main steps of SIT are: select original sentences, generate similar sentences, obtain translations from the translation software, analyze the structure of the translations and quantify how much they differ, and flag the translations whose difference exceeds a set threshold.
SIT is efficient: it can process 2k+ sentences in 19 seconds. For Google/Bing Translate it reaches an accuracy of 70%, which leaves room for improvement (the choice of threshold may be one reason).
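
The steps above can be sketched as a small pipeline. This is only my own illustration, not the authors' implementation: the function name sit_pipeline, its parameters, and the toy stand-ins for the sentence generator, the translator, and the distance metric are all hypothetical.

```python
from typing import Callable, List, Tuple


def sit_pipeline(
    source: str,
    generate_similar: Callable[[str], List[str]],
    translate: Callable[[str], str],
    structure_distance: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, str, float]]:
    """Return (similar sentence, its translation, distance) for suspicious cases."""
    reference = translate(source)
    suspicious = []
    for variant in generate_similar(source):
        translated = translate(variant)
        distance = structure_distance(reference, translated)
        # Flag the pair when the two translations differ too much in structure.
        if distance > threshold:
            suspicious.append((variant, translated, distance))
    return suspicious


# Toy stand-ins (all hypothetical) so the sketch runs without a real translator.
def toy_generate(sentence: str) -> List[str]:
    return [sentence.replace("hairy", word) for word in ("furry", "quiet")]


# Placeholder "translations": the tokens A..F stand for target-language words.
TOY_TRANSLATIONS = {
    "I like the hairy dog": "A B C D E",
    "I like the furry dog": "A B C F E",
    "I like the quiet dog": "A B",  # pretend the system dropped words here
}


def toy_distance(a: str, b: str) -> float:
    # Placeholder metric: difference in token count, not a real parse comparison.
    return abs(len(a.split()) - len(b.split()))


issues = sit_pipeline(
    "I like the hairy dog",
    toy_generate,
    TOY_TRANSLATIONS.__getitem__,
    toy_distance,
    threshold=1,
)
print(issues)
```

Passing the generator, translator, and metric in as functions keeps the sketch independent of any particular translation service or parser.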

Understanding of several key issues

  1. Why does machine translation software have robustness issues?
  2. What is structural invariance?
  3. Why introduce structural invariance?
  4. How to use structural invariance to generate sentences that are semantically and syntactically similar?
  5. How to quantify the difference between sentences to determine whether the machine translation software system has a robustness problem?
  6. What are the advantages of SIT? What are its shortcomings?
  7. What applications can SIT have?

Why does machine translation software have robustness issues?

The core modules of a machine translation software system are generally built with deep learning methods. Each layer of a deep learning model usually has a high dimension, so in an insufficiently trained model the boundaries between the regions of the vector space that correspond to different labels are likely to be poorly defined. When an input lies near such a boundary, a slight change to it can push the result at some layer from one side of the boundary to the other, causing the final output of the model to change drastically.
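
A toy illustration of this boundary effect, under strong simplifying assumptions: a two-dimensional linear classifier stands in for a single decision inside a deep model, and the function classify, its weights, and the sample points are all made up for this example.

```python
def classify(x, weights=(1.0, -1.0), bias=0.0):
    """Toy linear decision: the boundary is the line x[0] == x[1]."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return "label_A" if score >= 0 else "label_B"


# A point close to the boundary: a tiny change flips the label.
near_boundary = (0.50, 0.49)
near_perturbed = (0.50, 0.51)

# A point far from the boundary: the same-sized change has no effect.
far = (0.90, 0.10)
far_perturbed = (0.90, 0.12)

print(classify(near_boundary), classify(near_perturbed))
print(classify(far), classify(far_perturbed))
```

In a real translation model the flip happens in a much higher-dimensional space and propagates through later layers, which is why a one-word change in the input can alter the whole translation.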

What is structural invariance?

Structural invariance, in my own words, means that when a sentence in some language undergoes a small, specific word-level modification, the semantic and grammatical structure of its translation usually stays unchanged. In my opinion, this property is an entry point, in an empirical and statistical sense, for research on testing machine translation software systems.
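
One way to state this property more precisely, in notation of my own rather than the paper's: let T be the translation system, s a source sentence, S(s) the set of sentences obtained from s by replacing a single word, struct(.) a structural representation of a sentence (for example a parse tree), d a distance between structures, and \tau a chosen threshold. Structural invariance then expects:

```latex
\forall s' \in S(s): \quad
d\bigl(\mathrm{struct}(T(s)),\ \mathrm{struct}(T(s'))\bigr) \le \tau
```

A pair that violates this inequality is only a suspicious case to be reported, not a proven translation error.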

Why introduce structural invariance?

I see structural invariance as a property that is expected to hold when the input is transformed, that is, a metamorphic relation. The paper introduces it into software robustness testing for two purposes:

First, because the relationships and variations within natural language are complex, it is difficult to obtain a general theorem that can serve as a benchmark for testing. At this stage, therefore, this property is used to control the variables and to obtain a reference point that is reasonably sound in an empirical or statistical sense, on which the testing research can be carried out. Second, test cases for natural language are difficult to construct manually. Introducing this property makes it possible to generate a large number of test cases from a small number of existing samples.

How to use structural invariance to generate sentences that are semantically and syntactically similar?

SIT uses the BERT model. I have previously worked with BERT and studied the principles behind it.
SIT relies on BERT's pre-training on a large corpus, together with training techniques such as token masking and bidirectional context modeling, to keep a word replacement from changing the meaning of the whole sentence or producing text that violates grammar or usage habits. SIT adds a lightweight classifier layer on top of BERT to help generate the candidate list of replacement words.
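
A minimal sketch of the word-replacement idea, assuming a masked-word predictor is available. The function generate_similar_sentences and the callable predict_masked are hypothetical names of mine, and toy_predictor is a hard-coded stand-in for a masked language model such as BERT, so that the example runs without downloading a model.

```python
from typing import Callable, List

MASK = "[MASK]"


def generate_similar_sentences(
    sentence: str,
    predict_masked: Callable[[List[str], int], List[str]],
    top_k: int = 3,
) -> List[str]:
    """Replace one word at a time with candidates proposed for the masked slot."""
    tokens = sentence.split()
    variants = []
    for i, original in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        for candidate in predict_masked(masked, i)[:top_k]:
            # Skip the original word so every variant differs by exactly one word.
            if candidate.lower() != original.lower():
                variants.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
    return variants


def toy_predictor(masked_tokens: List[str], position: int) -> List[str]:
    # Hypothetical stand-in: a real predictor would score words from context.
    suggestions = {3: ["hairy", "furry", "fluffy"], 4: ["dog", "cat"]}
    return suggestions.get(position, [])


variants = generate_similar_sentences("I like the hairy dog", toy_predictor)
print(variants)
```

Each variant differs from the original in a single word, which is what lets the later comparison attribute a large structural change to the translator and not to the input.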

How to quantify the difference between sentences to determine whether the machine translation software system has a robustness problem?

SIT uses three methods to compute a numerical measure of the difference between sentences: string difference analysis, constituency parse tree analysis, and dependency parse tree analysis. SIT applies these three analyses directly to the output of the translation software and compares how well each of them works. In my opinion, judging from the data, each of the three analyses is too one-sided when used alone. The authors could explore combining the three methods in future work.
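
To make the idea of a numerical sentence difference concrete, here is a sketch of two simple metrics: a character-level edit distance for the string analysis, and a comparison of label counts that could be applied to the labels read off a constituency or dependency parse. This reflects my own reading and is not the paper's exact definition. The function names are mine, and the dependency labels in the usage example are written by hand, not produced by a parser.

```python
from collections import Counter
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute or match
            ))
        previous = current
    return previous[-1]


def label_count_distance(labels_a: List[str], labels_b: List[str]) -> int:
    """Sum of absolute differences in how often each parse label occurs."""
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    return sum(abs(counts_a[k] - counts_b[k]) for k in counts_a.keys() | counts_b.keys())


string_gap = edit_distance("kitten", "sitting")
structure_gap = label_count_distance(
    ["nsubj", "dobj", "det"],           # hand-written labels for translation 1
    ["nsubj", "det", "amod", "amod"],   # hand-written labels for translation 2
)
print(string_gap, structure_gap)
```

A combined score, for example a weighted sum of such metrics, would be one way to explore the combination suggested above.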

What are the advantages of SIT? What are its shortcomings?

The authors discuss this in the paper, so I will not repeat it here. Overall, I think the strength of SIT lies in its ability to detect various types of errors (under-translation, over-translation, incorrect modification, unclear logic). However, I think its methods for generating test cases and for quantifying and detecting errors are somewhat rough, which leads to relatively low accuracy in the experiments. Repairing the detected problems and setting the threshold both require manual work, which is another major shortcoming.

What applications can SIT have?

SIT is mainly used for robustness testing of machine translation software systems that are built on AI models. The robustness of such software can be improved by combining SIT's automatic detection with manual repair of the training samples.