Not long ago I read the paper "Automatic Testing and Improvement of Machine Translation", which comes from the software testing field and proposes TransRepair, an automatic testing method for machine translation models. Below I summarize the paper from several angles.
Automatic Testing and Improvement of Machine Translation
TransRepair is a method for automatic detection and automatic repair of consistency problems in machine translation software.
TransRepair provides two repair modes, black-box and gray-box, for fixing consistency problems in machine translation software.
The main steps of TransRepair are test case generation, test oracle generation, and automatic repair.
TransRepair gives a clear, rigorous and detailed test case generation algorithm. It uses four methods to quantify the difference between sentences and compares them with one another.
TransRepair presents a fairly complete and clear experimental design with diverse results.
TransRepair uses a consistency principle as its test oracle: similar inputs should produce structurally similar translations.
My understanding of some key issues
- What is the consistency problem?
- How does TransRepair generate test cases?
- How does TransRepair verify that the output sentence pairs of the test cases are consistent?
- How to evaluate the experimental design of TransRepair?
- How does TransRepair's approach to thresholding differ from SIT?
- What is a black box for TransRepair's automatic repair? What is a gray box?
- What are the advantages of TransRepair? What's missing?
What is the consistency problem?
I would summarize it as follows: a machine translation system is given a set of sentences that differ only in specific words but share similar overall semantics and structure, and in the corresponding set of translations one or a few sentences are semantically or structurally inconsistent with the rest.
How does TransRepair generate test cases?
TransRepair replaces individual words in the original input sentence to form a group of mutant sentences. To choose the replacements it uses a word vector model, a kind of model I became familiar with while studying BERT. My understanding is that a word vector model represents the semantics of a word as a vector and maps it into a vector space, so that the semantics of words and the relationships between words can be studied mathematically. TransRepair measures how closely two words are related by computing the distance between their vectors. A word vector model looks at a word in isolation and ignores its role in the whole sentence. To compensate, once TransRepair has found a candidate word it substitutes the word into the sentence and runs a syntactic (constituent) analysis to check whether the semantics or syntax of the sentence has changed substantially.
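The replacement step can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the three-dimensional vectors in `TOY_VECTORS` are invented, the function names and the `min_sim` cutoff are hypothetical, and the structural check described above is left out. A real setup would load pretrained embeddings and filter the mutants with a parser.

```python
import math

# Invented toy vectors; a real system would load pretrained word embeddings.
TOY_VECTORS = {
    "men": (0.9, 0.1, 0.0),
    "women": (0.85, 0.2, 0.05),
    "people": (0.8, 0.15, 0.1),
    "banana": (0.0, 0.1, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(word, vectors, min_sim=0.9):
    """Words whose vector is close to `word`, most similar first."""
    if word not in vectors:
        return []
    base = vectors[word]
    scored = [(w, cosine(base, vec)) for w, vec in vectors.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, sim in scored if sim >= min_sim]


def generate_mutants(sentence, vectors, min_sim=0.9):
    """Replace one word at a time with each sufficiently similar word."""
    tokens = sentence.split()
    mutants = []
    for i, token in enumerate(tokens):
        for candidate in similar_words(token.lower(), vectors, min_sim):
            mutants.append(" ".join(tokens[:i] + [candidate] + tokens[i + 1:]))
    return mutants


mutants = generate_mutants("men do good research", TOY_VECTORS)
```

With these toy vectors, "women" and "people" are close enough to "men" to produce mutants, while "banana" is rejected by the similarity cutoff.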
How does TransRepair verify that the output sentence pairs of the test cases are consistent?
TransRepair uses an algorithm to quantify the consistency of each test case. The algorithm first uses Wdiff to compare the two translations (that of the original sentence and that of the mutant) and find the components in which they differ. To make the similarity score more reliable, TransRepair then builds, for each translation, a set of subsequences obtained by deleting some of the differing components, computes the similarity between the elements of the two sets pair by pair, and takes the largest value. In this way the change in similarity is tied as closely as possible to the replaced word, and the influence of other isolated word changes on the score and on the verdict is reduced. TransRepair quantifies similarity in four different ways, some of which are also used in SIT.
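A minimal sketch of this oracle follows, under several assumptions of mine: Python's `difflib.SequenceMatcher` stands in for Wdiff, a longest-common-subsequence ratio stands in for the similarity metric, and only one differing span is deleted at a time, from both translations together. The function names are hypothetical and the paper's exact procedure may differ.

```python
from difflib import SequenceMatcher


def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    """LCS length normalised by the longer sequence (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return lcs_len(a, b) / max(len(a), len(b))


def diff_spans(a, b):
    """Index ranges (a_start, a_end, b_start, b_end) where the tokens differ."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(i1, i2, j1, j2)
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def consistency_score(translation_1, translation_2):
    """Delete each differing span in turn and keep the best similarity."""
    a, b = translation_1.split(), translation_2.split()
    spans = diff_spans(a, b)
    if not spans:
        return 1.0
    return max(lcs_similarity(a[:i1] + a[i2:], b[:j1] + b[j2:])
               for i1, i2, j1, j2 in spans)


same = consistency_score("the men did good research",
                         "the women did good research")
drift = consistency_score("the men did good research",
                          "the women did bad study")
```

In the first pair the only difference is the replaced word, so deleting it leaves identical sentences and the score is 1.0. In the second pair a second, unrelated difference remains after any single deletion, so the score drops to 2/3 and the pair would be flagged under any threshold above that value.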
How to evaluate the experimental design of TransRepair?
The experimental design of TransRepair has its own characteristics. It first poses research questions and discusses how to answer them. The later experiments are designed around these four questions and report their data in a suitable form. The questions are logically connected and build on one another instead of standing in isolation. The experiments demonstrate the effectiveness of the method from several perspectives, including the accuracy of detection, the effectiveness of detection, the effectiveness of repair, and a comparison with manual methods. The experimental data are presented in an intuitive, easy-to-understand way.
How does TransRepair's approach to thresholding differ from SIT?
TransRepair combines manual labeling with statistical analysis. The key threshold for the consistency judgment is found by having the machine step through candidate values in small increments and keeping the statistically best one, which makes the threshold-setting logic more convincing. Most of SIT's thresholds are set from experience, which is less persuasive and less practical to apply.
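The small-step search can be sketched as below. This is only an illustration under my own assumptions: the F-measure as the selection criterion, the step size of 0.01, the function names, and the sample scores and labels are all invented here and are not data from the paper.

```python
def f_measure(scores, labels, threshold):
    """F1 of the rule 'score below threshold means inconsistent'.

    labels[i] is True when a human judged pair i to be inconsistent.
    """
    tp = fp = fn = 0
    for score, is_bug in zip(scores, labels):
        predicted_bug = score < threshold
        if predicted_bug and is_bug:
            tp += 1
        elif predicted_bug and not is_bug:
            fp += 1
        elif not predicted_bug and is_bug:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, steps=100):
    """Try thresholds 0, 1/steps, ..., 1 and keep the first best one."""
    best_t, best_f = 0.0, -1.0
    for k in range(steps + 1):
        t = k / steps
        f = f_measure(scores, labels, t)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Invented similarity scores and manual labels, for illustration only.
scores = [0.95, 0.90, 0.60, 0.40, 0.85, 0.30]
labels = [False, False, True, True, False, True]
threshold, f1 = best_threshold(scores, labels)
```

On this toy data any threshold above 0.60 and up to 0.85 separates the labeled bugs from the consistent pairs, so the sweep stops improving at the first such value.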
What is a black box for TransRepair's automatic repair? What is a gray box?
The black-box and gray-box variants of TransRepair's automatic repair differ in two things: the score that the ranking algorithm uses to select the best repair candidate, and the level of the system at which the repair is applied. The black-box example is Google Translate. Since the software is not open source, nothing is known about it beyond its inputs and outputs, so the repair can only work on the inputs and outputs themselves, before the input goes in or after the output comes out. The gray-box example is Transformer (a model I have worked with and studied before). Its source code and training set are available, so we can obtain the probability the model assigns to its result, because this is one of the values that accompanies the neural network's output. We can also repair the model through its training set or even through the model structure.
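The difference in ranking score can be sketched as follows. This is my own simplified reading, not the paper's implementation: in the black-box case each candidate translation is scored by its average similarity to the other candidates, and in the gray-box case by the probability the model reports. The Jaccard token overlap, the function names, and the sample data are assumptions chosen to keep the example self-contained.

```python
def jaccard(sentence_1, sentence_2):
    """Token-set overlap, a simple stand-in for a real similarity metric."""
    a, b = set(sentence_1.split()), set(sentence_2.split())
    return len(a & b) / len(a | b) if a | b else 1.0


def rank_black_box(candidates, similarity=jaccard):
    """Rank by mean similarity to the other candidates (outputs only)."""
    def mean_similarity(i):
        others = [similarity(candidates[i], c)
                  for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others) if others else 0.0

    order = sorted(range(len(candidates)), key=mean_similarity, reverse=True)
    return [candidates[i] for i in order]


def rank_gray_box(candidates_with_prob):
    """Rank by the probability the model assigned to each translation."""
    ranked = sorted(candidates_with_prob, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked]


# Invented candidate translations and probabilities, for illustration only.
black_ranked = rank_black_box([
    "men do good work",
    "men do good research",
    "men do great research",
    "people make bad study",
])
gray_ranked = rank_gray_box([("translation x", 0.2), ("translation y", 0.7)])
```

The black-box ranking needs nothing but the translated text, which is all a closed service exposes, while the gray-box ranking relies on a value that is only visible when the model itself can be inspected.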
What are the advantages of TransRepair? What's missing?
The advantage of TransRepair is that it detects and repairs consistency problems automatically. Its method and data score well on feasibility, accuracy, and reproducibility, which follows from the precise definition of its implementation and from the way it accounts and compensates for the shortcomings of the existing techniques it builds on. Its disadvantages are its efficiency and the fact that its effectiveness is limited to consistency problems.