Facebook's TransCoder AI 'Bests' Commercial Rivals Translating Between Code Languages
Researchers at Facebook say they've developed a new system, called a neural transcompiler, capable of converting code from one high-level programming language like Java, Python, or C++ into another, according to a study posted on a preprint website.
Facebook AI researchers create inter-code translation system
The system is unsupervised, which means it seeks previously undetected patterns in data sets without guiding labels and with a minimal degree of human supervision, reports VentureBeat.
Notably, it reportedly outperforms the rule-based approaches that other systems use for code translation by a "significant" margin.
"TransCoder can easily be generalized to any programming language, does not require any expert knowledge, and outperforms commercial solutions by a large margin," wrote the coauthors of the preprint study. "Our results suggest that a lot of mistakes made by the model could easily be fixed by adding simple constraints to the decoder to ensure that the generated functions are syntactically correct, or by using dedicated architectures."
Moving an existing codebase to a modern and more efficient language like C++ or Java takes serious expertise in both the source and target languages, which typically makes it a pricey process. Commonwealth Bank of Australia spent roughly $750 million over five years to convert its platform from COBOL to Java. Transcompilers help here because they remove the need to rewrite code from scratch, but they are also difficult to build because different languages have varying syntax and use distinctive platform APIs, variable types, and standard-library functions, reports VentureBeat.
Facebook's New TransCoder system
Called TransCoder, Facebook's new system can translate between Java, C++, and Python — completing difficult tasks without the supervision such projects typically require. The new system is first initialized with cross-lingual language model pretraining — a process that maps partial code expressions whose meanings overlap to identical representations independent of programming language.
Tokens in the input source code sequences are masked out at random, and TransCoder is tasked with predicting the masked-out tokens from their surrounding context.
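The random masking described above can be sketched in a few lines of Python. This is only an illustration, not Facebook's code: the `<MASK>` placeholder, the `mask_tokens` helper, and the masking probability are all assumptions made for the example.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token, chosen for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; return the masked sequence and the hidden answers."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token  # what the model would be asked to predict
        else:
            masked.append(token)
    return masked, targets


tokens = "if ( x > 0 ) return x ;".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(targets)
```

A model trained this way sees `masked` as input and is scored on how well it recovers the entries in `targets`.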
A related training step, called denoising auto-encoding, trains TransCoder to generate valid sequences even when it is given noisy input data. Back-translation then lets TransCoder generate parallel data that is later used for additional training.
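To make "noisy input" concrete, here is a minimal sketch of a corruption function of the kind a denoising auto-encoder could be trained against. The specific noise types (dropping, masking, and locally shuffling tokens), the probabilities, and the `add_noise` name are assumptions for illustration; the article does not specify them.

```python
import random

MASK = "<MASK>"  # hypothetical placeholder token, chosen for this sketch


def add_noise(tokens, drop_prob=0.1, mask_prob=0.1, shuffle_window=3, seed=0):
    """Corrupt a token sequence by dropping, masking, and lightly shuffling it."""
    rng = random.Random(seed)
    # 1. Drop some tokens entirely.
    kept = [t for t in tokens if rng.random() >= drop_prob]
    # 2. Replace some of the remaining tokens with a mask placeholder.
    masked = [MASK if rng.random() < mask_prob else t for t in kept]
    # 3. Shuffle locally: each token gets a sort key near its original position.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(masked))]
    return [t for _, t in sorted(zip(keys, masked), key=lambda pair: pair[0])]


tokens = "int add ( int a , int b ) { return a + b ; }".split()
noisy = add_noise(tokens)
# A training example is the pair (noisy, tokens): corrupted input, clean target.
print(noisy)
```

The model is asked to rebuild the clean `tokens` from `noisy`, which pushes it toward always emitting well-formed code.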
TransCoder's cross-lingual ability comes from the many common tokens, also called anchor points, that exist across programming languages. These include shared keywords like "while," "try," "for," and "if," in addition to digits, English strings, and mathematical operators that show up in the source code.
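The idea of anchor points is easy to see on a small example. The sketch below tokenizes three equivalent snippets with a deliberately crude regular expression (an assumption for this illustration, not the tokenizer TransCoder uses) and prints the tokens all three have in common.

```python
import re


def tokenize(code):
    """Crude tokenizer: identifiers/keywords, numbers, or single symbols."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]", code))


python_src = "for i in range(10):\n    if i % 2 == 0:\n        print(i)"
java_src = "for (int i = 0; i < 10; i++) { if (i % 2 == 0) { System.out.println(i); } }"
cpp_src = "for (int i = 0; i < 10; i++) { if (i % 2 == 0) { std::cout << i; } }"

# Tokens present in all three snippets act as anchor points.
shared = tokenize(python_src) & tokenize(java_src) & tokenize(cpp_src)
print(sorted(shared))
```

Keywords such as `for` and `if`, the digits, and the `%` operator appear in all three snippets, while language-specific names like `print` or `System` do not.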
Back-translation helps the system improve code translation quality by pairing a source-to-target model with a "backward" target-to-source model trained in the opposite direction. The target-to-source model translates target sequences into the source language, which creates noisy source sequences, while the source-to-target model learns to reconstruct the original target sequences from those noisy sources. The two models are trained in parallel until they converge.
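The data flow of back-translation can be sketched as follows. The two "models" here are trivial string-replacement stand-ins, and every function name is hypothetical; in the real system both directions are neural networks that keep improving as they train on each other's output.

```python
def make_pairs(monolingual_corpus, backward_translate):
    """Pair each real sequence with a machine-made version in the other language."""
    return [(backward_translate(code), code) for code in monolingual_corpus]


# Toy stand-ins for the two translation directions (not real models).
def toy_python_to_cpp(code):
    return code.replace("print(", "std::cout << (")


def toy_cpp_to_python(code):
    return code.replace("std::cout << (", "print(")


python_corpus = ["print(x)"]
cpp_corpus = ["std::cout << (x)"]

# Training data for Python -> C++: generated (possibly noisy) Python in, real C++ out.
pairs_for_py2cpp = make_pairs(cpp_corpus, toy_cpp_to_python)
# Training data for C++ -> Python: generated (possibly noisy) C++ in, real Python out.
pairs_for_cpp2py = make_pairs(python_corpus, toy_python_to_cpp)

print(pairs_for_py2cpp)
print(pairs_for_cpp2py)
```

Each direction only ever needs code written in one language, which is why no hand-built parallel dataset is required.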