BERT and the French Language: An Overview of Various Initiatives


« In the last few years, a great deal of research has been carried out in the area of Natural Language Processing (NLP). This field of Artificial Intelligence (AI) has huge potential, with applications throughout everyday life. Pre-trained language models have recently established a new state of the art in NLP. These models take advantage of the huge amount of unlabeled text available to learn good representations of words. They can also be fine-tuned on more specific datasets to address particular NLP problems.

In this paper, we focus on BERT, one of the pioneering models in this area. Its ability to use the context of words to better capture their meaning has revolutionized the field of Natural Language Processing. We study it with a particular interest in its use for the French language. CamemBERT and FlauBERT are to date the only two models that use the BERT architecture and are designed exclusively for French. We present the theory behind these models, explain how they work, and conclude with a comparison of the two.

Today, the explosion in the quantity of data has given us new ways to see the world. The huge volume of data available requires changes in how data is captured, stored, searched, shared, analyzed, and visualized. Processing big data, which comes from many digital sources, opens up new possibilities for exploring information.

Companies are overwhelmed by the flood of data. As a result, they are compelled to extract the relevant information from it in order to develop high value-added strategies and keep pace with an ever-changing environment. The management of industrial strategies now depends largely on the ability of companies to access strategic information that enhances the value of their capital. This information can therefore be a source of new knowledge. Natural Language Processing (NLP) is currently one of the most popular big data methods and addresses one of the main challenges of collecting information.

Indeed, the number of NLP patents for applications such as voice assistants, machine translation, and chatbots has grown at an average annual rate of 44% over the past five years, now reaching more than 3,000 publications per year, according to the 2017 annual report of the technology research firm Lux Research.

The remainder of this paper is organized into four sections. Section 2 summarizes the history of NLP, the main steps of a typical implementation, and the details of the theoretical methods. Section 3 focuses on the recent development of the BERT architecture, and Section 4 covers its adaptation to the French language, with a comparison of the resulting models. Finally, Section 5 draws the conclusion ».