BnF combines sociology and big data to understand its Gallican users
November 3, 2016 - Big Data & AI

The Bibliothèque nationale de France (BnF) has always sought to know and understand its users. This is a particularly delicate task when it comes to the users of Gallica, its digital library. To get a better grasp of them than interviews with user samples alone allow, BnF teamed up with Télécom Paris and its multidisciplinary skills. To meet the challenge, the researchers are using the TeraLab platform of IMT (Institut Mines-Télécom) to collect and process big data.
Often seen as a vector for technological innovation, could big data also represent an epistemological revolution? The use of mass data in the experimental sciences is not new, and has already proved its worth. But the human sciences are not to be outdone. In April 2016, the Bibliothèque nationale de France (BnF) leveraged its long-standing partnership with Télécom Paris (see box below) to conduct research on users of Gallica, its library of digitized documents freely accessible online. The methodology adopted is based in part on the analysis of large volumes of data collected when users connect to the site.
Each time a user connects to Gallica, the BnF server records log data covering all the actions the user carries out. This information includes the pages opened by the "gallicanaute" on the site, the time spent on them, the links clicked on the page, the documents downloaded, and so on. Anonymized in accordance with the rules laid down by the CNIL, the logs thus constitute a veritable map of the user's journey, from arrival on Gallica to departure from the site.
With 14 million visits per year, this information represents a significant volume of data to be processed. All the more so as it must be correlated with the records of the 4 million documents that can be consulted (document type, date of creation, author, etc.), which also contain important information for understanding users and their interest in a document. Sociological fieldwork alone, based on interviews with a greater or lesser number of users, is not enough to capture the great diversity and complexity of today's web journeys.
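As a rough illustration of what correlating logs with document records can involve, the sketch below groups log events into sessions and attaches the record of each consulted document. Everything in it is an assumption made for the example: the event layout, the record fields, the 30-minute inactivity threshold and the function names are hypothetical and do not describe the BnF's actual log format or the researchers' pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, simplified log events: (anonymized visitor id, timestamp, document id).
LOG_EVENTS = [
    ("v1", "2016-04-01T10:00:00", "doc_a"),
    ("v1", "2016-04-01T10:05:00", "doc_b"),
    ("v1", "2016-04-01T15:00:00", "doc_a"),
    ("v2", "2016-04-01T11:00:00", "doc_c"),
]

# Hypothetical catalogue records keyed by document id.
RECORDS = {
    "doc_a": {"type": "press", "year": 1914},
    "doc_b": {"type": "press", "year": 1915},
    "doc_c": {"type": "photograph", "year": 1900},
}

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def build_sessions(events, gap=SESSION_GAP):
    """Group events per visitor, starting a new session after a long pause."""
    by_visitor = defaultdict(list)
    for visitor, stamp, doc_id in events:
        by_visitor[visitor].append((datetime.fromisoformat(stamp), doc_id))

    sessions = []
    for visitor, visits in sorted(by_visitor.items()):
        visits.sort()
        current = [visits[0]]
        for previous, visit in zip(visits, visits[1:]):
            if visit[0] - previous[0] > gap:
                sessions.append((visitor, current))
                current = []
            current.append(visit)
        sessions.append((visitor, current))
    return sessions


def enrich(sessions, records):
    """Attach the catalogue record of each consulted document to its session."""
    return [
        {"visitor": visitor, "documents": [dict(records[doc], id=doc) for _, doc in visits]}
        for visitor, visits in sessions
    ]


sessions = enrich(build_sessions(LOG_EVENTS), RECORDS)
for session in sessions:
    print(session["visitor"], [d["type"] for d in session["documents"]])
```

Each enriched session then carries both the journey (which documents, in which order) and the descriptive fields of those documents, which is the kind of combined view the paragraph above describes.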
Télécom Paris researchers have therefore adopted a multidisciplinary approach. Sociologist Valérie Beaudouin has joined forces with François Roueff, in a dialogue between sociological analysis of usage through field surveys on the one hand, and data mining and modeling on the other. "By adding this big data component, we can exploit information from logs and records to determine typical behavioral profiles of gallicanautes," points out Valérie Beaudouin. The data is collected and processed on IMT's TeraLab platform. This provides researchers with a turnkey working environment that can be customized as required, and offers more advanced functionalities than commercial data processing tools.
What are the profiles of gallicanautes?
François Roueff and his colleagues are developing unsupervised learning algorithms to extract behavioral categories from the mass of data. After six months' work, the first results are in. First of all, only 10-15% of Gallica users have a browsing habit involving the consultation of several digitized documents. The remaining 85-90% of users make one-off visits to a specific document.
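The article does not say which unsupervised algorithms the team uses. Purely to illustrate the idea of extracting behavioral categories without labels, here is a minimal k-means sketch on two invented session features (number of documents consulted and minutes spent on the site). The data, the features and the choice of k-means are all assumptions, and the toy figures are not meant to reproduce the shares reported above.

```python
import math

# Hypothetical session features: (documents consulted, minutes on site).
SESSIONS = [
    (1, 2.0), (1, 1.0), (1, 3.0), (1, 2.5), (2, 4.0),
    (1, 1.5), (1, 2.0), (1, 3.5), (8, 40.0), (12, 55.0),
]


def kmeans(points, centroids, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid unchanged
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Deterministic start: the shortest and the longest session (an arbitrary choice).
labels, centroids = kmeans(SESSIONS, [min(SESSIONS), max(SESSIONS)])
share_browsing = labels.count(1) / len(labels)
print(f"browsing sessions: {share_browsing:.0%}")
```

On real logs the features would need to be scaled and the number of clusters chosen with care; the point here is only that the categories come out of the data rather than being defined in advance.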
"We're seeing some very interesting things in the 10-15% of Gallicanauts involved," notes François Roueff. " If we analyze Gallica connection sessions in terms of the diversity of document types (monographs, press, photographs, etc.), eight out of ten classes use only one type," he continues. This reflects a tropism on the part of users towards a certain media. When it comes to documenting, Gallican users are generally not very varied in the way they gather information. Some users will seek information on a given subject solely from photographs, others only from press articles.
According to Valérie Beaudouin, the heart of the work lies in understanding these behaviors. "Based on these results, we develop hypotheses, which must then be validated by cross-referencing them with other survey methodologies," explains the sociologist. Data analysis is therefore complemented by an online questionnaire for gallicanautes, by field surveys with users, and even by equipping certain users with cameras to monitor their activity in front of the screen.
"Field studies have shown, for example, that some Gallica users prefer to download documents for offline reading, while others prefer online consultation to benefit from zoom quality," she notes. The Télécom Paris team has also noticed that, to find a document in the digital library, some users prefer to search on Google and add the word "Gallica" to their query rather than use the site's internal search engine.
Validating hypotheses also requires close collaboration with teams at BnF, who provide knowledge of the institution and the technical tools available to users. Philippe Chevallier, head of research at BnF's Strategy and Research Department, testifies to the benefits of dialogue with researchers: "Through our exchanges with Valérie Beaudouin, we have learned to make the most of the information obtained by our community managers on the audiences engaged on social networks, or the user opinions sent in by e-mail."
Audience analysis: a real institutional challenge
The project has already made BnF aware of the resources at its disposal for user analysis. This is a further point of satisfaction for Philippe Chevallier, who is keen to see the project succeed. "This project is proof that knowledge of the public can be a research issue," he says passionately. "It's too important an issue for an institution like ours, so we have to devote time to it, and mobilize real scientific expertise to do so," he continues.
And for Gallica, the mission is even more crucial, as it is impossible to see a gallicanaute, whereas it is always possible to observe the majority profile of audiences at BnF's physical sites. "Many tools are now available to companies and institutions to easily capture online information on usage or opinions: e-reputation tools, audience measurement tools, etc. Some of these tools are interesting, but they offer few possibilities for controlling their methods and therefore their results. Our responsibility is to provide the Library with sufficiently rich and solid information on its audiences, and to do this, we need to collaborate with the world of research," insists Philippe Chevallier.
To gather this information, the project will run until 2017. The results will provide the cultural institution with avenues for improving its services. "We have a public service mission, making knowledge accessible to as many people as possible," Philippe Chevallier points out. In the light of the researchers' observations, the question will be how to optimize Gallica. Who should be given priority? The minority of users who spend the most time on it, or the vast majority who use it only sporadically? Users with an academic profile (researchers, teachers, students) or the "general public"?
Until BnF takes a position on these issues, the multidisciplinary Télécom Paris team will need to continue its efforts to describe gallicanautes. In particular, it will seek to refine the categorization of sessions by enriching them through semantic analysis of the records of the 4 million digitized documents. This should make it possible to better identify, within the large volume of data collected, the themes to which the sessions relate. The task poses modeling problems that require particular attention, since the content of records is inherently heterogeneous: it varies greatly according to the type of document and the conditions under which it was digitized.
Online audiences, a 15-year-old interest for BnF
The first survey carried out by BnF to identify its online audience dates back to 2002, 5 years after the launch of its digital library, and took the form of a research project that already combined several approaches (online questionnaire, log analysis, etc.). In the years that followed, the focus on digital users grew. In 2011, a consulting firm carried out a survey of 3,800 Gallica users. Realizing that the study of audiences required more in-depth work, BnF turned to Télécom Paris in 2013 to assess the different possible approaches to a sociological analysis of digital uses. At the same time, BnF launched its first big data research project, to measure Gallica's place on the French web devoted to the Great War. In 2016, the sociology of online usage and experimentation with big data came together in a project to understand Gallica's uses and users.















