Version 2.0

Release Date 22/10/2021

Last Updated 13/12/2021

Gloria Corpas Pastor, Miriam Seghiri Domínguez, Romano Maggi

Holders of the exploitation rights
Universidad de Málaga


Corpora are routinely distinguished from other kinds of textual collections by being “representative” or constituting a “sample”, yet there is no consensus amongst the experts on what this means. The size of the corpus is a decisive factor in determining whether the sample is representative in relation to the needs of the research project. Even so, the concept of representativeness remains surprisingly imprecise, considering its acceptance as the central characteristic that distinguishes a corpus from any other kind of collection. Two questions therefore recur: what minimum number of texts guarantees that the sample is scientifically valid, and at what point the number of texts included, and therefore the number of words, can be judged sufficient.

Now, for the first time, corpus representativeness can be measured a posteriori by means of the N-Cor algorithm. “ReCor” is a computer application based on the N-Cor algorithm that calculates the minimum number of documents and words that a specialised language corpus should include in order to be considered representative. “ReCor” is implemented in Java and comprises three components: a) Words (algorithms for computing, reading from and writing to files); b) Gui (graphical user interface); c) Graphical window (adapter for graphical representation).
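The description above does not say how N-Cor works internally. Purely as an illustration of what an a posteriori check on corpus size could look like, the Java sketch below makes an assumption: it tracks the cumulative type/token ratio (distinct words divided by running words) as documents are added one by one, and reports the document count from which that ratio stops changing by more than a chosen threshold. The class name `TypeTokenSketch`, the methods `cumulativeRatios` and `stableFrom`, the tokenisation rule and the `epsilon` threshold are all hypothetical and are not taken from ReCor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: NOT the N-Cor algorithm or ReCor's code.
final class TypeTokenSketch {

    /** Cumulative type/token ratio after each document is added, in list order. */
    static List<Double> cumulativeRatios(List<String> documents) {
        Set<String> types = new HashSet<>();   // distinct word forms seen so far
        long tokens = 0;                       // running words seen so far
        List<Double> ratios = new ArrayList<>();
        for (String doc : documents) {
            // Assumed tokenisation: split on anything that is not a letter or digit.
            for (String word : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (word.isEmpty()) {
                    continue;
                }
                types.add(word);
                tokens++;
            }
            ratios.add(tokens == 0 ? 0.0 : (double) types.size() / tokens);
        }
        return ratios;
    }

    /**
     * Smallest document count after which every further document changes the
     * ratio by at most epsilon. Returns -1 if the ratio never settles.
     */
    static int stableFrom(List<Double> ratios, double epsilon) {
        int candidate = -1;
        for (int i = 1; i < ratios.size(); i++) {
            double delta = Math.abs(ratios.get(i) - ratios.get(i - 1));
            if (delta > epsilon) {
                candidate = -1;          // change too large: discard earlier candidate
            } else if (candidate < 0) {
                candidate = i;           // index i - 1 holds the ratio for i documents
            }
        }
        return candidate;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a b", "a b", "a b", "a b");
        List<Double> ratios = cumulativeRatios(docs);
        System.out.println("ratios = " + ratios);
        System.out.println("stable from " + stableFrom(ratios, 0.1) + " documents");
    }
}
```

In this sketch the word count that corresponds to the reported document count would be the running token total at that point. The result also depends on the order in which documents are fed in, so a real measurement would need to settle on an ordering policy.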

