Abstract
Computing a document description is the process of extracting concepts from digital files. This process implies solving several problems: analyzing the incoming characters, selecting the “correct” words, replacing some words by a more generic form, etc. This article describes how the document descriptions are build within the GALILEI framework regarding its
tensor space model.
An electronic document corpus is a set of digital files, each file being formatted with a given standard (PDF, HTML, XML, MS-Word, etc.). Most of these standards don’t aimed to propose a machine-understandable form of the knowledge encapsulates in the corresponding file. In fact, XML is the only widely used standard to provide some semantic information. When associated with the XML technologies (in particular the XML schemas), XML allows a computer program to do some reasoning on the content embodied in XML documents by using first-order logic. Beyond the structural limitations of first-order logic, the XML technologies don’t propose a direct solution to several information science related problems such as user’s interests profiling or document clustering.
The GALILEI project proposes a framework for these latest problems by providing an usable object modeling: the
tensor space model. The basic idea is to describe a document, a profile of interests, a topic, etc., as a tensor composed from multiple concept vectors, each vector being associated to a particular meta-concept (for example the textual body of the document or a particular metadata).
1 Document Analysis
The process of building a document description consists of analyzing a file formatted with a given standard in order to build a machine-understandable description of it (a tensor in the GALILEI framework). This process supposes two tasks:
File Extraction Starting from a particular document, the useful information to build the description must be identified. This implies at least that some textual streams are extracted. To associate them with multiple meta-concepts (such as metadata), it is necessary to understand the structural rules of that particular document.
Document Building Once some textual streams are available, eventually associated with a specific meta-concept, some treatments on them identify the “concepts”.
As I will detail it below, the Figure
1↑ shows that the description building in the GALILEI framework is divided into the following steps:
-
The lexical analysis of textual streams.
-
The treatment of the stopwords.
-
The selection of the words through thesauri (controlled vocabularies).
-
The stemming of the remaining words.
-
The selection of the concepts used to build the description.
The process may also construct an index for each document in order to provide a search-optimized description form. While the description is optimized to find all the concepts associated to a given document, the latest is optimized to search all the documents containing a given concept.
2 File Extraction
The role of the file extraction is not only to build textual streams that will be treated by the description building task, but also to provide some information on how to interpret a particular text stream (simple content, metadata, (sub)title, etc.).
2.1 Semantic Information
As explained earlier, most file standards don’t provide semantic information related to their content. At best, they propose to associate some metadata with a particular file (mostly the author, a description and some keywords). For example, the HTML format provides the <meta> tag to include some metadata to a Web page. Similarly, word processor formats (such as MS-Word or OpenDocument) associate several properties to a document, some of them being metadata. When analyzing such file formats, it is important to identify such metadata to enrich the document description. For the rest, these formats are composed from characters combined with structural rules without any semantic (to specify chapter titles, paragraphs, blocks in italic, etc.). In practice, they can be seen as one single textual stream to process.
For XML documents, the situation is more complex because the standard was defined to enrich the textual content with semantic rules reified as tags. Several remarks can be made:
-
the XML technologies are based on XML schemas proposing semantic models composed from semantic concepts (the tags);
-
the combination of some tags and their content provides useful information to represent XML documents (for example the well-known “metadata tags” of the Dublin Core MetaData Initiative (DCMI) [5]);
-
all the tags aren’t needed to catch the main semantic information contained in a XML document, since some of them serve structuring purposes only (they regroup several child tags related to a same kind of information) or give information such as the language used.
Once it is know that some semantic information can be extracted from a particular document, the next question is how to describe it with the
tensor space model. I propose the following simple ideas:
Semantic model It makes sense to describe a document in terms of the semantic models and concepts used to organize it. In fact, we may consider that each semantic model (such as a XML schema) is a meta type (the URI of the XML schema) and a meta-concept of that particular type (again, the URI of the XML schema). The meta-concept is then associated to a vector in the corresponding description whose elements define the discriminant importance of its semantic concepts (the tags in the case of XML schemas). A very simple method for weighting a given semantic concept,
, corresponding to a meta-concept,
, for a document,
is to count the number of occurrences of the corresponding tag.
Metadata Let us first remember that a metadata is a combination of a certain type (authors, title, keywords, etc.) and some content. It seems natural to consider that each metadata type is a meta-concept to which a vector representing its content is associated in the corresponding description. In practice, a metadata is therefore treated as normal text but associated to a meta-concept of the category “metadata” rather than “textual content”.
Finally, some XML tags may contain information not relevant from a knowledge point of view. Let us consider the following XML example:
<bibliography>
</bibliography>
If it is clear that the <author> and <title> tags provide useful information, the status of the <editor> tag is less clear. It makes therefore sense to define a list of tags which content should not be treated as textual content.
2.2 XML Metadata Detection
As explained above, not all the XML tags provide useful information for the corresponding document description. Ideally, the tags that correspond to some kind of metadata should be explicitly referenced as such. But in practice it isn’t always the case. I propose therefore a simple method to automatically detect which tags must be considered as “metadata tags”if this information is not available.
I suppose that, most of the time, such tags have small content (typically a few terms). I also consider that generic information is “on top” of the documents, which means that “metadata tags” are probably high in XML document hierarchies. Finally, I suppose that they don’t constitute the major part of the tags of an XML document and that they have no child tags.
Based on these assumptions, I propose the heuristic illustrated at Figure
2↑ to guess which tags are “metadata tags”. It consists in the verification of four constraints:
-
A “metadata tag” is “high” in the XML hierarchy (for example it has a depth of maximum 3 starting with depth 0 for the root node).
-
A “metadata tag” doesn’t appear too often in the whole document. In practice, the heuristic may limit the number of occurrences to an absolute value (for example less than 5 occurrences) or to a relative value (less than 5% of the total number of different tags in the document).
-
A “metadata tag” has no child tags.
-
A “metadata tag” contains a maximal number of terms (for example 20).
The reader should notice that a same tag may be considered as “metadata” or not inside the same document. In fact, except the second constraint, all the others are “local” to a particular occurrence. I made this choice to avoid to be too restrictive (all the occurrences must fulfill the criteria to be considered as “metadata tags”) or too permissive (if one occurrence fulfill the criteria, all the occurrences are considered as “metadata tags”).
2.3 Document Divisions Detection
Unlike semantic information, most file standards provide some structural and organizational information. In practice, they allow to specify a “style” to a block of text that fixes how it is presented, with several “styles” providing an implicit division information about the document (for example “title1”, “title2”, “paragraph”, etc.). HTML and of the word processor formats are such standards. Here is a HTML example:
<h1>Bourgeois and Proletarians</h1>
<p>
</p>
Other standards propose an explicit division of the document. Here is an example corresponding to the DocBook standard
[26]:
<chapter>
</chapter>
While these two cases seem at first glance similar, there is a difference: the DocBook standard delimits clearly the different parts of a document, while with the HTML standard we can only make suppositions (for example that the <p> tags following a <h1> tag correspond to the text of a part or chapter of a book). Table
1↓ shows how the tag depths differ from one file format to the other when the document structure is represented as a tree: with the DocBook standard the depths correspond to the division levels of the document, while for the HTML standard all the tags have the same depth.
|
|
DocBook
|
HTML
|
|
|
Tag
|
Depth
|
Tag
|
Depth
|
|
chapter
|
<chapter>
|
1
|
-
|
-
|
|
chapter title
|
<title>
|
2
|
<h1>
|
1
|
|
paragraph
|
<para>
|
2
|
<p>
|
1
|
Table 1 Depth differences: A first look.
So, to recognize different document divisions (which may be useful, for example to retrieve only a part of a document corresponding to a query), for some file formats (such as HTML), a heuristic is needed. The simplest one is to associate to each “division style” a given depth, and to suppose that each style applied marks the transition from one part to another. In the case of the HTML standard the array of styles is:
=[{<html>,0},{<body>,1},{<h1>,2},{<h2>,3},{<h3>,4},{<h4>,5},{<h5>,6},{<h6>,7},{<p>,8}]. But there is a still a problem. Let us take another HTML example to explain it:
<h1>Title 1</h1>
<p>This a first paragraph</p>
<h3>Title 1.1</h3>
<p>This is a second paragraph.</p>
Table
2↓ compares the depths associated with the different tags with the logical division levels that they represent in this particular example.
|
|
Depth
|
Logical Level
|
|
<h1>
|
1
|
1
|
|
<p>
|
8
|
2
|
|
<h3>
|
3
|
2
|
|
<p>
|
8
|
3
|
Table 2 Depth differences: A second look.
It is therefore necessary to deploy a solution a little more complex. Algorithm
1↓ shows such an heuristic. It uses two structures: a first one,
, represents a “division style” (a name and a corresponding depth) and a second one,
, correspond to a logical division (the associated style and the logical level).
represents a last-in last-out stack (where
get element on the top).
Require:
while (
):
if
:
while(
and
):
.Pop()
if(
):
.Push({
,0})
else:
.Push({
,
})
Algorithm 1 Document Divisions Detection.
2.4 A complete example
To illustrate the different points developed above, I propose an example. Since the XML standard offers the richest information, it will be used. Let us suppose the following XML document:
<?xml version="1.0"?>
<doc xmlns="http://mytags.org/" xmlns:dc="/">
</doc>
We may identify different elements:
-
The tags <dc:title> and <dc:creator> are defined as a metadata by the Dublin Core MetaData Initiative XML schema.
-
The tags <part> and <para> propose textual contents.
-
We may suppose that the <support> tag doesn’t provide any interesting information (at least regarding the knowledge embodied in the document).
|
Concept Category
|
Concept Type
|
Meta-concept
|
Content
|
|
Semantic
|
|
|
title creator
|
|
Semantic
|
|
|
doc part para support
|
|
Metadata
|
|
title
|
Document Title
|
|
Metadata
|
|
creator
|
|
John Cleese
|
|
Michael Palin
|
|
|
Text
|
content
|
body
|
|
Monty Python’s Flying Circus.
|
|
And now, something completely different!
|
|
Table 3 Example of file extraction.
Table
3↑ shows the result of a file extraction (
represents the URI of a XML schema). Five vectors should be created for the different elements of the document:
-
A vector for the XML schema of the Dublin Core MetaData Initiative.
-
A vector for the XML schema of the “mytags” XML application.
-
A vector for the metadata “title”.
-
A vector for the metadata “creator”.
-
A vector for the textual content of the document.
3 Lexical Analysis
The role of the lexical analysis is to find the sequences of characters that can be considered as valid tokens.
3.1 Token Categories
The simplest method to identify a token is to consider it as a sequence of characters delimited by spaces. But, in practice, it is somewhat more complicated.
Fox considers four choices for the establishment of tokens
[8]:
Digits Most numbers do not offer a semantic interest for the content of documents, so digits are normally not included in the tokens. However, sequences containing number may have a meaning as in “ISO9002”. One easy solution consists of accepting sequences containing numbers only if the first character is a letter. But “510B.C” may be an important expression too, so another solution is to remove all the sequences containing numbers unless they match some regular expressions representing specific sequences.
Hyphens When a sequence contains one or more hyphens a decision must be made. The first approach is to break the word into its hyphenated terms, the strings “state-of-the-art” and “state of the art” will be considered as identical for example, but, strings such as “Jean-Claude” or “B-52” have a meaning themselves, and in these cases it is better to consider them as an indivisible sequence. Also, hyphens are used to mark a single word broken into syllables at the end of a line. The best solution is to adopt a general rule and to specify a list of exceptions on a case by case basis.
Other punctuation Some punctuation marks are often used as parts of tokens. A dot is used in most file names as in “index.html” or may appear in names like “X.25”. Emails are also formed with punctuation characters, as in “pascal.francq@world.earth” for example. Generally, the punctuation is removed during the lexical analysis, i.e. “TCP/IP” is transformed into ”TCPIP”. However, the best solution is again to adopt a general rule and to specify the exceptions.
Case The cases of the letters mostly do not matter, so the lexical analysis converts the whole text into either the lower or the upper case. But some semantics might be lost during this conversion, the words “Bank” and “bank” can have two different meanings for example.
It is well known that implementing such rules is not difficult as such, but the cost in computing time of this step is directly dependent on the number of rules
[8]. That is why many systems avoid these text operations. Although no studies have been undertaken to evaluate the cost of these operations in information retrieval systems, it was shown that lexical analysis counts for more than 50 percent of the computational expense of a document analysis
[25].
3.2 Computational Linguistics
In most situations, a token correspond to a “normal” word (for example “connect”), a abbreviation or a name (such as “TCP/IP”). Certain character sequences, such as email addresses, are also considered as tokens. The computational linguistics community has proposed several enhancements, mostly based on natural language processing, to build expressions that express more complex concepts (for example “genetic algorithm”) or distinguish different significations of a same word. Thereafter, some examples of enhancements.
3.2.1 Syntactic and Semantic Labeling
The aim of the syntactic labeling is to associate to each word a syntactic context. For example, the French sentence “le système fonctionne correctement” can be analyzed as shown in Table
4↓.
|
le
|
article
|
singular masculine
|
|
système
|
noun
|
singular masculine
|
|
fonctionne
|
verb
|
singular third person
|
|
correctement
|
adjective
|
singular masculine
|
Table 4 Example of syntactic analysis.
To determine this association, most methods look at a few words at the left or at the right of the word to be analyzed. Some methods are based on a set of rules
[4], others on hidden
Markow models
[27].
Moreover, sometimes a same word has different meaning depending on a particular context, such as the word “bank” in the sentences “I am going to the English bank” and “I am fishing at the bank of the river”. It can therefore be interesting to also associate a semantic label to the different words
[29].
3.2.2 Syntagm Variant Recognition
It is possible to construct groups of words forming a same concept called
syntagms. When these syntagms are used for document indexing, some methods were developed to identify if a variation of it appears in a given document
[14, 13].
For example, if a syntagm was identified as “recording of an album”, these methods can find variants in documents such as “record an album” and “recording a live album”.
3.2.3 Anaphora Resolution
The resolution of anaphora consists to find relations between an entity and its references in a text
[12]. Some examples of anaphora are the texts:
-
The rock band has recorded a new album. It plans to make a new world tour.
-
Different stemming algorithms for words exist. Implementations of these algorithms can be found on the Web.
-
The formula one has win the race. This car is really the best in the paddock.
The resolution of anaphora can be very helpful in information extraction since it maintains a coherence among different parts inside a same text
[1].
3.3 Tokenizing Process
The tokenizing process works in two steps: it identifies first the character sequences that form textual units, and secondly it verifies if these units respect some constraints before considering them as valid token. The first step identifies the textual units following some rules:
-
A textual unit cannot start with a space or a punctuation character (defined by the classical C ispunct function).
-
A sequence of punctuation characters followed by a space cannot end a textual unit. So, “pascal.francq” is a textual unit while “pascal. francq” are two textual units (“pascal” and “francq”).
-
A textual unit always ends with a space.
Figure
3↓ shows that this first step works like a finite-state machine (the states are in normal text, the character types in italic).
One a textual unit is identified, it is converted to its lowercase equivalent. To decide if it is a token, two possibilities exist:
-
A textual unit is considered as a token if it is made up of letters only. This option ensures that tokens are definitively words. It has also the advantage to limit the “sizes” of the descriptions (thus less memory used and a reduced computational speed).
-
Some textual units containing non-alphabetical characters may pick up structured information like norm designations or email addresses. To eliminate those who are not useful, a textual unit must respect two constraints:
-
A given character cannot occur consecutively more than a given number of times (for example 3). This rule avoids to consider as token the textual unit “loool”.
-
The textual unit must contained a minimum percentage of letters (for example 30%). This rule eliminates the textual unit “B-1070” (16,6%) but not “pascal.francq@world.earth” (88%).
-
A given token cannot contain more than a maximum number of non-letter characters (for example 5). This rule recognizes as token the textual unit “iso6002” but not the token “0123456789a”
Currently, the tokenizer doesn’t include any computational linguistic methods. However, the integration of such methods is a goal for the future. A solution actually studied is to build a tokenizer around the
Unitex open source project.
4 Stopword Treatment
The stopwords are the most frequently occurring words of a given language which have a poor semantic content in a given context. For example:
English the, then, or, and, a, an, …
French le, la, or, non, car, …
4.1 Stopword Removal
It was recognized early on that many of the stopwords are not characteristic of the document contents
[17]. In fact, using such terms in search engines queries retrieves almost all of the documents because the discriminatory value of these stopwords is low
[20, 21]. Moreover, if all the stopwords are eliminated, it decreases the number of processed words by 20 or by 30 percent of the total number of words in English-language documents
[10].
The stopwords may be words other than articles, prepositions and conjunctions such as some verbs, adverbs or adjectives. For example, in the
Brown collection a list of 450 stopwords was defined and applied successfully
[10]. The stopwords may also be specific to a given context. For example, the word “poem” has probably no discriminatory value in a collection of documents on poetry and can therefore be eliminated. But such an approach is unusable for general collections.
Of course, the elimination of stopwords destroys some important information like “to be or not to be” in a document on Shakespeare. This is the reason why some systems use a full text index, i.e. all the words in the document are indexed. However, in the GALILEI project, they are removed (but this option can be disabled).
4.2 Document Language Detection
But the stopwords have another usage: they can determine the language of a document (this information is needed to apply a stemming algorithm as detailed in section
6↓). If some document formats indicate the language used, this information is not always accessible. It is therefore often necessary to guess the language by itself.
Let us define for each language a list,
, of stopwords,
. Moreover, if we define
as the number of occurrences of a stopword,
, in a document,
. We can build for each language a set of stopwords,
that appear in
:
Moreover, let us define
as the ratio between the number of different stopwords in a language appearing in the document,
, and the total number of different words (tokens containing only letters),
:
Since the stopwords represent the most frequently used words in a every language, their distribution in a particular document gives a clue about the language used. In fact, two conditions are used to assign a language to a document:
-
The document must contain a reasonable number of different stopwords. If a document contains only the stopwords “no” and “yes” for example, it is probably not an English-language document. This can be expresses with the
ratio:
where
is a given threshold
-
Since the stopwords constitute a large amount of the words found in a document, we may suppose that the correct language is the one having the most stopwords in the document:
where
represents the set of all languages respecting the first condition for document
.
When a document is updated, i.e. its content is modified, a new analysis is necessary. Most of the time, the modifications to the content does not imply a change in the language, i.e. this step is not necessary. On the other hand, there are no methods to check this assertion other than to determine the language again. So, a decision is to decide whether the language has to be determined each time that a document is analyzed, or whether the languages are considered to be static, i.e. a document keeps its language once it is known.
Table
5↓ shows the average, the standard deviation, the minimum and the maximum of the value of ratio
for documents in different languages with respect to English and French. As can be seen, the difference between the average ratio for the correct language and for the other one is significant. In practice, a good value for
is
for both languages.
|
|
French documents
|
English Documents
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 5 Statistics on the ratio
.
Remark: The XML specification defines an attribute, xml:language, which can be used to specify the language of a portion of a document, i.e. XML documents may contain different parts in different languages. In fact, every document may contain parts written in different languages. Actually, it is here supposed that only one language is associated with each document.
5 Thesauri
Unlike stopwords, some words have a high semantic content in a given context. In a corpus of newspaper articles for example, some names (such as politicians) are more discriminant than “normal” words. A thesaurus is a set of words with a high discriminatory value in a fixed domain, and for each word there is a list of related words which are commonly derived from synonymy relationships. Thesauri are used in information retrieval for different reasons
[7], such as:
-
to provide a standard vocabulary for a given system for searching and indexing;
-
to assist users to find the correct formulation for their interest by extending their queries to include related terms;
-
to provide a classification of words and to introduce a notion of neighborhood between them.
In general, the structure of a thesaurus is much more than a simple list of nouns and their synonyms. Sometimes a word has a different meanings in different contexts. For example, the meaning of the word
genetic is not identical in the context
medical specialist in genetics and in the context
genetic algorithms. This is the reason why some thesaurus add to the definition of a word in a thesaurus a contextualized explanation, such as
genetics (specialist, medical) and
genetic (algorithms) [24].
A popular thesaurus is WordNet
[6]. It is an on-line lexical reference system. English nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. Different relations link these synonym sets. A result of an on-line query on the word “genetic” is given below.
The adjective "genetic" has 4 senses in the online WordNet in 2012.
-
-
(adj) familial, genetic, hereditary, inherited, transmitted, transmissible (occurring among members of a family usually by heredity) "an inherited disease"; "familial traits"; "genetically transmitted features"
-
(adj) genic, genetic, genetical (of or relating to or produced by or being a gene) "genic combinations"; "genetic code"
-
(adj) genetic (pertaining to or referring to origin) "genetic history reconstructs the origins of a literary work"
-
(adj) genetic, genetical (of or relating to the science of genetics) "genetic research"
The user of WordNet can then search for synonyms and related nouns. An important aspect of WordNet is the Global WordNet Association whose goal is to coordinate the extension of WordNet to other languages and to develop a standardized index of sense-distinctions or so-called Inter-Lingual-Index that can be used as a universal and cross-lingual standard for sense differentiation.
More recently, a lot of ongoing research tries to use the free online encyclopedia
Wikipedia as thesaurus. One of the simplest idea is to suppose that the
Wikipedia page titles form a set of concepts, while their links implement relationships between them. Using
Wikipedia has multiple advantages:
-
it is a freely accessible online to anyone;
-
it contains millions of concepts (articles) in different languages with links between them;
-
it is updated with a high frequency.
In practice, a thesaurus acts as a filter for the tokens: all the tokens that don’t correspond to an entry in the thesaurus are removed. Moreover, it also possible to replace a set of words by an unique one using synonym relationships for example (and adding the corresponding occurrences).
Actually, the GALILEI project doesn’t provide any thesaurus, but it is planned to develop something around the Wikipedia encyclopedia in a near future.
6 Stemming
When analyzing the list of words contained in a document, it immediately appears that many variants of the same word occur in the document, through plurals or past tense suffixes for example. To avoid processing all these variants as different tokens, a solution is to replace all these variants by their stems. A stem is the portion of a word which is left after removing its prefixes and suffixes. An example is “connect” which is the stem for “connection”, “connections”, “connected” and “connecting”. In practice, different tokens pointing to the same stem are regroup in a single one (and the corresponding occurrences are added). Another advantage is therefore that the number of indexed tokens decreases as a result of this operation.
To implement this stemming, one type of algorithms removes the suffix and/or the prefix of a word following a set of rules to give a stem, with only the first rule applicable being used. An example of such a rule is given by
Harman with the suffix removal plurals
[11]:
If a word ends in “ies” but not “eies” or “aies”,
replace “ies” by “y”.
If a word ends in “es” but not “aes”, “oes”, or “aes”,
replace “es” by “e”.
If a word ends in “s” but not “us” or “ss”,
remove “s”.
Several different stemming algorithms have been compared to evaluate the advantage of the use of stemming algorithms
[9]. Most of the time, general stemming has either a positive or no effect at all on retrieval performance if the process is measured in terms of quality. But since there are still conflicting studies, many Web search engines do not included these algorithms.
This approach is, of course, language dependent, i.e. a specific algorithm must be developed for each language. Moreover, it is impossible to determine the set of rules necessary to handle all the possibilities of a language. This means that some imperfections cannot be avoid, either that a common stem between several words isn’t found, or different words point wrongfully to a same stem. Several stemmers exist. For English,
Porter’s algorithm is the most widely used due to its compactness
[19]. The GALILEI team developed the
Carry algorithm for French
[18], and a stemmer for Arabic. Moreover, the
Snowball stemmers and resources page proposes several stemmers for multiple languages which are available as open source libraries in C and Java (and integrated in the GALILEI project).
7 Concept Selection
It is evident that, when the number of different concepts increases, the amount of memory and the computational time needed by the algorithms also increase. It can therefore be interesting to introduce a step involving the reduction of dimensionality. The idea is to transform the initial set of concepts,
, in another space,
, called a reduced term set, where
.
7.1 Concept Filtering
A first category of concept selection methods proposes to filter the concepts (mostly words) that can be used to describe documents. The idea is to define some criteria to avoid that certain concepts are used for the document description.
One of these is the identification of “nominal groups”
[3]. The main idea is to suppose that most of the semantic content of a document is represented by nouns. Only nouns are therefore used as concepts, i.e. verbs, adjectives, adverbs, articles and pronouns are omitted. The next step is based on syntactic distance, which is the distance between two tokens, and consists of using nominal groups. A nominal group is a set of tokens whose syntactic distance in a document does not exceed a given threshold, for instance 3. Finally, the document is just described by the nominal groups it contains.
Another possibility is to use only the concepts that appear in the greatest number of documents. Some experiments shown that it is possible to reduce by a factor 10 the number of concepts without any loss in effectiveness, and that a reduction by 100 implies only a small loss
[30]. This may be surprising since the [concept weighting method
↓] is selected to allocate more importance to the concepts appearing many times in a small subset of the collection. But it is well-know that the great majority of concepts appearing in collections have a very low frequency
[22].
In the field of text categorization, other methods for concept filtering were developed. They are based on an existing training document set and the occurrences of the concepts in these documents. One of these methods is to remove all the concepts appearing in at most
training documents
[16], where the most used values are
. Another method is to remove all the concepts that appear at most
times in the training set
[2], where the values used are usually
. The fact that the filtering criteria are based on a training set should lead to more “intelligent” filtering. Furthermore, the filtering criteria are independent of a particular document, which means that these criteria do not change for each new document, and that it is therefore not necessary to recompute everything.
Ideally, these latest techniques require training sets and, for some methods, when new types of documents appear in a collection, it is necessary to recompute the filtering parameters and re-index the whole corpus. Yet, the idea that a concept should appear often in documents in order to propose a useful description can be easily implemented: it suffices to remove the tokens that have a number of occurrences below a given threshold (for example 2). It is also possible to remove all the “small tokens” (for example with one or two characters only).
7.2 Concept Space Transformation
Starting from the original set of concepts, some methods try to build up a reduced set of “synthetic” concepts that better describe the document content. It is important to understand that these artificial terms do not correspond to “normal concepts” (such as words), but are more related to topics contained in document collections. In fact, it is a transformation of the concept space: the concepts extracted from documents are replaced by new ones.
The first category of such methods is
concept clustering. The methods try to group together concepts with a high degree of semantic similarity. Once these groups have been computed, the methods use the groups, or a representation of them, to describe document content. There are many techniques that can be used to construct these groups. For example, in the context of the GALILEI project, we develop a method that regroups concepts by combining their co-occurrences, their confidences and their syntactic distances
[15]. Some experiments show that it performs a little bit better than using textual concepts only.
Another category is
latent semantic indexing. The technique developed for the vector space model propose to compute new description vectors obtained from the original ones with a
singular value decomposition of the concept-document matrix. It is necessary to fix the number of dimensions that should be used to construct these new vectors. In practice, 200 dimensions are generally used. This method has been tested on text categorization and gives better results than with textual concept selection methods
[28]. Other experiments show that latent semantic indexing is more effective for text categorization than concept filtering
[23]. A disadvantage of latent semantic indexing is that it is difficult to interpret the “synthetic” concepts in the new space.
The space transformation methods share a drawback with some concept filtering ones: they are expensive regarding the computational cost. In fact, a first analysis of the whole corpus is needed and each modification in the corpus may imply a complete re-computation (this is particularly the case with the latent semantic approach). One solution could be to make this first analysis on a training document set if it exists, but due this disadvantage, the GALILEI project doesn’t provide actually such methods.
8 Implementation
The whole document analysis process described in the previous sections can be resume in three task categories:
-
Extract textual streams associated with semantic and division information from the document to analyze.
-
Extract textual tokens from the textual streams.
-
Analyze the textual tokens to decide which ones must be removed (for example stopwords), replaced (such as stems replacing words) or added (during a concept space transformation for example), i.e. choose the textual tokens that will become concepts describing the document.
In the GALILEI platform, it was chosen that these tasks are executed by plug-ins. The whole process is controlled by the
GDocAnalyze class as shown at Figure
4↓. It maintains a list of tokens (represented by the GTextualToken class) and, for each token, the corresponding concept and its occurrences (the position in the file, the depth in the document structure and the vector (meta-concept) associated with a particular occurrence).
When a document should be analyzed, its URI is passed to the
GDocAnalyze class. If the URI refers to an online document, it is downloaded in a temporary file. The document to analyze is therefore always available as a local file. The task categories involve the following classes:
-
The GFilter class provides a generic plug-in that extracts textual streams from a local file. In practice, each plug-in provides a filter that can handle a set of MIME types (for example “text/email” or “application/msword”). The GDocAnalyze class searches the filter corresponding to the MIME type of the document to analyze, and passes it the local file name (if no filter is found, the document cannot be analyzed). The GDocAnalyze class provides several methods that can be used by a filter, the most important one being:
ExtractBody Extract some textual content and associate it to the default textual content corresponding to the meta-concept “body”.
ExtractDMCI Extract some textual content and associate it a metadata defined by the Dublin Core MetaData Initiative (DCMI)
[5].
ExtractTextual Extract some textual content and associate it to a meta-concept to specify.
AddConcept Associate a concept to a meta-concept to specify.
-
The GTokenizer provides a generic tokenizer plug-in that extracts tokens from textual streams. The GDocAnalyze class passes the textual contents to the current tokenizer plug-in which communicates the tokens extracted trough the method GDocAnalyze::AddToken (if not tokenizer is selected, a document cannot be analyzed).
-
The GAnalyzer provides a generic plug-in to analyzes the textual tokens. In practice, The GDocAnalyze class calls all the active analyzers in a specific order (for example to ensure that the plug-in dealing with stopwords and determining the document language is called before the one that does the stemming which is language-dependent). At least one analyzer should associate to the tokens a concept through the GTextualToken::SetConcept method. This can be done with the simple following code:
GConceptCat* TextCat(Session->GetInsertConceptCat("Text"));
GConceptType* TermsSpace(Session->GetInsertConceptType(TextCat,"Terms","Terms"));
GConcept* Concept(TermsSpace->GetInsertConcept(Token->GetToken()));
Token->SetConcept(Concept);
Once the analysis is done, the
GDocAnalyze class builds the
document tensor and updates the document index with all the tokens associated with a given concept, the corresponding weights in each vector being their number of occurrences.
Several subversion repositories provide plug-ins for the different tasks:
filters The repository proposes GFilter plug-ins for different document format types (emails, MS-DOC, PDF, HTML, XML, etc.).
langs The repository offers a list of plug-ins for the different languages, each plug-in providing a stemming algorithm and a list of stopwords (English, French, Dutch, German, Italian, Arabic, etc.).
textanalyze The repository provides plug-ins that implements the different steps described above: the lexical analysis, the stopword treatment, the stemming and the concept filtering (which also associate tokens to concept).
References
[1] Approaches to Discourse Anaphora: Proceedings of the DAARC Colloquium (Botley, Simon Philip, ed.). 1996.
[2] L. D. Baker, A. K. McCallum: “Distributional clustering of words for text classification”, , pp. 63—103, 1998.
[3] J. Broglio, J. P. Callan, W. B. Croft, D. W. Nachbar: “Document retrieval and routing using the INQUERY system”, Overview of the Third Retrieval Conference (TREC-3), pp. 29—38, 1995.
[4] Jean-Pierre Chanod, Pasi Tapanainen: “Statistical and constraint-based taggers for French”, , 1995.
[5] DCMI: Dublin Core Metadata Element Set, Version 1.1: Reference Description. 1999. URL: http://dublincore.org/documents/dces.
[6] Christiane Fellbaum: WordNet in Theory and Applications of Ontology: Computer Applications (Poli, Roberto and Healy, Michael and Kameas, Achilles, ed.). Springer Netherlands, 2010. URL http://www.springerlink.com/content/n2516j53k5p26x76/.
[7] D. J. Foskett: Thesaurus in Readings in Information Retrieval (Jones, K. Sparck and Willet, P., ed.). Morgan Kaufmann Publishers, Inc., 1997.
[8] C. Fox: Lexical Analysis and Stoplists in Information Retrieval: Data Structures & Algorithms (Frakes, William B. and Baeza-Yates, Ricardo, ed.). Prentice Hall, 1992.
[9] Willian B. Frakes: Stemming Algorithms in Information Retrieval: Data Structures & Algorithms (Frakes, William B. and Baeza-Yates, Ricardo, ed.). Prentice-Hall, 1992.
[10] W. Francis, H. Kucera: Frequency Analysis of English Usage. Houghton Mifflin, 1982.
[11] D. Harman: “How Effective is Suffixing?”, Journal of the American Society for Information Science, pp. 7—15, 1991.
[12] Graeme Hirst: Anaphora in Natural Language Understanding: A Survey. Springer-Verlag, 1981.
[13] Christian Jacquemin: “What is the tree that we see through the window: A linguistic approach to windowing and term variation”, Information Processing & Management, pp. 445—458, 1996.
[14] Karen Sparck Jones, John I. Tait: “Automatic search term variant generation”, Journal of Documentation, pp. 50—66, 1984.
[15] Nicolas Kumps, Pascal Francq, Alain Delchambre: “Création d'un espace conceptuel par analyse de données contextuelles”, Proceedings of the JADT 2004, pp. 1185—1194, 2004.
[16] Y. H. Li, A. K. Jain: “Classification of text documents”, Computer Journal, pp. 537—546, 1998.
[17] H. P. Luhn: “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”, IBM Journal of Research and Development, pp. 309—317, 1957.
[18] Marjorie Paternostre, Pascal Francq, Marco Saerens, Julien Lamoral, David Wartel: Carry, un algorithme de désuffixation pour le français, 2002. URL: http://www.galilei.ulb.ac.be.
[19] M. F. Porter: “An Algorithm for Suffix Stripping”, Program, pp. 130—137, 1980.
[20] C. J. van Rijsbergen: Information Retrieval. Butterworths, 1979.
[21] G. Salton, M. McGill: Modern Information Retrieval. McGraw-Hill Book Co., 1983.
[22] G. Salton, A. Wong, C. Yang: “A vector space model for automatic indexing”, Communications ACM, pp. 613—620, 1975.
[23] H. Schütze, D. A. Hull, J. O. Pedersen: “A comparison of classifiers and document representations for the routing problem”, , pp. 229—237, 1995.
[24] Dagobert Soergel: Indexing Languages and Thesauri: Construnction and Maintenance. Melville Publishing Co., 1974.
[25] W. Wait: “The Cost of Lexical Analysis”, Software Practice and Experience, pp. 473—488, 1986.
[26] Norman Walsh, Leonard Muellner: Docbook 5.0: The definitive guide. O'Reilly, 2008.
[27] Ralph Weischedel, Marie Meeter, Richard Schwartz, Lance Ramshaw, Jeff Palmucci: “Coping with ambiguity and unknown words through probabilistic models”, Computational Linguistics, pp. 359—382, 1993. Special Issue on Using Large Corpora: II.
[28] E. D. Wiener, J. O. Pedersen, A. S. Weigerd: “A neural network approach to topic spotting”, , pp. 317—332, 1995.
[29] Yorick Wilks, Brian M. Slator, Louise Guthrie: Electronic Words — Dictionaries, Computers and Meanings. MIT Press, 1996.
[30] Y. Yang, J. O. Pedersen: “A comparative study of feature selection in text categorization”, , pp. 412—420, 1997.