Tensor Space Model

Pascal Francq

January 14, 2012 (April 20, 2011)

Abstract

In information science related problems (such as document clustering), the objects are generally described with the vector space model. This is especially true for poorly semantically structured documents (for example PDF). This approach doesn’t take other types of information into account (for example metadata or the structure of XML documents). In this article, a tensor space model is presented. It is a simple adaptation of the extended vector model.

1 Introduction

Despite the fact that most documents available today are poorly semantically structured, solutions were developed to overcome this. For years, document management systems add metadata to the documents stored, and several document types propose them in some way (headers of emails, <META> tag in HTML or the “properties” of office documents). More recently, the W3C introduces the XML as a set of technologies to structure documents.
Today, more and more documents in intranets are XML ones, and their number increases on the Internet too. Despite the numerous existing XML databases, there is a lack of tools using the structure to solve several classical problems such as keyword-based queries for information retrieval or clustering of XML documents. Moreover, it is still difficult to manage collections (such as the Internet) where XML documents are mixed with non-XML documents (such as Postscript, PDF, etc.).

1.1 A First Example : emails

A simple email is a text document divided into two parts (generally separated by a blank line): the header and the content. The following text proposed a simple example of an email:
From: pascal@source.org (Pascal Francq)
Subject: Example of an email
Article-I.D.: src.17194 
Date: Thu, 20 Jul 2006 10:08:50 +0200
Distribution: Paul Otlet Institute

Do you buy the remaster version of the In Rock album from Deep Purple?
Pascal
The header contains several fields such as the author (From), the subject or the date when the email was send. This fields are non only structured metadata that enhances the content of the email, but it provides also information that links emails (for example having the same author or sharing a same subject [A]  [A] To detect that, it is often necessary to ignore leading “Re:” in the subject field.) .

1.2 A Second Example : XML Documents

XML documents are characterized by text-oriented content structured with tags and attributes [22]. The following text proposed a simple example of an XML document:
<?xml version="1.0" standalone="no"?>
<discography xmlns="http://music.org/">
<group groupID="Deep Purple">
   <groupName> Deep Purple </groupName>
   <album>
      <albumName> In Rock </albumName> 
      <published> 1969 </published> 
   </album>
</group>
</discography>
An XML structure is a tree containing four different types of tokens:
tags Elements structuring the document.
attributes Elements related to tags and influencing the structure.
attributes values Most attributes are associated with a given value (for example groupID=”Deep Purple”).
content The textual content corresponds to the “knowledge” encapsulated into XML documents.
In an XML document, knowledge expressed textually is associated to tags and attributes that are supposed to add semantic information to the document. The structures of XML documents (the tree of tags, attributes and attributes values) are defined through XML schema. Every XML schema can be build upon existing XML schema. In the emerging semantic Web [4], the documents depend from application-specific XML schema (for example to represent a technical report) build on top of generic XML schema (for example to represent a person or the abstract of a document). In this context, XML documents with similar tags and attributes (using elements of common XML schema) are supposed to be related to “something common” (from the content or the functional point of view). For example, in a corpus of XML documents dedicated to clinical tests, similar XML parts (tags and content) may appear in different documents related to the same authors (using tags from the Dublin Core Metadata [9]) or the same chemical molecules (using tags from the Chemical Markup Language [18]).

1.3 Vector Space Model

Historically, in information science, the documents were represented as bags of words [19]. Several filtering techniques are used to carefully choose the sequences of characters indexing the documents in order to limit the amount of memory needed to represent the documents and to speed up the treatment.
The vector space model was introduced to represent documents in search engines, but it can be used to represent any kind of objects. In this model, the objects (for example documents) are represented as a set of features, . For each object, , a vector of features is defined and a weight is associated with each feature ( being the total number of different features). Initially, the model supposes that for documents , but negative weights may be allowed to represent other objects [13]. Several methods exist to compute these weights based on quantity, , related to the feature for the object [3]. For textual documents, the features are selected terms contained in the documents, and is the number of occurrences of these terms.
Some models proposed to replace the different words by their stems (“connects”, “connected”, etc. are replaced by “connect”) and to remove the stopwords (“and”, “or”, “the”, etc. in English) [16, 12, 11].

1.4 Our Approach

As the examples of emails and XML documents shown, new types of information appear in documents. In particular, the metadata and the semantic information provided by XML structures should certainly contribute to the representations of the documents. Therefore, a major issue in information science is to propose a model to represent structured documents in general, XML documents in particular.
In practice, since most collections do not content structure documents only, a main constraint for the model is to manage documents with different levels of semantic information. Moreover, numerous algorithms, from information retrieval [2] to social browsing [14], were developed for non-structured textual documents. Ideally, the new model representing documents should be “as much compatible as possible” to the previous model in order to take advantage of existing researches in information science.
In this paper, an adaptation of the extended vector space model for objects (such as XML documents) is proposed. Specifically, it is suggested that each object is represented by a tensor in a concept space. This adaptation of the classical vector space model has a major advantage. Since there exist several algorithms using the vector space model, the proposed approach makes existing algorithms easily adaptable. In particular, if a new similarity measure taking the richness if the tensors into account exists (as proposed in this article), most information science algorithms (such as the documents clustering, users profiling or users clustering) should work without any changes (and use indirectly richest features if available). Moreover, collections mixing XML documents and non-structured documents can be managed.

2 Related Work

These last years, XML documents receive an increasing attention of the researchers to use the semantic information provided by the tags. A recent paper made a review of information technologies applied to structured documents (such as XML) [15]. These authors have identified four classes of problems where structure information were used : information classification, information indexing, information retrieval (information query and ranking), and information presentation (this latest class being not useful for this paper). Majority of existing works are related to information classification : a set of documents must be assigned into one (or several) categories (or classes). Based on pre-categorized documents, most of the approaches propose to identify first the features best describing each class (using structure information) and then to apply a well-known classification method (such as Bayesian networks). As explained in [15], two types of structure information may be used : internal structure information and hyperlink structure information. But hyperlinks are specific to certain XML documents (in particular XHTML documents), so using hyperlink structure information only is not applicable for any XML documents. We will therefore focus on methods exploiting internal structure information.
As explained previously, for the representation of textual documents, indexing the documents with all the index terms (features) is not necessary useful. In this paper, we propose to classify the methods depending if they represent XML documents with a full structure approach (all the tags are used to represent the document), or with a partial structure approach (only specific tags are used).

2.1 Full Structure Approaches

Since an XML document is a tree of nodes (where leafs representing “inclusion” relations), Yi and Sundaresan represent XML documents as vectors where each element of a vector may be a subvector (representing the tree below a given node), this structure being recursive [26]. While this representation provides promising results on a part of the US Patent database, it is not clear how such an approach can handle documents with multiples XML schemas where same kind of tags may be contained in different inclusion orders and therefore different subvectors. Moreover, if the non-XML documents may be represented with this model (the vectors containing only index terms and no subvectors), existing algorithms must be adapted to manage the recursive representation proposed.
Denoyer and Gallinari separate the nodes of a tree corresponding to an XML document into structure or content nodes [10]. They propose to train different classifiers on each pair of a structure nodes and the corresponding content nodes, the predictions of each classifier being then combined to find the corresponding category. The interesting idea of this approach is that the relevant features are the combination of the content and the semantic (represented by the structure node). On the other hand, this approach used for classification cannot be applied as such to other types of problems (for example clustering, profiling or retrieving).
In the field of XML documents retrieval, all the approaches use a full structure. Most of them are based on a structured query language where the users are able to build queries containing index terms and structure elements. The best known example of such query languages are the “official” XPath and XQuery expressions of the W3C [21, 5]. An interesting approach is proposed by [6]. The authors use the XML structure to disambiguate the words contained in the tags. If a given tag, , is searched and a document contains a tag related to in WordNet [B]  [B] WordNet is a freely and publicly available lexical database of English (http://wordnet.princeton.edu)., they suppose that a similarity exists between the two tags and the content associated to tag may be relevant to the query. This idea suggests that the combinations of tags and content are the “comparison unit” between XML structures. In recent years, some approaches have been proposed to manage keyword-based queries which are easier to formulate for “normal” users and do not need a knowledge of the underlying structure. Recently, an multi-criteria approach to rank XML fragments was included in the GALILEI platform [1]. This study shows that using the proximity between the tags (corresponding to certain keywords) and the content containing some keywords as criteria provides better ranking results. Again, this recalls the importance of the combination of the content and the corresponding tags.
It must be noticed that several methods compute some similarity measures between XML documents or between an XML document and its schema [24]. In particular, in the field of XML filtering, several solutions were developed to detect part of XML documents related to a set of predefined XPath or XQuery expressions [7]. Since these methods make global comparisons on the whole XML trees (documents or schemas), they are not well suited to compare XML documents sharing only small parts of their schemas (for example two different documents using the same subset of tags). Moreover, most of them are limited to the structure part of the documents and does not include the content. Therefore, these methods were not used to solve problems such as classification, clustering or profiling (at least in the knowledge of the author).

2.2 Partial Structure Approaches

Some attempts try to select specific tags and consider them as important features describing the documents. One of the easiest way is to use the classical vector space model where different weights are associated to the index terms depending of the tags containing them (index terms associated to the tags <title> and <meta> being the most important) [17]. A similar approach was previously implemented in the GALILEI platform [13].
Another solution was proposed by Wu and Tang for XML documents classification [23]. One of their assumptions is that each class is identified by key terms that are not in any other classes. The key terms are extracted from a set of pre-categorized documents, and each class is represented by key paths which are all the paths of the XML trees of the pre-categorized documents starting from the root tag and ending with a tag containing at least one key term. They compute a similarity between a document and the class based on a comparison between all the paths of the documents ending with a tag containing at least one key term and the corresponding key paths. The document is then assigned to the class with which it has the highest similarity. A drawback of this approach is that the key paths are not well suited to represent classes of XML documents having different schemas (a same tag or key term being in different levels of the hierarchy for different XML schemas will have different paths).
In the context of hypertext categorization, a study shows that, when available, using metadata as specific dimensions in the features spaces increases the quality of the classification [25]. In particular, one of the methods used by the authors for the classification is the classical k-Means algorithms where the distances between the documents are computed as the cosines between the corresponding vectors (including the metadata). This can be seen as a first adaptation of the classical vector space model.
A more formal adaptation of the classical vector space model (section 1.3↑) to use structure information was proposed by Fox [27]. His model suggests that each document is represented by a set of subvectors, each subvector representing a specific type of concepts called c-types (authors, bibliographical links, etc.). The c-types to use remain free for particular purposes. In the context of XML documents, the model suggested by Fox has already been used [28]. Adapted for the well-known INEX 2005 collection, the authors define 18 c-types corresponding to important tags of the INEX 2005 collection. The main drawback of this approach is that the c-types used are dependent of that particular INEX 2005 collection and cannot be used for other XML documents collections.

3 Tensor Space Model

As explained in the previous section, taking the metadata tags into account to represent documents (such as XML ones) is useful. We present here a model where each object is represented by a tensor in a concept space.

3.1 Concept space

We suppose that the objects are indexed in a concept space, , formed by concepts, . In fact, these concepts may be terms, stopwords, metadata, XML tags or attributes, language-independent meaningful entities (such as names, cities, organizations, etc.) [8], etc. A concept may also represent the “normal content of a document” (which I call “default content”). Moreover, we suppose that each concept, , as a given concept type. A concept type identifies concepts that share a same “structure”, for example metadata from the Dublin Core Metadata [9], tags and attributes from a same XML schema, French stopwords, English stopwords, etc. Finally, these concept types are regrouped into concept categories that are related to the “nature” of the concept (textual content, metadata, etc.). Figure 1↓ illustrates the hierarchy of concepts.
figs/conceptspace.svg
Figure 1 Concept hierarchy.
The concept types and categories do not influence the concept space, they serve only to recognize concepts that are related. We will use this information to propose a new similarity measure. Let us define two functions: that returns the type of a given concept, ; and that returns the category of a given concept, .

3.2 Concept Independence

This model shares an hypothesis with the vector space model : concepts (such as index terms) are independent. This mean that the presence of a given concept (for example the term “deep”) provides no information on the presence of another concept (for example the term “purple”). This hypothesis is, of course, a simplification: concepts in general, and terms in particular, are dependent. In practice, however, this hypothesis is acceptable.
First of all, taking the dependance into account is a tricky problem: it complicates the model, increases the computational power needed (in particular for a large corpus), and the existing attempts don’t performed better [3].
Secondly, since the model is statistical, such a dependance can be taken into account indirectly through the co-occurrences of the concepts. To illustrate that, let us suppose two domains: genetic algorithms (described with terms like “genetic”, “algorithm”, “performance”, “chromosome”, “heuristic”, etc.) and genetics (described with terms like “genetic”, “biology”, “chromosome”, “heredity”, etc.). To describe these domaines, the terms “genetic” and “chromosome” are not enough, we must combine them, i.e. they are dependent from other terms. Yet, it is reasonable to suppose that documents about genetic algorithms will also contained “heuristic” or “performance”, while the terms “biology” and “heredity” will appear in documents about genetics. This means that their descriptions don’t include the terms “genetic” and “chromosome” alone, but will also be defined by other terms in order to make clean distinction between the two domains.

3.3 Tensor representation

In his model, Fox suggests to associate to a given object a (sub)vector for each c-types identified [32]. If Fox’s model is attractive, it has a little drawback: his c-types are specific to the INEX 2005 collection. Therefore, I propose a generalization of it. In fact, for any document corpus, we may suppose that each object has multiple vectors representing it, each vector being associated to a particular meta-concepts (what Fox called c-type).
Le us imagine that a object (for example a document), , has a “author” metadata which is “Pascal Francq”, and contents a single sentence “The connections are optical fibres”. If we suppose that we remove the stopwords and take only the stems into account, the object has two elements:
  1. A meta-concept “author” (of the concept category “metadata”) associated with the two textual concept (terms) “Pascal” and “Francq”.
  2. A meta-concept “body” (of the concept catagory “textual”) associated with the three textual concept (terms) “connect”, “optic” and “fiber”.
Each element may be represented by a vector in the concept space: the first one will have two non-zero elements, the second one having three non-zero elements. In fact, “author” and “body” are themselves concepts in our space. Therefore, the object can be represented has a tensor, where represents the weight of the concept associated with a vector (meta-concept) and . The weights depend on the method used to compute the object descriptions. For example, in the case of a document, the concepts representing terms have a weight corresponding to the number of their occurrences in the document.
Let us suppose that the space contains 7 concepts: represents “author”, represents “body”, represents the term “Pascal”, represents the term “Francq”, represents the term “coffee”, represents the term “ connect”, represents the term “optic” and represents the concept “fiber”. The object is represented by the following tensor:
As this example shows, the matrix representing a tensor is highly sparse. This tensor can be seen as a set of vectors , each vector being associated to a particular meta-concept, . As for the concepts, we can define two functions: that returns the type of the meta-concept, , associated to the vector; and that returns the category of that meta-concept.
In fact, if we suppose that there is only one meta-concept (for example “body”), the tensor space model is identical to the classical vector space model. The tensor space model can therefore be qualified as an extension of the vector space model if the method used to compute the document descriptions can extract multiple meta-concepts.

3.4 Inverse Object Frequency

3.4.1 Feature Space

In the classical vector space model, the inverse frequency factor, , plays an important role in the weighting of the terms in the vector (document representation). The inverse frequency factor, , for a feature is defined by:
where is the total number of documents, and the number of those containing the feature .
For a given number of documents, increases when decreases: the discriminant value of a given feature decreases when it occurs often in the corpus. In fact, its value is null if it appears in each document of the corpus.
This factor can be generalized to any kind objects (documents, profiles, etc.) that are described with terms [13]:
where is the total number of objects, and the number of those containing the feature .

3.4.2 Concept Space

The inverse frequency factor defined by () could be extended to the concept space. After all, “concept” is just another name for “feature”. But a concept “is more” than a feature: it has a type. Let us suppose, for example, a subset, , of all objects of that contains some email addresses (considered as a particular concept type). Moreover, let us suppose that a given email address, , is always present in these objects. Its factor is computed by:
where represents the set of objects which don’t contain any email address.
In practice, since is contained in each object providing some email addresses, its discriminant value should be null. But, due to the quantity , it is not the case. Therefore, a modification is proposed to compute the inverse object frequency factor, , for concept, :
where is the total number of objects described by at least one concept of the same type as , and the number of those containing the concept .
When applying () to compute the inverse object frequency factor in the case of the email address just presented, its value is now null since .

3.5 Concept Weighting

The document descriptions computes tensors which associate to each vector/concept pair,  [C]  [C]  is a concept that represents a concept type, while is a “normal” concept., extracted from a document, , a quantity, , that reflects its importance (for example, the number of occurrences of a term in a particular metadata of the document). Before using these tensors in further tasks, such as computing a document similarity or a profile description, it is necessary to adapt these weights in order to take the whole document corpus into account.
Several methods adapt these weights [3]. One of the most used in information retrieval is based on the term frequency ( ) and inverse document frequency ( ) factors [19, 20]:
where represents the weights of an element in the vector, the total number of documents, the number of those containing feature and is the occurrence of the feature for the document .
These factors have an interpretation: the is sort of document length normalization while the infers the importance of a particular feature regarding its distribution in the corpus (a feature appearing very often in the documents of the corpus has a less importance).
In the previous section, I explained how the the inverse object frequency factor has to be adapted to a concept space where the concepts are structured in types. Similar reasoning can be done with the term frequency factor where the normalization of a particular weight is done regarding the vector (and the whole tensor). It is therefore possible to propose a concept weighting method for the tensor space model that transform a local weight :
where \strikeout off\uuline off\uwave off is also a tensor, and\uuline default\uwave default represents the absolute value of and its uses is necessary because, for some objects (such as profiles), the weights may be negative.
This formula can be adapted to any subset of objects, :
where \strikeout off\uuline off\uwave off is also a tensor, \uuline default\uwave default is the total number of objects of the subset described by at least one concept of the same type as , and the number of those containing the concept .
This latest formula allows to act as a magnifying glass that focuses on a particular subset of objects (for example all the documents assessed by a user with a given profile).

4 Tensor Operators

One of the nice properties of the vector space model is that the classical vector operators work :
  • the sum of the vectors describing two documents, and , equals the vector build from a document obtained by concatenating and ;
  • the multiplication of a vector describing a document, , by a number, , equals the vector build from a document obtained by concatenating times .
This does work with tensors when their represent a object description. To show that, let us suppose the following two tensors :
where and represent concept types (for example, a metadata “author”) and the other concepts correspond to terms. Moreover, the weights associated to and are not identical.
If we make a classic sum:
Identically, if we multiply the tensor by 3:

5 Implementation

As explained earlier, the matrices representing the tensors are sparse. Moreover, each row, , of a tensor, , represents a (sub)vector, associated to a given concept, .
In the GALILEI library, the class GDescription provides a representation for a tensor describing an object [D]  [D] In practice, each object having a description inherits from the class GDescriptionObject (document, profile, etc.). In practice, it is a container of vectors implemented trough the GVector class. Each vector is associated to a concept, , and a weight representing the element of the corresponding tensor. The new operators are implemented as operators of the classes GDescription and GVector.
The GALILEI library provides the class GConcept that represents a given concept. It has the method GetIf that computes the inverse frequency for a particular type of object. In practice, it is dynamically updated each time an object description is added (method GDescription::AddRefs) or removed (method GDescription::DelRefs) from the session. To compute the inverse frequency for a set of descriptions, the GALILEI library provides the class GDescriptionSet and its method GetIf.

References

[1] Faiza Abbaci, Pascal Francq: “Multi-criteria Decision Support to Rank XML Fragments”, Proceedings of the 2007 Internation Conference on Internet Computing, 2007.

[2] Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval. Addison-Wesley, 1999.

[3] Ricardo Baeza-Yates, Berthier Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search. Addison-Wesley Professional, 2011.

[4] Tim Berners-Lee, James Hendler, Ora Lassila: “The Semantic Web”, Scientific American, 2001.

[5] S. Boag, D. Chamberlin, M. F. Fernandez, D. Florescu, J. Robie, J. Simeon, M. Stefanescu: “XQuery 1.0: An XML Query Language”, W3C Working Draft, 2002.

[6] Davide Buscaldi, Giovanna Guerrini, Marco Mesiti, Paolo Rosso: “Tag semantics for the retrieval of XML documents”, Proceedings of the 1st international symposium on Information and communication technologies, pp. 273—278, 2003.

[7] Songting Chen, Hua-Gang Li, Jun'ichi Tatemura, Wang-Ping Hsiung, Divyakant Agrawal, K. Selçuk Candan: “Scalable Filtering of Multiple Generalized-Tree-Pattern Queries over XML Streams”, IEEE Transactions on Knowledge and Data Engineering, pp. 1627—1640, 2008.

[8] N. A. Chinchor: “Overview of MUC-7/MET-2”, Proceedings of the Seventh Message Understanding Conference (MUC-7), 1998.

[9] DCMI: Dublin Core Metadata Element Set, Version 1.1: Reference Description. 1999. URL: http://dublincore.org/documents/dces.

[10] Ludovic Denoyer, Patrick Gallinari: “Bayesian network model for semi-structured document classification”, Information Processing & Management, pp. 807—827, 2004.

[11] Willian B. Frakes: Stemming Algorithms in Information Retrieval: Data Structures & Algorithms (Frakes, William B. and Baeza-Yates, Ricardo, ed.). Prentice-Hall, 1992.

[12] W. Francis, H. Kucera: Frequency Analysis of English Usage. Houghton Mifflin, 1982.

[13] Pascal Francq: Collaborative and structured search: an integrated approach for sharing documents among users. 2003.

[14] Pascal Francq: The GALILEI Platform: Social Browsing to Build Communities of Interests and Share Relevant Information and Expertise in Open source for knowledge and learning management : strategies beyond tools (Lytras, Miltiadis D. and Naeve, Ambjorn, ed.). Idea Group Publishing, 2007.

[15] S. Liu, C. A. McMahon, S. J. Culley: “A review of structured document retrieval (SDR) technology to improve information access performance in engineering document management”, Computers in Industry, pp. 3—16, 2008.

[16] H. P. Luhn: “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”, IBM Journal of Research and Development, pp. 309—317, 1957.

[17] Andrea Molinari, Gabriella Pasi, R. A. Marques Pereira: “An indexing model of HTML documents”, Proceedings of the 2003 ACM symposium on Applied computing, pp. 834—840, 2003.

[18] Peter Murray-Rust: “CML - Chemical Markup Language”, World Wide Web Journal, pp. 135—147, 1997.

[19] G. Salton: Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

[20] G. Salton: The SMART Retrieval System — Experiments in Automatic Document processing. Prentice Hall Inc., 1971.

[21] W3C: XML Path Language (XPath) Version 1.0. 1999. URL: http://www.w3.org/TR/xpath.

[22] W3C: eXtensible Stylesheet Language (XSL) Version 1.0. 2000. URL: http://www.w3c.org/TR/xsl.

[23] Junwei Wu, Jian Tang: “A bottom-up approach for XML documents classification”, Proceedings of the 2008 international symposium on Database engineering & applications, pp. 131—137, 2008.

[24] Guangming Xing, Chaitanya R. Malla, Zhonghang Xia, Snigdha Dantala Venkata: “Computing edit distances between an XML document and a schema and its application in document classification”, Proceedings of the 2006 ACM symposium on Applied computing, pp. 831—835, 2006.

[25] Y. Yang, S. Slattery, R. Ghani: “A Study of Approaches to Hypertext Categorization”, Journal of Intelligent Information Systems, pp. 219—241, 2002.

[26] Jeonghee Yi, Neel Sundaresan: “A classifier for semi-structured documents”, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 340—344, 2000.