How should the weights of Concept Network links be computed when building a Concept Network to represent bibliographic references? Work on exploiting the terms of bibliographic databases points to the use of co-occurrence.
In the field of knowledge acquisition, it is generally accepted that there are two ways of proceeding: top-down and bottom-up. The top-down approach is called onomasiological ("onomasiologique" in French): it starts from the conceptual level (a model) in order to understand texts. This approach is efficient when the documents are tightly structured, but becomes counterproductive when faced with unexpected knowledge. The bottom-up approach is called semasiological ("sémasiologique" in French): it starts from the data in order to build conceptual entities. Building the Concept Network can therefore be called semasiological, since it starts from the data of the reference base in order to build concepts.
In Frath et al. 1995, the authors say that, for them, "meaning is built ... thanks to a combination: the constituents of a syntagm semantically constrain each other, and by doing so specify the syntagm's meaning" (rough translation of "le sens se construit essentiellement grâce à une combinatoire: les constituants d'un syntagme exercent les uns sur les autres des contraintes sémantiques qui en restreignent et donc en précisent le sens."). Their system, which helps to extract conceptual entities and relationships from a text, extracts repeated segments, simplifies them, generalizes them morphologically (a rough lemmatization), and then searches for co-occurrences of pairs of words. These relationships are then labelled manually. Our understanding of meaning is similar: a word's meaning becomes more and more precise only thanks to the other words (or concepts) associated with it.
Researchers analyzing human comprehension during reading have pointed out structures similar to those of the Concept Network. For Fayol (Fayol 1992), schemas designate "blocks" of knowledge concerning a domain; they are made up of semantic networks whose elements have privileged relationships because of their frequent co-occurrences. Thus, for Fayol, elements that frequently co-occur can be brought closer to one another. Moreover, it appears that authors in the literature refer to an activation mechanism, and that this activation spreads through the networks constituting the schemas. Likewise, Segui (Segui 1992) teaches us that presenting a stimulus word activates not only its own lexical representation, but also those of a set of words that are its orthographic neighbors, in order to quickly delimit the candidates for recognition during reading. Moving from the frame of strict orthographic recognition to conceptual recognition, the orthographic neighborhood should be replaced by a conceptual neighborhood. He also says that, experimentally, it is possible to act on the recognition of a word by modifying beforehand the activation value of its most frequent neighbors.
In his Ph.D. thesis on the analysis of associations (Michelet 1988), Michelet says: "giving the most relevant associations of a term allows one to reconstitute a definition for it: the essence of a definition is association." He presents association indices based on the co-occurrence of terms. According to his definition, "an association index must yield non-decreasing values when co-occurrence increases." This is fairly intuitive: the more often two terms appear together, the stronger their association (in our case, the higher their influence on each other). Moreover, "an association index between two terms must not increase if a record containing only one of the two terms is added to the base." It would be harmful if such an addition changed the influence of one term on another in this way: the association between the two terms would increase while their co-occurrence remained unchanged.
Let $C_i$ be the number of occurrences of object $i$ in a base of size $N$.
Let $C_{ij}$ be the number of records in this base where objects $i$ and $j$ co-occur.
The equivalency index:
$E_{ij} = C_{ij}^2 / (C_i \times C_j)$
"shows all the `good' properties ...: it's a local association index"
Here, an association index is said to be homogeneous if it remains constant when all its variables are multiplied by the same constant factor, and local if it does not depend on the size of the base.
This equivalency index gives a notion of conceptual proximity: two terms that often appear in the same record should be conceptually linked. As Michelet says: "Statistical association coefficients can be used to give an idea of the structural links existing in the vocabulary. ... statistical aggregations do not point to a 'logical' linking but, on the contrary, to a convergence of interest."
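The equivalency index can be sketched in a few lines of Python. This is only an illustration, assuming each record of the base is represented as a set of terms; the function names `count_occurrences` and `equivalency_index` and the toy records are hypothetical and do not come from Michelet's thesis. The last lines check, on this toy base, that adding a record containing only one of the two terms does not raise the index.

```python
from collections import Counter
from itertools import combinations


def count_occurrences(records):
    """Return per-term record counts (C_i) and per-pair co-occurrence counts (C_ij)."""
    occurrences = Counter()
    cooccurrences = Counter()
    for record in records:
        terms = set(record)  # a term counts at most once per record
        occurrences.update(terms)
        # frozenset keys make the pair count symmetric: {i, j} == {j, i}
        cooccurrences.update(frozenset(p) for p in combinations(sorted(terms), 2))
    return occurrences, cooccurrences


def equivalency_index(c_i, c_j, c_ij):
    """E_ij = C_ij^2 / (C_i * C_j); symmetric in i and j."""
    if c_i == 0 or c_j == 0:
        return 0.0
    return c_ij ** 2 / (c_i * c_j)


# Toy base (illustrative data only).
records = [
    {"network", "concept"},
    {"network", "concept", "agent"},
    {"network"},
    {"agent"},
]
pair = frozenset({"network", "concept"})

occ, co = count_occurrences(records)
e_before = equivalency_index(occ["network"], occ["concept"], co[pair])

# Add a record that contains only one of the two terms.
occ2, co2 = count_occurrences(records + [{"network"}])
e_after = equivalency_index(occ2["network"], occ2["concept"], co2[pair])

assert e_after <= e_before  # the index did not increase
```

Note that the size of the base never enters `equivalency_index`, which is what makes the index local in the sense given above.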
Since we want a way of computing the influence of one node on another, we could turn the equivalency index into a bidirectional influence (i.e. the same influence from node 1 towards node 2 as from node 2 towards node 1). This would be acceptable behavior for some applications (for example, the Traveling Salesman Problem, which has also been coded, is a different application in which the links were doubled to make them bidirectional). But in the case of bibliographic references, we want term 1 to be able to influence term 2 differently from the way term 2 influences term 1. Consider the example of an author and one of his co-authors. Let A1 be the first author and A2 his co-author in a reference. Let $C_1$ be the number of appearances of A1 in the base, and $C_2$ that of A2 in the same base.
Let $C_{12}$ be the number of joint articles by the two authors. Let us give values to these variables:
$C_1 = 50$, $C_2 = 5$, $C_{12} = 4$.
For the equivalency index, $E_{12} = 4^2 / (50 \times 5) = 16/250 = 6.4\%$. However, one can easily see that A2 is much more strongly related to A1 than A1 is to A2, since almost all of A2's references have A1 as a co-author.
The inclusion index (Michelet 1988) captures this notion of the "influence" of one term on another much better:
$I_{i \to j} = C_{ij} / C_i$
Here, $I_{1 \to 2} = 4/50 = 8\%$ whereas $I_{2 \to 1} = 4/5 = 80\%$.
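The asymmetry is easy to reproduce numerically. The snippet below is a minimal sketch using the values of the co-author example; the function name `inclusion_index` is a hypothetical choice made for this illustration.

```python
def inclusion_index(c_i, c_ij):
    """I_{i->j} = C_ij / C_i: share of the records containing i that also contain j."""
    if c_i == 0:
        return 0.0
    return c_ij / c_i


# Values from the co-author example: C_1 = 50, C_2 = 5, C_12 = 4.
c1, c2, c12 = 50, 5, 4

i_1_to_2 = inclusion_index(c1, c12)  # influence of A1 on A2
i_2_to_1 = inclusion_index(c2, c12)  # influence of A2 on A1

print(f"I(1->2) = {i_1_to_2:.0%}, I(2->1) = {i_2_to_1:.0%}")
# prints: I(1->2) = 8%, I(2->1) = 80%
```

Unlike the equivalency index, the result depends on which term is taken as the starting point, because only the count of the source term appears in the denominator.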
Since the activation value of a node propagates towards the other nodes according to its influences, and since a node is activated when an agent finds one of its instances in the Blackboard, it is better to use the inclusion index to represent the influence of A2 on A1.
Indeed, if the system updates A2, there is an 80% probability (according to the statistics of the learning base) that A1 is also in the reference being processed, whereas if the system finds A1, there is only an 8% likelihood of finding A2 in the same reference.
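To make the effect of asymmetric weights on activation concrete, here is a deliberately simplified sketch. The propagation rule is an assumption made for this illustration only (each node adds its activation multiplied by the link weight to its targets, capped at 1.0); it is not the update rule of the actual Concept Network, and the name `propagate_once` is hypothetical.

```python
def propagate_once(activation, weights):
    """One propagation step under an assumed rule.

    Each source node adds (its activation * link weight) to each target,
    and the result is capped at 1.0. Both choices are assumptions of this sketch.
    """
    updated = dict(activation)
    for source, level in activation.items():
        for target, weight in weights.get(source, {}).items():
            updated[target] = min(1.0, updated.get(target, 0.0) + level * weight)
    return updated


# Inclusion indices from the co-author example: I(A1->A2) = 0.08, I(A2->A1) = 0.8.
weights = {"A1": {"A2": 0.08}, "A2": {"A1": 0.8}}

# Case 1: an agent has just found A2 in the reference being processed.
after_a2 = propagate_once({"A1": 0.0, "A2": 1.0}, weights)

# Case 2: an agent has just found A1 instead.
after_a1 = propagate_once({"A1": 1.0, "A2": 0.0}, weights)
```

Under these assumptions, finding A2 raises the activation of A1 to 0.8, while finding A1 raises that of A2 to only 0.08, which mirrors the 80% and 8% figures above.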
Still according to (Michelet 1988): "if one observes a property a, when there is a P1 probability that one also observes the property b, this probability is estimated by the relative frequency of appearance of b knowing that a exists, i.e. by the inclusion coefficient $I_{ab} = C_{ab} / C_a$." The influence $I_{i \to j}$ is therefore an estimate of the probability of observing term $j$ given that term $i$ has been observed; it is thus an estimate of the conditional probability $P(j|i)$.
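Putting the pieces together, the directed link weights of a whole base can be estimated in one pass over the records. This is a minimal sketch, assuming each record is a set of terms; the function name `link_weights` and the five-record base are hypothetical, chosen only to illustrate the estimate of P(j|i).

```python
from collections import Counter, defaultdict
from itertools import permutations


def link_weights(records):
    """Return weights[i][j] = C_ij / C_i, an estimate of P(j | i)."""
    occurrences = Counter()
    cooccurrences = Counter()
    for record in records:
        terms = set(record)  # a term counts at most once per record
        occurrences.update(terms)
        # Ordered pairs: both (i, j) and (j, i) are counted for each record.
        cooccurrences.update(permutations(terms, 2))

    weights = defaultdict(dict)
    for (i, j), c_ij in cooccurrences.items():
        weights[i][j] = c_ij / occurrences[i]
    return weights


# Illustrative base: A1 signs five references, A2 signs two, both with A1.
records = [{"A1", "A2"}, {"A1", "A2"}, {"A1"}, {"A1"}, {"A1"}]
weights = link_weights(records)

print(weights["A1"]["A2"])  # 2 / 5
print(weights["A2"]["A1"])  # 2 / 2
```

Pairs that never co-occur simply get no link in this sketch, which keeps the resulting network sparse.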
This writeup is closely related to the writeup "Building a Concept Network to represent bibliographic references".
- Fayol 1992
M. Fayol.
La lecture, processus, apprentissage, troubles, chapter La compréhension lors de la lecture: un bilan provisoire et quelques questions, pages 79-101.
Presses Universitaires de Lille, 1992.
- Frath et al. 1995
P. Frath, R. Oueslati and F. Rousselot.
Identification de relations sémantiques par repérage et analyse de co-occurrences de signes linguistiques.
In Actes des journées d'Acquisition des Connaissances, pages 173-185, Grenoble, 5-7 April 1995.
- Michelet 1988
B. Michelet.
L'analyse des associations.
Doctoral thesis, Université de Paris VII, UFR de Chimie, Paris, 26 October 1988.
Specialty: Information Scientifique et Technique.
- Segui 1992
J. Segui.
La lecture, processus, apprentissage, troubles, chapter Les composantes cognitives de la lecture, pages 43-53.
Presses Universitaires de Lille, 1992.
Disclaimer: as I am not fluent in English, I welcome any suggestions to improve my writeups.
Second disclaimer: I translated the quotations in this writeup. If you ask, I can add the original French versions.