Influence and co-occurrence (idea) by parmentier

How should the weights of Concept Network links be computed when building a Concept Network to represent bibliographic references? Work on the terms of bibliographic databases points to co-occurrence as the basis.

In the field of knowledge acquisition, it is generally accepted that there are two ways of proceeding: upward and downward. The downward approach is called "onomasiologique" in French (the English counterpart is onomasiological); it starts from the conceptual level (a model) in order to understand texts. This approach is efficient when the documents are tightly structured, but works poorly on unexpected knowledge. The upward approach is called "sémasiologique" (semasiological in English); it starts from the data in order to build conceptual entities. The building of the Concept Network can therefore be called semasiological, since it starts from the data of the reference base in order to build concepts.

In Frath et al. 1995, the authors say that, to them, "meaning is built ... thanks to a combination: the constituents of a syntagm act to semantically constrain each other, and by doing so specify the syntagm's meaning" (rough translation of "le sens se construit essentiellement grâce à une combinatoire: les constituants d'un syntagme exercent les uns sur les autres des contraintes sémantiques qui en restreignent et donc en précisent le sens."). Their system, which helps to extract conceptual entities and relationships from a text, extracts repeated segments, simplifies them, generalizes them morphologically (a rough lemmatization), and then searches for co-occurrences of pairs of words. These relationships are then labelled manually. Our understanding of meaning is similar: a word's meaning becomes more and more precise only thanks to the other words (or concepts) that are associated with it.

Researchers analyzing human understanding during reading have pointed out structures similar to those of the Concept Network. For Fayol (Fayol 1992), schemas designate "blocks" of knowledge concerning a domain; they are built from semantic networks whose elements have privileged relationships because of their frequent co-occurrences. Thus, for Fayol, elements that co-occur frequently can be brought closer to one another. Moreover, authors in the literature refer to an activation mechanism, and this activation spreads through the networks that make up the schemas. Likewise, Segui tells us (Segui 1992) that presenting a stimulus word activates not only its own lexical representation but also those of a set of words that are its orthographic neighbors, in order to quickly delimit the candidates to be recognized during reading. Moving from strictly orthographic recognition to conceptual recognition, the orthographic neighborhood should be replaced by a conceptual neighborhood. He also says that, experimentally, it is possible to act on the recognition of a word by modifying beforehand the activation value of its most frequent neighbors.

In his Ph.D. thesis on the analysis of associations (Michelet 1988), Michelet says: "giving the most relevant associations of a term allows one to reconstitute a definition for it: the essence of a definition is association." He presents association indices based on the co-occurrence of terms. According to his definition, "an association index must yield non-decreasing values when co-occurrence increases." That is fairly intuitive: the more often two terms appear together, the stronger their association (in our case, the higher their mutual influence). Moreover, "an association index between two terms must not increase if a record containing only one of the two terms is added to the base." It would be harmful if such an addition modified the influence of one term on another in this way: the association of the two terms would increase while their co-occurrence stayed the same.

Let C_i be the number of occurrences of object i in a base of size N.

Let C_{i j} be the number of records in this base where objects i and j co-occur.

The equivalency index:

E_{i j} = C_{i j}² / (C_i × C_j)

"shows all the 'good' properties ...: it's a local association index",

where an association index is said to be homogeneous if it remains constant when all its variables are multiplied by a constant factor, and local if it does not depend on the size of the base.
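As a minimal sketch of these definitions, the counts and the equivalency index could be computed as follows in Python. The record format (one set of terms per record), the toy data and the function names are assumptions made for illustration; they do not come from Michelet's thesis or from the Concept Network code.

```python
from collections import Counter
from itertools import combinations


def count_occurrences(records):
    """Return (C, C_pair): per-term counts and per-pair co-occurrence counts.

    `records` is an iterable of sets of terms (hypothetical format).
    """
    occurrences = Counter()
    cooccurrences = Counter()
    for record in records:
        terms = sorted(set(record))
        occurrences.update(terms)
        # Each unordered pair is counted once per record.
        cooccurrences.update(combinations(terms, 2))
    return occurrences, cooccurrences


def equivalency_index(occurrences, cooccurrences, i, j):
    """E_ij = C_ij^2 / (C_i * C_j); 0.0 if either term never occurs."""
    c_i, c_j = occurrences[i], occurrences[j]
    if c_i == 0 or c_j == 0:
        return 0.0
    c_ij = cooccurrences[tuple(sorted((i, j)))]
    return c_ij ** 2 / (c_i * c_j)


records = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b"}]
e_before = equivalency_index(*count_occurrences(records), "a", "b")

# A record containing only one of the two terms must not increase the index.
e_one_term = equivalency_index(*count_occurrences(records + [{"a"}]), "a", "b")

# Locality: a record containing neither term changes N but not the index.
e_unrelated = equivalency_index(*count_occurrences(records + [{"c"}]), "a", "b")
```

In this toy base, adding the record {"a"} lowers the index for the pair (a, b), while adding {"c"} leaves it unchanged, which matches the two properties quoted above.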

This equivalency index gives a notion of conceptual proximity: two terms that often appear in the same record should be linked conceptually. As Michelet says: "Statistical association coefficients can be used to give an idea of the structural links existing in the vocabulary. ... statistical aggregations do not point to a 'logical' link but, on the contrary, to a convergence of interest."

As we wish to obtain a way of computing the influence of one node on another, we could use the equivalency index as a bidirectional influence (i.e. the same influence from node 1 towards node 2 as from node 2 towards node 1). That would be acceptable behavior for some applications (for example, another application that has also been coded, the Traveling Salesman Problem, has its links doubled to make them bidirectional). But in the case of bibliographic references, one wants term 1 to be able to influence term 2 differently from the way term 2 influences term 1. Indeed, take the example of an author and one of his co-authors. Let A₁ be the first author and A₂ his co-author in a reference. Let C₁ be the number of appearances of A₁ in the base, and C₂ that of A₂ in the same base. Let C_{1 2} be the number of joint articles of the two authors. Let's give values to these variables: C₁ = 50, C₂ = 5, C_{1 2} = 4.

For the equivalency index, E_{1 2} = 4² / (50 × 5) = 6.4%. However, one can easily see that A₂ is much more strongly related to A₁ than A₁ is to A₂, since almost all of his references have A₁ as a co-author.

The inclusion index (Michelet 1988) captures this notion of the "influence" of one term on another much better:

I_i→j = C_{i j} / C_i

Here, I_1→2 = 4/50 = 8% whereas I_2→1 = 4/5 = 80%.
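The arithmetic of this example can be checked with a few lines of Python. This is only an illustrative sketch; the function and variable names are invented here, and only the formulas and the three counts come from the text.

```python
def equivalency_index(c_i, c_j, c_ij):
    """E_ij = C_ij^2 / (C_i * C_j): symmetric in i and j."""
    return c_ij ** 2 / (c_i * c_j)


def inclusion_index(c_i, c_ij):
    """I_i->j = C_ij / C_i: influence of term i on term j."""
    return c_ij / c_i


# Counts from the example: A1 appears 50 times, A2 5 times, 4 joint articles.
c_1, c_2, c_12 = 50, 5, 4

e_12 = equivalency_index(c_1, c_2, c_12)
e_21 = equivalency_index(c_2, c_1, c_12)
i_1_to_2 = inclusion_index(c_1, c_12)
i_2_to_1 = inclusion_index(c_2, c_12)

print(f"E_12   = {e_12:.1%}")
print(f"I_1->2 = {i_1_to_2:.1%}")
print(f"I_2->1 = {i_2_to_1:.1%}")
```

The equivalency index gives the same value in both directions, whereas the inclusion index separates the weak influence of A₁ on A₂ from the strong influence of A₂ on A₁.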

Since the activation value of a node propagates towards the other nodes according to its influences, and since a node is activated when an agent finds one of its instances in the Blackboard, it is better to use the inclusion index to represent the influence of A₂ on A₁.

Indeed, if the system updates A₂, there is an 80% probability (according to the statistics of the learning base) that A₁ is also in the reference being processed, whereas if the system finds A₁, it has only an 8% likelihood of finding A₂ in the same reference.
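To make the asymmetry concrete, here is a small Python sketch of a single propagation step with the two inclusion indices used as link weights. The propagation rule itself (each node sends its activation multiplied by the influence, capped at 1.0) is an assumption made for this illustration; this writeup does not specify the actual rule used by the Concept Network, and the names below are hypothetical.

```python
def propagate_once(activations, influences):
    """One propagation step under an ASSUMED rule.

    Each source node adds (its activation * influence) to the target node,
    and the result is capped at 1.0. This is not the documented rule.
    """
    updated = dict(activations)
    for (source, target), weight in influences.items():
        sent = activations.get(source, 0.0) * weight
        updated[target] = min(1.0, updated.get(target, 0.0) + sent)
    return updated


# Directed link weights taken from the inclusion indices of the example.
influences = {("A1", "A2"): 4 / 50, ("A2", "A1"): 4 / 5}

# An agent finds A2 in the Blackboard: A1 receives a strong activation.
after_a2 = propagate_once({"A1": 0.0, "A2": 1.0}, influences)

# An agent finds A1 instead: A2 receives only a weak activation.
after_a1 = propagate_once({"A1": 1.0, "A2": 0.0}, influences)
```

Under this assumed rule, finding A₂ raises the activation of A₁ to 0.8, while finding A₁ raises that of A₂ to only 0.08.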

Still according to (Michelet 1988): "if, when one observes a property a, there is a probability P1 that one also observes the property b, this probability is estimated by the relative frequency of appearance of b knowing that a exists, i.e. by the inclusion coefficient I_{a b} = C_{a b} / C_a." The influence I_i→j is therefore an estimate of the probability of observing the term j knowing that the term i has been observed; it is thus an estimate of the conditional probability P(j|i).
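Written out, with N the size of the base and relative frequencies taken as probability estimates (a standard step that the quotation leaves implicit), the link between the two indices and conditional probabilities is:

```latex
I_{i \to j} = \frac{C_{ij}}{C_i}
            = \frac{C_{ij}/N}{C_i/N}
            \approx \frac{P(i, j)}{P(i)}
            = P(j \mid i),
\qquad
E_{ij} = \frac{C_{ij}^{2}}{C_i \, C_j}
       = I_{i \to j} \cdot I_{j \to i}
       \approx P(j \mid i) \, P(i \mid j).
```

The equivalency index is thus the product of the two inclusion indices, which is why it cannot tell the two directions apart. In the co-author example, 8% × 80% = 6.4%.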

This writeup is closely related to the Building a Concept Network to represent bibliographic references writeup.

Bibliography

Fayol 1992: M. Fayol.
La lecture, processus, apprentissage, troubles, chapter La compréhension lors de la lecture: un bilan provisoire et quelques questions, pages 79-101.
Presses Universitaires de Lille, 1992.
Frath et al. 1995: P. Frath, R. Oueslati and F. Rousselot.
Identification de relations sémantiques par repérage et analyse de co-occurrences de signes linguistiques.
In Actes des journées d'Acquisition des Connaissances, pages 173-185, Grenoble, 5-7 avril 1995.
Michelet 1988: B. Michelet.
L'analyse des associations.
Thèse de doctorat, Université de Paris VII, UFR de Chimie, Paris, 26 Octobre 1988.
Spécialité: Information Scientifique et Technique.
Segui 1992: J. Segui.
La lecture, processus, apprentissage, troubles, chapter Les composantes cognitives de la lecture, pages 43-53.
Presses Universitaires de Lille, 1992.

Disclaimer: as I don't speak English fluently, I welcome all suggestions to improve my writeups.

Disclaimer bis: I translated the citations of this writeup. If you ask, I can add the original French version.
