Building the logical part of a Concept Network representing bibliographic references

The logical part of the Concept Network (see logical structure of a bibliographic reference) is a translation of the information contained in the base. It has two parts: the generic part (the hierarchy of fields, see Building a Concept Network to represent bibliographic references) and the specific part (the instances of the fields found in the base).

The reference base used to build the model is in BibTeX format and contains references in the field of handwriting recognition and document analysis.

It contains 908 references (or records), corrected to be closer to "ideal" references, that is to say references that match the format and are as coherent as possible. Indeed, despite the experience of the users of this base, some references were badly written: wrong type (article instead of inproceedings, techreport instead of phdthesis, etc.), wrong choice of fields (number instead of volume, note instead of other more specific fields, content of the address field in the publisher field, etc.), wrong syntax in the fields (authors separated by a comma instead of "and", "and al." instead of "and others", etc.).

The base contains French and English references. Table 1 shows that the main types (75%) are inproceedings and article.

Table 1: distribution of the reference types in the database

                    Type           Occurrence
                    inproceedings  387  43%
                    article        288  32%
                    book            61   7%
                    techreport      51   6%
                    phdthesis       39   4%
                    misc            30   3%
                    inbook          23   3%
                    incollection    16   2%
                    manual          10   1%
                    proceedings      2   0%
                    booklet          1   0%

A second level field is a sub-field added when the BibTeX format is converted to XML. Second level sub-fields exist only implicitly in BibTeX. For example, the a field is a sub-field of the author field.

Since authors are separated from one another by " and ", each author can be isolated syntactically. The same holds for the editor field, which gives e, and for publisher, which gives pub.

The keywords field contains keywords separated by commas. In XML, they are placed in the k sub-field. This field almost never appears in the physical version of a reference (and never when the plain bibliographic style is used), but keeping its instances in the Concept Network is useful: they link various terms together, which is far from negligible information, and they help to activate the right concept more quickly.

The word field is a sub-field of title: the DILIB conversion program (BibTeX2Sgml) splits this field into words, as it does for the fields booktitle, journal, type, chapter, month, and school.
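The splitting rules described above can be sketched in a few lines of Python. This is only an illustration, not the BibTeX2Sgml program itself: the function name split_second_level is hypothetical, only the author, editor, keywords and title fields are handled, and any other field is returned unchanged.

```python
import re


def split_second_level(field, value):
    """Split a first level BibTeX field value into (sub-field, instance) pairs.

    Hypothetical helper: the sub-field names a, e, k and word come from the
    text, the rest is an assumption made for illustration.
    """
    if field in ("author", "editor"):
        # Names are separated by the word "and" surrounded by whitespace.
        parts = re.split(r"\s+and\s+", value.strip())
        tag = "a" if field == "author" else "e"
    elif field == "keywords":
        # Keywords are separated by commas.
        parts = value.split(",")
        tag = "k"
    elif field == "title":
        # Titles are split into words on whitespace.
        parts = value.split()
        tag = "word"
    else:
        # Fields not covered by this sketch are kept whole.
        return [(field, value.strip())]
    return [(tag, part.strip()) for part in parts if part.strip()]
```

For example, split_second_level("author", "C. Y. Suen and S. N. Srihari") yields two a instances, one per author.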

Table 2: distribution of the fields in the 908 references

      Field          Name           # references  # instances  2nd level
      Address        address                 381          142
      Author         a                       900        1,150      X
      Booktitle      bw                      403          348      X
      Category                                 6            4      X
      Chapter        cword                    22           74      X
      Editor         e                        78           60      X
      Howpublished   howpublished             31           31
      Institution    institution              55           37
      Journal        jw                      287          135      X
      Key            key                       6            6
      Keywords       k                       555          405      X
      Month          mw                      257           75      X
      Note           note                     17           15
      Number         number                  195           58
      Organization   organization             22           16
      Pages          pages                   574          564
      Publisher      pub                     154          120      X
      Ref            ref                     908          908
      School         sw                       40           63
      Series         series                    2            2
      Title          word                    907        1,917      X
      Type           tw                       40           39      X
      Volume         volume                  308           82
      Year           year                    906           44

Table 2 shows, for each field, the number of references containing it, the number of distinct instances (for the leaves of the hierarchy), and whether or not it belongs to the second level.

For first level fields, the number of instances is lower than or equal to the number of references. This is expected: if, for example, the instance "New York" of the address field appears in n references, it is counted only once as an instance.

The specific logical part contains the instances of the leaves of the field hierarchy. (Eliminating some instances, such as empty words, on the pretext that they bring no exploitable information a priori would be a miscalculation.) Above all, it contains intra-field and inter-field links: links between terms of the same field (intra-field links) and links between terms belonging to different fields (inter-field links).

In this way, the network records the links that an author (for example) has with other authors, but also with the words of titles (the words that he often uses), with the names of the journals or conferences in which he is often published, and so on.

All these links are weighted according to the formula of the inclusion index: I_{i→j} = C_{ij} / C_i.
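As a rough illustration of this weighting, the sketch below computes the inclusion index on a toy base. It assumes, since the text does not spell it out, that C_i is the number of references containing term i and C_{ij} the number of references containing both i and j. The function name inclusion_index and the three sample references are invented for the example.

```python
from collections import Counter
from itertools import permutations


def inclusion_index(references):
    """Return {(i, j): C_ij / C_i} for every ordered pair of co-occurring terms.

    Assumption: C_i counts the references containing i, and C_ij counts the
    references containing both i and j.
    """
    term_count = Counter()
    pair_count = Counter()
    for terms in references:
        unique = set(terms)
        term_count.update(unique)
        # Ordered pairs, because the link i -> j and the link j -> i
        # generally carry different weights.
        pair_count.update(permutations(unique, 2))
    return {(i, j): c / term_count[i] for (i, j), c in pair_count.items()}


# Invented toy base: each reference is the set of its field instances.
refs = [
    {"A:C.Y.Suen", "W:ocr", "Y:1995"},
    {"A:C.Y.Suen", "W:document", "Y:1993"},
    {"A:S.N.Srihari", "W:document", "Y:1993"},
]
weights = inclusion_index(refs)
```

Here the link from A:C.Y.Suen to W:ocr weighs 0.5 (one of the author's two references contains the word), while the reverse link weighs 1.0. Pairs that never co-occur, such as the two years, get no entry at all.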

The specific network is a priori fully connected. For a model containing 6,295 nodes, the number of links would be 2 x 6,295², that is to say about 79 million, which is far too many to manage. However, many of these links have a null weight (those between terms that never appear in the same reference). For example, the intra-field links of the year field are all null, because this field occurs only once in a reference. All links with a null weight are deleted from the model.

Nevertheless, some of the remaining links are so weak that they can be neglected. These too are deleted. Because the BAsCET parameters varied while the application was being tuned, the threshold was set experimentally. In practice, for 6,000 terms, only 96,000 links are kept.
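The order of magnitude and the pruning step can be checked with a short sketch. The text does not give the experimental threshold, so the threshold argument below is a free parameter and the value used in the example is purely hypothetical. The function name prune_links is also an assumption.

```python
def prune_links(weights, threshold):
    """Drop null links and links weaker than the threshold.

    weights maps an ordered pair of terms to its inclusion index.
    The threshold is an assumed parameter, not the value used in BAsCET.
    """
    return {pair: w for pair, w in weights.items() if w > 0 and w >= threshold}


# Link count of the fully connected network, as quoted in the text.
n_nodes = 6295
fully_connected = 2 * n_nodes ** 2

# Tiny invented example: one null link, one weak link, one strong link.
sample = {("a", "b"): 0.0, ("a", "c"): 0.02, ("c", "a"): 0.5}
kept = prune_links(sample, threshold=0.05)
```

With these invented weights, only the link from c to a survives.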

Figure 1: Specific structure of a Concept Network for references

              +----------+           +-----+             +------+
           +->|A:C.Y.Suen|<=========>|W:ocr|<=====   ===>|Y:1995|<=++
           |  +----------+           +-----+      \ /    +------+  ||
           |                            ^          X               ||
           |  +-------------+      +----------+   / \    +------+  ||
           +->|A:S.N.Srihari|<====>|W:document|<==   ===>|Y:1993|  ||
              +-------------+      +----------+          +------+  ||
                    ^                                              ||
                    ||                                             ||
                    ++=============================================++

          Legend:  <====> inter-field links  <----> intra-field links

Empty words are nevertheless added to the model because, even if they carry no meaning, they can help to discriminate between fields: if the string "of" is found, one can be sure that it does not belong to the year field.

So that the terms appearing most often in the base (on which one can therefore rely to obtain good results) are not deactivated too quickly, their Conceptual Importance (CI) is higher than that of terms appearing less often.

Figure 2: Instance of a hierarchy leaf

                                       DOC
                                        |
                                     contains
                                        |
                                        v
                                     (lea)F
                                        |
                                  instantiated to
                                        |
                                        v
                                    I(nstance)

Let F be the leaf of the hierarchy that is the father of I (its instance, see figure 2), CI(F) its conceptual importance, Occ(I) the number of occurrences of I in the base, and MaxOcc(I) the maximal number of occurrences of any instance of F in the base.

CI(I) = CI(F) + (100 - CI(F)) x Occ(I) / MaxOcc(I)

This formula places the conceptual importance of an instance between its father's conceptual importance and 100, the most frequent instances having the highest conceptual importance. Thus, terms that are sure, that is to say those that are in the Concept Network and known to the system, deactivate more slowly. The field to which they belong receives more activation, and thus deactivates more slowly than it would if an instance had been discovered without being confirmed by the presence in the Concept Network of a term known to belong to that field.
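A direct transcription of the formula makes its bounds easy to check. The function name and the value CI(F) = 40 used in the example are assumptions chosen for illustration; the text does not give the conceptual importance of any leaf.

```python
def conceptual_importance(ci_father, occ, max_occ):
    """CI(I) = CI(F) + (100 - CI(F)) x Occ(I) / MaxOcc(I)."""
    if max_occ <= 0 or not 0 < occ <= max_occ:
        raise ValueError("expected 0 < occ <= max_occ")
    return ci_father + (100 - ci_father) * occ / max_occ


# Hypothetical leaf with CI(F) = 40 whose most frequent instance occurs 10 times.
most_frequent = conceptual_importance(40, 10, 10)
half_as_frequent = conceptual_importance(40, 5, 10)
```

The most frequent instance reaches 100, an instance occurring half as often gets 70, and every instance stays above its father's value of 40.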

The decay rate of each node depends on the number of its incoming links, and thus on the number of nodes that influence it (see activation propagation).

Indeed, the more influences a node receives, the more likely it is to be activated by one or more of its already activated influencing nodes. That is why the decay rate should be higher for nodes having many incoming links. But if the decay rate were a linear function of the number of incoming links, there would be a problem: not every influencing node is activated at a given moment. Some nodes are rarely activated, yet would greatly influence many nodes if they were. A node appearing rarely in the base would certainly have strongly weighted outgoing links towards all the terms that appeared in the same reference. Yet if it appeared only once in the base, it would a priori have almost no chance of being found in a reference to be recognized.

We therefore chose a logarithmic scale. Table 3 shows that the number of incoming links of the field instances varies between 1 and 36, with an average of 8.21 incoming links per node (for 5,895 inter-instance links). The more incoming links a node has, the higher its decay rate. The decay rate (DR) of a node depends on the number of its incoming links (IL) according to this formula:

DR = 100 - (100 x ln 3) / ln(3 + IL)

Table 3: Number of incoming links of the instances

    a  address bw category cword e  howpublished institution jw k  key word
Min 1  1       1  3        1     3  2            2           1  1  3   1
Max 23 23      27 13       24    25 11           17          36 36 10  36

    mw note number organization pages pub ref sw tw volume year
Min 1  4    1      1            1     1   1   3  1  1      1
Max 28 15   17     19           25    26  25  18 19 18     18
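To get a feel for the logarithmic scale, the decay rate formula can be evaluated at the extremes of Table 3. The function name decay_rate is an assumption; the formula itself is the one given above.

```python
import math


def decay_rate(incoming_links):
    """DR = 100 - (100 x ln 3) / ln(3 + IL), with IL the number of incoming links."""
    if incoming_links < 0:
        raise ValueError("the number of incoming links cannot be negative")
    return 100 - (100 * math.log(3)) / math.log(3 + incoming_links)


# Extremes observed in Table 3.
dr_min = decay_rate(1)
dr_max = decay_rate(36)
```

A node with a single incoming link gets a decay rate of about 20.8, and a node with 36 incoming links about 70.0. A linear scale would have spread these two cases by a factor of 36, while the logarithmic scale keeps them within a factor of about 3.4.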

Agents are described in a later node (BAsCET Agents for bibliographic reference recognition). For now, the question is what kinds of agents are needed, to which nodes they will be assigned, and what a priori urgency value they will have. The a priori urgency value is the basis for computing the real urgency value of each agent in the coderack: it is multiplied by the activation value of the father node.

Here are the different agent types:

instance seeker, which searches the Blackboard to find all the instances of the specific node that launches it
separator seeker, which searches the Blackboard to find all the instances of the separator node that launches it
field seeker, which searches the Blackboard to find all the instances of the field node that launches it, according to the separators surrounding it
zone seeker, which searches the Blackboard to find all the instances of the field node that launches it, according to what this field can contain
stop, which decides randomly, according to the temperature, whether the processing has to stop.

There are three types of nodes: fields, specific nodes, and separators. Here are the agent types and the urgency values assigned to each of these types:

A field has three agent types:

zone seeker: average urgency value = 50
field seeker: urgency value = 55, giving it a slightly higher priority than the zone seeker. It is thus run more often, and the zone seeker, when run later, can correct what the field seeker may have missed.
stop: urgency value = 5, so that it is chosen only at the end of a step and no opportunities are missed.

A specific node has only one agent type:

instance seeker: urgency value = 50, an average value lower than that of the field seeker, so that the generic search for fields keeps overall priority. It is not useful to look for known words if it turns out that no term, or very few, is present in the blackboard.

The specific nodes are the most numerous in the Concept Network. If each of them had a stop agent, this type of agent would be in a large majority in the coderack, and the behavior of the program would change: stop agents would have a higher probability of being run, so the system would stop much earlier. Such early stopping is not wanted, so specific nodes have no stop agent.

A separator also has only one agent type:

separator seeker: urgency value = 60, the highest urgency value, because it is above all on separators that one can, and must, rely to find the limits between fields. If few known terms are present, it is useless to rely on zone seekers to find fields; the only efficient agent is the one relying on the separators found. The separator seeker therefore has to be run before the field seeker (whose urgency value is 55).

An agent may have nothing to do before a certain step of the processing (the stop agent is not really appropriate in the first step, since one does not expect the problem to be solved in a single step). That is why a new value has been defined: the beginning step. It has been fixed experimentally. The first thing the system looks for is separators, so the earliest beginning step is that of the separator seeker: 0. Next, so that activation values propagate early enough in the specific part of the Concept Network and activate the right fields, the instance seekers are allowed to run (1). Then come the field seeker (6), the zone seeker (8), and the stop agent (8).

Table 4 summarizes the characteristics of the agent types in the system.

Table 4: Characteristics of the different agent types

          Agent              name   urgency  beginning step  father node
          Stop               ST     5        8               FIELD
          Field seeker       FS     55       6               FIELD
          Zone seeker        ZS     50       8               FIELD
          Instance seeker    IS     50       1               SPECIFIC
          Separator seeker   SS     60       0               SEPARATOR
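The values of Table 4 can be gathered in a small data structure to show how the beginning step and the urgency interact. This is a minimal sketch under stated assumptions, not the BAsCET implementation: the class and function names are hypothetical, the activation of the father node is taken as a plain number, and choosing an agent by a random draw weighted by real urgency is an assumption about how the coderack works.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentType:
    name: str
    urgency: int      # a priori urgency value (Table 4)
    beginning: int    # first step at which the agent may run
    father: str       # type of node that launches the agent


AGENT_TYPES = [
    AgentType("stop", 5, 8, "FIELD"),
    AgentType("field seeker", 55, 6, "FIELD"),
    AgentType("zone seeker", 50, 8, "FIELD"),
    AgentType("instance seeker", 50, 1, "SPECIFIC"),
    AgentType("separator seeker", 60, 0, "SEPARATOR"),
]


def allowed(step):
    """Agent types whose beginning step has been reached."""
    return [a for a in AGENT_TYPES if a.beginning <= step]


def real_urgency(agent, father_activation):
    """A priori urgency multiplied by the activation value of the father node."""
    return agent.urgency * father_activation


def pick(coderack, rng=random):
    """Draw one agent from a list of (AgentType, father activation) pairs.

    Assumption: the draw is weighted by real urgency. At least one
    weight must be positive.
    """
    weights = [real_urgency(agent, act) for agent, act in coderack]
    return rng.choices(coderack, weights=weights, k=1)[0][0]
```

At step 0 only the separator seeker is allowed, at step 1 the instance seeker joins it, and from step 8 onwards all five agent types can run.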

Disclaimer: as I don't speak fluently English, I accept all suggestions to improve writeups.
