Functioning of BAsCET - Everything2.com

This writeup details BAsCET's application to the recognition of bibliographic references. See BAsCET for an overview.

BAsCET's functioning depends on the task it is applied to. Let us describe how BAsCET works on the bibliographic reference problem, then look at a sample run and the results it gives, and discuss the improvements to make (coming soon).

As explained in the writeup on BAsCET architecture, one begins by registering in the Blackboard the first object that constitutes the problem to solve. In the case of bibliographic references (which we will call "references" from now on), this object is an instance of the Concept Network node field:doc, which is the root of the logical and generic parts of the model.

This object is obtained from a PostScript file containing bibliographic references. François Parmentier wrote a tool that uses the PostScript interpreter ghostscript to extract the textual and typographic data from the file. Extracting the references from the whole document is manual, but in regular cases it should be easy to detect the bibliography and isolate it.

Once the object is created in the Blackboard, its Father node in the Concept Network is activated. As this node has no agent except the stop agent, which cannot be launched before the 8th cycle, no agent is run.

Next, the activation is propagated from this node to the nodes bound to it. Naturally, the first nodes to be activated are the conceptually nearest ones, that is to say the field separators.

During the first six cycles, only separator seekers and instance seekers are run. They look for the information that is the most direct to obtain. Next, field seekers act, relying on the separators already found, which may have activated the field nodes of the Concept Network. Then zone seekers run, using all the knowledge discovered so far: instances, of course, but also field separators and the sub-fields themselves. At this point (8th cycle), BAsCET begins to plan its stopping, by allowing field nodes that are still activated to launch stop agents into the CodeRack.

Let us follow BAsCET's run on a sample reference, taken at random from a BibTeX database different from the one used to build the Concept Network (let us call the latter CNB, for Concept Network Base). The chosen sample, whose identification key is hermann88a, does not appear in the CNB, nor does its author (except as a PUBLISHER field, which could only produce an error, not help the system). This part explains the output of the program running on this example (for another sample, see Parmentier and Belaïd 1997).

BAsCET's parameters are the minimal activation value (the activation threshold above which a Concept Network node is considered activated) and the percentage of agents to run at each stage (or cycle). Here, the minimal activation value is 50 (the impact of this parameter on BAsCET's performance has not been studied in depth), and the percentage of agents to run is 40 (set by trial and error).

The XML string introduced as the problem to solve is:

<Times-Roman>M. Hermann. Vademecum of divergent term rewriting systems. In 
</Times-Roman><Times-Italic>Proceedings BCS-FACS Term Rewriting Workshop
</Times-Italic><Times-Roman>, Bristol (UK), September 1988.</Times-Roman>

This is an instance of field:doc. At the end of the first cycle (a cycle consists of activation propagation and decay in the Concept Network, launching of agents into the CodeRack, then choice and run of agents), the Blackboard Temperature value is 85, no agent was run, the CodeRack is empty, and five nodes are activated (activation value above 50).

Table 1:: BAsCET's state at the end of the first cycle.

                    Cycle Idle Temperature  SS  FS  IS  ZS  SA Rack Activated
                    ----- ---- ----------- --- --- --- --- --- ---- ---------
                       1    0          85  0/0 0/0 0/0 0/0 0/0   0         5

Table 1 sums up BAsCET's state at the end of the first cycle. Idle is zero (this is the first cycle). Temperature fell from 100 (the default value at the beginning) to 85, because the deactivation of the field:doc node affects the importance of the only object in the Blackboard. No SS, FS, IS, ZS or SA agent succeeded (none was run), the CodeRack still contains no agent, and five nodes are activated in the Concept Network after activation propagation and deactivation.

During the second cycle, two separator seekers found two kinds of separators: author-title:. and -author:<Times-Roman>.

Table 2:: BAsCET's state at the end of the second cycle.

                    Cycle Idle Temperature  SS  FS  IS  ZS  SA Rack Activated
                    ----- ---- ----------- --- --- --- --- --- ---- ---------
                       2    0          31  2/4 0/0 0/0 0/0 0/0   0         9

Table 2 shows that temperature fell sharply. Indeed, four objects were built: three author-title separators, each made of a dot followed by a space, and one beginning-of-reference separator, recognized at 100% and ideally located (at the beginning of the reference; this may seem obvious, but the program deduced it from its statistics). Of the three separators between the author field and the title field, the first is the end of an author's initial (M._Hermann), the second is the real separator, and the third is the beginning of the separator between the title and the name of the conference. The second best matches the location statistics and thus obtains the best happiness of the three (90%). As all these objects have a good happiness, their eminence is relatively low. Consequently, temperature is low too.
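The three author-title candidates can be reproduced with a plain substring search over the XML string. This is only a minimal sketch, not BAsCET's own code: `find_candidates` is a hypothetical helper, and the happiness computed from location statistics is not modelled here.

```python
import re

# The XML string given as the problem (original line breaks kept).
reference = (
    "<Times-Roman>M. Hermann. Vademecum of divergent term rewriting systems. In \n"
    "</Times-Roman><Times-Italic>Proceedings BCS-FACS Term Rewriting Workshop\n"
    "</Times-Italic><Times-Roman>, Bristol (UK), September 1988.</Times-Roman>"
)

def find_candidates(text, pattern):
    """Return (offset, word just before) for each literal occurrence of pattern."""
    hits = []
    for match in re.finditer(re.escape(pattern), text):
        # Last token before the match, splitting on whitespace and tag ends.
        before = re.split(r"[\s>]", text[:match.start()])[-1]
        hits.append((match.start(), before))
    return hits

# A dot followed by a space: the form of the author-title separator.
candidates = find_candidates(reference, ". ")
for offset, before in candidates:
    print(offset, repr(before))
```

The search returns three hits, after the initial "M", after "Hermann" and after "systems", which correspond to the three separator objects discussed above. Choosing among them is what the location statistics are for.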

In the third cycle, an FS agent, the detector of the author field, discovers the -author and author-title separators and deduces from them how to cut this field out of the field:doc object, at its beginning. Fortunately, the author-title separator with the best happiness is the real one, so the cut is right too. While building this field, the system deletes the wrong separator that already existed.

Table 3:: BAsCET's state at the end of the third cycle.

                    Cycle Idle Temperature  SS  FS  IS  ZS  SA Rack Activated
                    ----- ---- ----------- --- --- --- --- --- ---- ---------
                       3    0          34  0/1 1/2 0/0 0/0 0/0   3        23
   
                      Importance  Happiness  Eminence    Content      field
                      ----------  --------- ---------- -------------- -------
                             100         10        90      ...        doc
                              89         70        26  M. Hermann     author

Table 3 shows that not all agents were run: three remain in the CodeRack. The recognized field still has a high importance value because the field:author node is activated; its happiness depends on that of the two separators that led to its construction. The object representing the problem is still very important (100%), and its happiness begins to rise thanks to the objects already found (it is now 10), but its eminence value is still high.

During the fourth cycle, other separators are spotted: between journal and volume ("</Times-Italic><Times-Roman>, "), between title and booktitle (". In </Times-Roman><Times-Italic>"), and between month and year. Some are clearly false (this reference has neither a journal field nor a volume field), whereas others are right. For example, the large separator (33 characters) between the title and the name of the conference overlays one more wrong separator (". In </Times..."). On the other hand, many small separators between month and year are wrong, because they consist of a single space (there are many spaces in the reference, e.g. between the words of the title).

Table 4:: BAsCET's state at the end of the fourth cycle.

                    Cycle Idle Temperature  SS  FS  IS  ZS  SA Rack Activated
                    ----- ---- ----------- --- --- --- --- --- ---- ---------
                       4    0          20  3/8 0/1 0/0 0/0 0/0   0       895
   
                      Importance  Happiness  Eminence    Content      field
                      ----------  --------- ---------- -------------- -------
                             100         34        66      ...        doc
                              89         70        26  M. Hermann     author

Table 4 shows that, given the data discovered during this cycle, temperature fell to 20. This time, all the agents in the CodeRack were run. The CodeRack is empty mainly because of a BAsCET mechanism that deletes all the agents with the same name and the same Father node as the one that was just run (since an agent is supposed to find all the instances of its Father). Also, activation has reached the specific part of the Concept Network: there are 895 activated nodes (this is the part where nodes are the most numerous).
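The deletion mechanism described above can be sketched as a filter on the CodeRack. This is an illustration under assumptions: the `Agent` class, its fields and the `remove_similar` function are hypothetical names, not BAsCET's actual data structures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    name: str      # kind of agent, e.g. "separator seeker"
    father: str    # Concept Network node that launched the agent
    urgency: int

def remove_similar(coderack, executed):
    """Drop every agent sharing the executed agent's name and Father node."""
    return [a for a in coderack
            if (a.name, a.father) != (executed.name, executed.father)]

rack = [
    Agent("separator seeker", "author-title", 40),
    Agent("separator seeker", "author-title", 35),
    Agent("separator seeker", "month-year", 20),
]

# Running the first agent removes both author-title seekers at once.
rack_after = remove_similar(rack, rack[0])
print(rack_after)
```

One run can thus empty the CodeRack of many queued duplicates, which is why the Rack column may drop to 0 even when fewer agents were run than were waiting.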

Indeed, during the fifth cycle, IS agents find instances of years (1986, 1988 and 1987 are relatively close in terms of edit distance). One instance of key:iso is detected in the address (Bristol), again because of the proximity of iso and isto. Moreover, two separators between pages and year and address are found. Finally, the title field is correctly delimited, thanks to the existing separators.
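The proximity mentioned here can be checked with a standard Levenshtein distance. The writeup does not say which distance BAsCET uses, so the Levenshtein variant below (unit cost for insertion, deletion and substitution) is an assumption, and `edit_distance` is a name chosen for this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

for known, seen in [("1986", "1988"), ("1987", "1988"), ("iso", "isto")]:
    print(known, seen, edit_distance(known, seen))
```

Under this assumption, each of the three pairs is one edit apart, which would explain both the correct year instance and the spurious key instance found inside "Bristol".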

Table 5:: BAsCET's state at the end of the fifth cycle.

                    Cycle Idle Temperature  SS  FS   IS  ZS  SA Rack Activated
                    ----- ---- ----------- ---- --- ---- --- --- ---- ---------
                       5    0          18  2/19 1/4 4/22 0/0 0/0  43     4,878
   
Importance  Happiness  Eminence    Content                                      field  descriptions
----------  --------- ---------- --------------------------------------------- ------- ------------
       100         50        50  ...                                           doc               12
        89         70        26  M. Hermann                                    author             0
        90         91         8  1988                                          year               1
         1         63         0  isto                                          key                1
        89         63        32  Vademecum of divergent term rewriting systems title              0

Temperature fell again (to 18), and 45 agents were run; 43 agents remain in the CodeRack. Table 5 shows that almost 5,000 nodes are activated, which means that many specific nodes are activated.

During the sixth cycle, mostly IS agents are run (which is logical, because most of the activated nodes are specific ones). Some separators are also located (among them, separators between the words of the title).

Table 6:: BAsCET's state at the end of the sixth cycle.

                    Cycle Idle Temperature  SS  FS     IS     ZS  SA  Rack Activated
                    ----- ---- ----------- ---- --- -------- --- --- ----- ---------
                       6    0          21  3/41 0/5 24/1,436 0/0 0/0 2,211     5,769
   
Importance  Happiness  Eminence    Content                                      field  descriptions
----------  --------- ---------- --------------------------------------------- ------- ------------
       100         70        30  ...                                           doc               27
        89         77        20  M. Hermann                                    author             1
        90         95         8  1988                                          year               1
         1         63         0  isto                                          key                1
       100         82        18  Vademecum of divergent term rewriting systems title             11

Little changed for the first-level fields discovered so far, but Table 6 shows that the objects' happiness rose, due to the discovery of sub-fields or to the rediscovery of objects. For the doc field, this is normal: its number of descriptions rises (to 17). Here, 1,482 agents were run: the active part of the process has begun. 2,211 agents remain in the CodeRack, so BAsCET did run about 40% of the agents. There are even more activated nodes than in the previous cycle (even though IS agents that found nothing deactivated their Father nodes).
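The figures quoted for this cycle can be rechecked from Table 6. The sketch below assumes that the second number of each x/y pair is the number of agents run, and that the share is taken over agents run plus agents left in the CodeRack; both are readings of the tables, not statements from the writeup.

```python
# Agents run in cycle 6, read from Table 6 (second number of each pair).
run_cycle6 = {"SS": 41, "FS": 5, "IS": 1436, "ZS": 0, "SA": 0}
left_in_rack = 2211

total_run = sum(run_cycle6.values())
share_run = total_run / (total_run + left_in_rack)
print(total_run, f"{share_run:.1%}")
```

The total matches the 1,482 agents quoted in the text, and the share comes out close to the configured 40%.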

The ZS agent (zone seeker) first acts during the seventh cycle. Here, its action seems futile: from a single sub-field (a word of the month field), it deduced that a month field existed.

Table 7:: BAsCET's state at the end of the seventh cycle.

                    Cycle Idle Temperature  SS   FS     IS     ZS  SA  Rack Activated
                    ----- ---- ----------- ---- ---- -------- --- --- ----- ---------
                       7    0          17  1/22 1/16 17/2,079 1/3 0/0 1,552     5,842
   
Importance  Happiness  Eminence    Content                                      field  descriptions
----------  --------- ---------- --------------------------------------------- ------- ------------
       100         76        24  ...                                           doc               25
        90         77        20  M. Hermann                                    author             1
        90         95         4  1988                                          year               1
         1         63         0  isto                                          key                1
        96         90         9  Vademecum of divergent term rewriting systems title              7
        26         73         7  September                                     month              1

See Table 7 for more details on the seventh cycle. Note that there are slightly more activated nodes than in the previous cycle.

The ZS agent's action is more visible in the eighth cycle: it fails to detect a booktitle field but, in doing so, eliminates objects that could have prevented the future creation of this field. As a matter of fact, IS agents had found very short word or cword (chapter's word) objects, such as BC, -, Te, or B.

Table 8:: BAsCET's state at the end of the eighth cycle.

                    Cycle Idle Temperature  SS   FS     IS     ZS  SA  Rack Activated
                    ----- ---- ----------- ---- ---- -------- --- --- ----- ---------
                       8    0          14  1/45 1/24 17/1,001 1/4 0/0   116     5,853
   
Importance  Happiness  Eminence    Content                                      field   descriptions
----------  --------- ---------- --------------------------------------------- ------- --------------
       100         76        24  ...                                           doc               21
        90         84        20  M. Hermann                                    author             1
        90         96         3  1988                                          year               1
         1         81         0  isto                                          key                1
        88         90         8  Vademecum of divergent term rewriting systems title              8
        26        100         0  September                                     month              1

According to Table 8, the number of descriptions of the problem decreases, which is a good sign: BAsCET finds larger and larger descriptions, and thus more and more reliable ones. The happiness of the author field rises thanks to agents that find this field again at the same location, validating it. Alas! No agent succeeded in finding an address field, where the key field lies, in time to delete the key before it was confirmed. Its happiness thus rose, making it harder to eradicate. The importance of the title field diminishes, meaning that its activation value in the Concept Network decreases too.

The number of IS agents run was halved, which shows the reduction in the number of specific nodes activated in the Concept Network. The CodeRack has emptied: the agents that were not chosen to run during the cycle in which they were sent had several occurrences, and when one of them was run, not just that agent but all similar agents (same Father, same name) were deleted from the CodeRack. Temperature keeps falling.

Table 9:: BAsCET's state at the end of the ninth cycle.

                    Cycle Idle Temperature  SS   FS     IS     ZS  SA  Rack  Activated
                    ----- ---- ----------- ---- ---- -------- ---- --- ----- ---------
                       9    0          16  1/10 0/11   8/645  1/21 0/2   996     5,898
   
Importance  Happiness  Eminence    Content                                      field     descriptions
----------  --------- ---------- --------------------------------------------- --------- --------------
       100         65        35  ...                                           doc               27
        90         84        14  M. Hermann                                    author             1
        90         99         0  1988                                          year               1
        98         94         5  Vademecum of divergent term rewriting systems title              9
        26        100         0  September                                     month              1
        42         53        20  S-FACS                                        booktitle          3

During the ninth cycle, one ZS agent made a "mistake": it deleted a separator that had nevertheless been correctly found. It also deleted the wrong key field. A stop agent was run in spite of its low urgency value, meaning either that the number of agents strongly diminished, or that the number of stop agents in the CodeRack rose enough to form a meta-agent with a much higher virtual urgency value. Indeed, from the point of view of a weighted random choice in the CodeRack, one agent with an urgency value of 50 is equivalent to ten agents with an urgency value of 5.
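The equivalence between one agent of urgency 50 and ten agents of urgency 5 follows directly from urgency-weighted sampling. The sketch below assumes that the probability of picking an agent is its urgency divided by the sum of all urgencies in the CodeRack; `pick_probabilities` is a hypothetical helper, not part of BAsCET.

```python
import random

def pick_probabilities(urgencies):
    """Probability of each agent being drawn in an urgency-weighted choice."""
    total = sum(urgencies)
    return [u / total for u in urgencies]

# One agent of urgency 50 competing against ten agents of urgency 5.
urgencies = [50] + [5] * 10
probs = pick_probabilities(urgencies)
p_single = probs[0]          # chance of drawing the urgency-50 agent
p_group = sum(probs[1:])     # chance of drawing any of the ten others

# The same weights drive random.choices; check the share empirically.
rng = random.Random(0)
draws = rng.choices(range(len(urgencies)), weights=urgencies, k=10_000)
share_group = sum(1 for d in draws if d != 0) / len(draws)
print(p_single, p_group, share_group)
```

The group of ten low-urgency agents is drawn as often as the single high-urgency one, which is the "meta-agent" effect: many stop agents with low urgency can collectively win the draw.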

Table 9 shows that the doc field's happiness fell because of the disappearance of the title-booktitle separator, which was very large and thus had much influence on the happiness of the field it described. The year field was confirmed many times by instance seekers of years, so it now has a happiness value of 99%. The more descriptions the title has, the more its eminence dips. A booktitle field was found, even though it is still very incomplete.

Only 689 agents were run during this step. Temperature rises a bit, due to the loss of a large piece of information (the title-booktitle separator). Fewer instance seekers were launched, meaning that most of them had already searched the Blackboard for specific nodes and deactivated their Father nodes. The propagation mechanism reactivated the corresponding nodes (the number of activated nodes keeps rising).

Table 10:: BAsCET's state at the end of the tenth cycle.

                    Cycle Idle Temperature  SS   FS     IS     ZS  SA  Rack  Activated
                    ----- ---- ----------- ---- ---- -------- ---- --- ----- ---------
                      10    0          17  1/16 1/18 28/1,552 0/14 0/6 1,804     5,938
   
Importance  Happiness  Eminence    Content                                      field     descriptions
----------  --------- ---------- --------------------------------------------- --------- --------------
       100         70        30  ...                                           doc               27
        90         84        14  M. Hermann                                    author             1
        90         99         0  1988                                          year               1
        97         90         9  Vademecum of divergent term rewriting systems title              8
        26        100         0  September                                     month              1
        43         66        14  S-FACS                                        booktitle          4

Table 10 shows little that is new, except that the CodeRack begins to fill up again and that more IS agents were run than in the previous cycle.

Table 11:: BAsCET's state at the end of the eleventh cycle.

                    Cycle Idle Temperature  SS   FS     IS     ZS  SA  Rack Activated
                    ----- ---- ----------- ---- ---- -------- --- --- ----- ---------
                      11    0          17  3/28 0/20 13/1,377 1/5 1/6 1,718     5,936
   
Importance  Happiness  Eminence    Content                                      field     descriptions
----------  --------- ---------- --------------------------------------------- --------- --------------
       100         77        23  ...                                           doc               20
        90         84        14  M. Hermann                                    author             1
        90         99         0  1988                                          year               1
        98         87        12  Vademecum of divergent term rewriting systems title              9
        14        100         0  September                                     month              1
        43         66        14  S-FACS                                        booktitle          4
         2         60         1  Te                                            chapter            1

In the eleventh cycle, many fields are confirmed, the title-booktitle separator is found again, a (wrong) chapter field is discovered, and a stop agent decides that, with a temperature value of 17, it is time to stop the process.

The number of descriptions of the doc field strongly fell (cf. Table 11) thanks to the discovery of the large title-booktitle separator.

To evaluate what the system yields, several estimations can be used. One can compute a similarity value between the yielded fields and the expected fields.

Table 12:: Yielded - expected fields comparison.

          field      yielded                 expected                    similitude lengths
          =================================================================================
          author     M. Hermann              M. Hermann                     100%     10/10
          ---------------------------------------------------------------------------------
          booktitle  S-FACS                  Proceedings BCS-FACS Term       48%      6/44
                                             Rewriting Workshop
          ---------------------------------------------------------------------------------
          chapter    Te                                                       0%      2/0
          ---------------------------------------------------------------------------------
          month      September               September                      100%      9/9
          ---------------------------------------------------------------------------------
          title      Vademecum of divergent  Vademecum of Divergent          99%     45/45
                     term rewriting systems  Term Rewriting Systems
          ---------------------------------------------------------------------------------
          year       1988                    1988                           100%      4/4
          ---------------------------------------------------------------------------------
          address                            Bristol (UK)                     0%      0/12

Table 12 gives the similitude percentage between each yielded field and the expected one. The title field gets a similitude value of 99%; the difference is only due to the reference formatter, BibTeX, which lowercases the capitals of this field when the reference is of type inproceedings and the bibliographic style is plain.

There are two measures of the exactness of the solution: the first takes into account the fields' balance (let us call it BM, for Balance Measure) and their similitude with the expected fields; the second takes into account only their similitude and their number (let us call it NM, for Number Measure). For this example, BM is 71%, whereas NM is 74%. This means that nearly 74% of the fields were recognized, and that 71% of the reference was recognized. Note that these values are close to the happiness value of the problem in the Blackboard (77%), so the system's own evaluation of the quality of its "description" of this problem is quite good in this particular case.
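The writeup does not give the formulas for BM and NM. One reading that happens to reproduce both reported values from Table 12 is sketched below: BM as the similitude averaged with each field weighted by its expected length, and NM as the sum of similitudes divided by the number of expected fields, both truncated to an integer. This is a reconstruction under assumptions, not a confirmed definition, and the function names are hypothetical.

```python
# (field, similitude %, expected length) copied from Table 12.
rows = [
    ("author",    100, 10),
    ("booktitle",  48, 44),
    ("chapter",     0,  0),
    ("month",     100,  9),
    ("title",      99, 45),
    ("year",      100,  4),
    ("address",     0, 12),
]

def balance_measure(rows):
    """Assumed BM: similitude weighted by the expected length of each field."""
    total_length = sum(length for _, _, length in rows)
    return sum(sim * length for _, sim, length in rows) / total_length

def number_measure(rows):
    """Assumed NM: summed similitude divided by the number of expected fields."""
    expected = sum(1 for _, _, length in rows if length > 0)
    return sum(sim for _, sim, _ in rows) / expected

bm = balance_measure(rows)
nm = number_measure(rows)
print(int(bm), int(nm))  # truncated, like the rates in Table 13
```

With these assumptions the long, poorly recognized booktitle field weighs heavily on BM, while the spurious chapter field (expected length 0) only affects NM through its zero similitude.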

From a certain point of view, the BAsCET system looks for information (the fields) inside a "database" (the problem itself). To better evaluate the system, the evaluation therefore borrows two measures from the information retrieval domain: recall and precision. Kerpedjiev 1991 also used this evaluation. More precisely, according to Smaïl 1994:

Recall rate: proportion of pertinent elements actually found, related to the total number of pertinent elements (F/P)
Precision rate: proportion of elements found actually pertinent, related to the total number of found (or discovered) elements (F/D)

The higher the recall rate, the less silence there is in the result. The higher the precision, the less noise there is (see Figure 1).

Figure 1:: Silence and noise notions.

                              D %%%%%%%%%%%%%%%%%%%%
                                %% found elements %%  silence
                                %%%%%%%%%%%%%%%%%%%%     |
                                %%%%%%%#############*****|********
                                %%^%%%%##### F #####*****v********
                                %%|%%%%#############**************
                                  |    ***************************
                               noise   ***** pertinent elements **
                                       *************************** P

Therefore, for this reference, recall is 83% (5 fields correctly found out of 6) and precision is 83% too (of the 6 proposed fields, only 5 are pertinent). This evaluation does not consider the fields' content, only the names of the fields found. Thus, although the booktitle field is incomplete, it counts as a pertinent field.
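The recall and precision figures can be recomputed from the field names of Table 12 using the F/P and F/D definitions given earlier. This is a minimal sketch; the variable names are chosen for illustration.

```python
# Field names only, as in Table 12 (content is ignored for this evaluation).
yielded = {"author", "booktitle", "chapter", "month", "title", "year"}   # D
expected = {"author", "booktitle", "month", "title", "year", "address"}  # P

found = yielded & expected             # F: pertinent fields actually found
recall = len(found) / len(expected)    # F / P
precision = len(found) / len(yielded)  # F / D

silence = expected - yielded           # pertinent fields that were missed
noise = yielded - expected             # proposed fields that are not pertinent

print(f"recall={recall:.0%} precision={precision:.0%}")
print("silence:", silence, "noise:", noise)
```

Here the silence is the missing address field and the noise is the spurious chapter field.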

For this run, depending on what the user wanted, the answer may or may not be satisfactory. For locating fields, for example (if it is enough to locate words inside given fields), the answer is sufficient. For adding the reference to a database, it is far less so. The efficiency of a system is also measured by its speed, which can be measured here by the number of agents run (a real measure of time would depend greatly on the configuration of the machine used to run the program). During the 11 cycles of the process, 8,467 agents were run. Some of them represent a heavier load than others. Table 13 shows the distribution of these agents. Be aware that this distribution is valid only for this particular run, even if it gives orders of magnitude. BAsCET Results and Interpretations gives numbers based on many more runs, which are thus more reliable.

Table 13:: Distribution of the agents run during the processing of the reference hermann88a.

                                SS        IS       FS    ZS     SA
               =======================================================
               Total           194     8,112      101    47     13
               Average/cycle    17.6     737.4      9.1   4.2    1.1
               Percentage        2.3      95.8      1.2   0.5    0.2
               -------------------------------------------------------
               Successful       17       111        5     4      1
               Success rate      0.08      0.01     0.04  0.08   0.07

The Concept Network contains 6,295 specific nodes, and 8,112 instance seekers (the agents of these nodes) were run. This means that, if every one of these nodes had been activated at least once, each would have run at least one agent (the average is 1.28 agents per node). It also suggests that, to improve the system's speed, reducing the number of activated specific nodes might be sufficient, since their agents have the lowest success rate (0.01 versus about 0.07 for the others). But by what criterion should the nodes be selected? How can only the pertinent nodes be considered as activated? A full study of the influence of the node activation threshold on the system's results is needed.
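The derived rows of Table 13 and the agents-per-node average can be recomputed from the totals. The sketch below only uses numbers from Table 13 and from the text; the table appears to truncate rather than round its values, so the printed figures may differ in the last digit.

```python
# Totals and successes per agent type, copied from Table 13.
totals = {"SS": 194, "IS": 8112, "FS": 101, "ZS": 47, "SA": 13}
successes = {"SS": 17, "IS": 111, "FS": 5, "ZS": 4, "SA": 1}
cycles = 11
specific_nodes = 6295

all_run = sum(totals.values())
for kind in totals:
    share = 100 * totals[kind] / all_run         # percentage of all agents run
    per_cycle = totals[kind] / cycles            # average per cycle
    rate = successes[kind] / totals[kind]        # success rate
    print(f"{kind}: {share:.2f}% of agents, {per_cycle:.2f} per cycle, "
          f"success rate {rate:.3f}")

# Instance seekers run per specific node of the Concept Network.
is_per_node = totals["IS"] / specific_nodes
print(f"IS agents per specific node: {is_per_node:.3f}")
```

The instance seekers dominate the load while succeeding far less often than the other agent types, which is the basis for the speed-up question raised above.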

Bibliography

Parmentier and Belaïd 1997: F. Parmentier and A. Belaïd.
Logical Structure Recognition of Scientific Bibliographic References.
In ICDAR'97, volume 2, pages 1072-1076, Ulm, Germany, August 18-20 1997. IEEE. Available at ftp://ftp.loria.fr/pub/loria/read/publications/parmenti-icdar97.ps.
Kerpedjiev 1991: S. M. Kerpedjiev.
Automatic Extraction of Information Structures from Documents.
In First International Conference on Document Analysis and Recognition (ICDAR'91), volume 2, St Malo, France, 1991.
Smaïl 1994: M. Smaïl.
Raisonnement à base de cas pour une recherche évolutive d'information; Prototype Cabri-n. Vers la définition d'un cadre d'acquisition de connaissances. Thèse de doctorat, Université Henri Poincaré -- Nancy I, 14 octobre 1994.
