This writeup details BAsCET's application to the recognition of bibliographic references. See BAsCET for an overview.
BAsCET works differently depending on the task it is applied to. This writeup describes how it operates on the bibliographic reference problem, walks through a sample run, presents the results obtained, and discusses the improvements to make (coming soon).
As explained in the writeup on BAsCET architecture, one begins by registering in the Blackboard the first object that constitutes the problem to solve. In the case of bibliographic references (called "references" from now on), it is an instance of the Concept Network node field:doc, which is the root of the logical and generic parts of the model.
This object is obtained from a PostScript file containing bibliographic references. François Parmentier wrote a tool that uses the PostScript interpreter ghostscript to extract the textual and typographic data from the file. The references are currently extracted from the whole document by hand, but in regular cases it should be easy to detect the bibliography and isolate it.
Once the object is created in the Blackboard, its father node in the Concept Network is activated. As this node has no agent, except the stop agent, which cannot be launched before the 8th cycle, no agent is run.
Next, activation is propagated from this node to the nodes bound to it. Naturally, the first nodes to be activated are the conceptually nearest ones, that is to say the field separators.
During the first six cycles, only separator seekers and instance seekers are run. They look for the information that is most directly obtainable. Next, field seekers act on the separators already found, which can activate the field nodes of the Concept Network. Then zone seekers run, using all the knowledge discovered so far: instances, of course, but also field separators and the sub-fields themselves. At this point (8th cycle), BAsCET begins to plan its stopping by allowing field nodes that are still activated to launch stop agents into the CodeRack.
Let us follow BAsCET's run on a sample reference, taken at random from a BibTeX database different from the one used to build the Concept Network (let us call the latter CNB, for Concept Network Base). The chosen sample, whose identification key is hermann88a, does not appear in the CNB, nor does its author (except in a PUBLISHER field, which could only cause an error, not help the system). This part explains the output of the program running on this example (for another sample, see Parmentier and Belaïd 1997).
BAsCET's parameters are the minimal activation value (the activation threshold above which a Concept Network node is considered activated) and the percentage of agents to run at each stage (or cycle). Here, the minimal activation value is 50 (the impact of this parameter on BAsCET's performance has not been studied in depth), and the percentage of agents to run is 40 (set by trial and error).
The XML string introduced as the problem to solve is:
<Times-Roman>M. Hermann. Vademecum of divergent term rewriting systems. In
</Times-Roman><Times-Italic>Proceedings BCS-FACS Term Rewriting Workshop
</Times-Italic><Times-Roman>, Bristol (UK), September 1988.</Times-Roman>
This is an instance of field:doc. At the end of the first cycle (a cycle consists of activation propagation and decay in the Concept Network, the launch of agents into the CodeRack, and the choice and run of agents), the Blackboard temperature is 85, no agent has been run, the CodeRack is empty, and five nodes are activated (activation value above 50).
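To make the structure of this input concrete, here is a minimal sketch that splits the marked-up string into typographic runs. It is not BAsCET's extraction tool: the names `font_runs` and `RUN` are hypothetical, and the line breaks of the string are assumed to stand for single spaces.

```python
import re

# Input string as given above; line breaks are ASSUMED to stand for spaces.
reference = (
    "<Times-Roman>M. Hermann. Vademecum of divergent term rewriting systems. In "
    "</Times-Roman><Times-Italic>Proceedings BCS-FACS Term Rewriting Workshop "
    "</Times-Italic><Times-Roman>, Bristol (UK), September 1988.</Times-Roman>"
)

# A run is <Font>text</Font>; the backreference \1 forces matching tag names.
RUN = re.compile(r"<([A-Za-z-]+)>(.*?)</\1>", re.DOTALL)


def font_runs(marked_up):
    """Return the (font_name, text) pairs of the string, in reading order."""
    return [(m.group(1), m.group(2)) for m in RUN.finditer(marked_up)]


runs = font_runs(reference)
for font, text in runs:
    print(f"{font:13} | {text}")

# The text alone, without typographic markup.
plain_text = "".join(text for _, text in runs)
```

The font changes are kept rather than discarded because, as the run shows, they take part in the separators that the system looks for.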
Table 1:: BAsCET's state at the end of the first cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- --- --- --- --- --- ---- ---------
1 0 85 0/0 0/0 0/0 0/0 0/0 0 5
Table 1 sums up BAsCET's state at the end of the first cycle. Idle is zero (this is the first cycle). Temperature fell from 100 (the default starting value) to 85 because of the deactivation of the field:doc node, which influences the importance of the only object in the Blackboard. In the SS, FS, IS, ZS and SA columns, each cell gives the number of successful agents over the number of agents run: here, no agent of any type was run. The CodeRack still contains no agent, and five nodes are activated in the Concept Network after activation propagation and deactivation.
During the second cycle, two separator seekers found two kinds of separators: author-title:. and -author:<Times-Roman>.
Table 2:: BAsCET's state at the end of the second cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- --- --- --- --- --- ---- ---------
2 0 31 2/4 0/0 0/0 0/0 0/0 0 9
Table 2 shows that temperature fell sharply. Four objects were built: three author-title separators, each made of one dot followed by one space, and one beginning-of-reference separator, recognized at 100% and ideally located (at the beginning of the reference, which may seem obvious, but the program deduced it from its statistics). Of the three separators between the author field and the title field, the first is the end of an author's initial (M._Hermann), the second is the real separator, and the third is the beginning of the separator between the title and the name of the conference. The second best matches the location statistics and thus obtains the highest happiness of the three (90%). As all these objects have a good happiness, their eminence is relatively low. Consequently, temperature is also low.
In the third cycle, an FS agent that detects the author field discovers the -author and author-title separators, and deduces from them the extent of this field at the beginning of the field:doc object. Fortunately, the author-title separator with the best happiness is the real one, so the field is delimited correctly. While building this field, the system deletes the wrong separator found earlier.
Table 3:: BAsCET's state at the end of the third cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- --- --- --- --- --- ---- ---------
3 0 34 0/1 1/2 0/0 0/0 0/0 3 23
Importance Happiness Eminence Content field
---------- --------- ---------- -------------- -------
100 10 90 ... doc
89 70 26 M. Hermann author
Table 3 shows that not all agents were run: three remain in the CodeRack. The recognized field still has a high importance because the field:author node is activated; its happiness depends on that of the two separators that triggered its construction. The object representing the problem is still very important (100%). Its happiness begins to rise thanks to the objects already found (it reaches 10), but its eminence is still high.
During the fourth cycle, other separators are spotted: between journal and volume ("</Times-Italic><Times-Roman>, "), between title and booktitle (". In </Times-Roman><Times-Italic>"), and between month and year. Some are clearly false (this reference has neither a journal field nor a volume field), whereas others are right. For example, the large separator (33 characters) between the title and the name of the conference covers one more wrong separator (". In </Times..."). On the other hand, many small separators between month and year are wrong because they consist of a single space (there are many spaces in the reference, e.g. between the words of the title).
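The text does not give the formula linking importance, happiness and eminence. As a hedged illustration only, most rows of the tables are consistent with an eminence equal to the importance scaled by the "unhappiness" of the object; the function below is this assumed relation, not BAsCET's documented code, and a few rows of the later tables deviate from it.

```python
def eminence(importance, happiness):
    """ASSUMED relation: an object is eminent when it is important but unhappy.

    Inputs and result are percentages in [0, 100].
    """
    return importance * (100 - happiness) / 100


# (importance, happiness, eminence as printed) rows copied from Tables 3 to 5.
rows = [
    (100, 10, 90),  # doc, cycle 3
    (89, 70, 26),   # author, cycle 3
    (100, 34, 66),  # doc, cycle 4
    (90, 91, 8),    # year, cycle 5
    (89, 63, 32),   # title, cycle 5
]

for importance, happiness, printed in rows:
    computed = eminence(importance, happiness)
    print(f"importance={importance:3} happiness={happiness:3} "
          f"printed={printed:3} computed={computed:.1f}")
```

Under this reading, an object that is both important and poorly described draws attention, which matches the remark that well-described objects have a low eminence and keep the temperature low.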
Table 4:: BAsCET's state at the end of the fourth cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- --- --- --- --- --- ---- ---------
4 0 20 3/8 0/1 0/0 0/0 0/0 0 895
Importance Happiness Eminence Content field
---------- --------- ---------- -------------- -------
100 34 66 ... doc
89 70 26 M. Hermann author
Table 4 shows that, given the data discovered during this cycle, temperature fell to 20. This time, no agent is left in the CodeRack. However, it is empty mainly because of a BAsCET mechanism that deletes all the agents with the same name and the same Father node as the one that was just run (since an agent is supposed to find all the instances of its Father). Also, activation has reached the specific part of the Concept Network: there are 895 activated nodes (this is the part where nodes are the most numerous).
As a result, during the fifth cycle, IS agents find instances of years (1986, 1988 and 1987 are relatively close in terms of edit distance). One instance of key:iso is detected in the address (Bristol), again because of the proximity of iso and isto. Moreover, two separators between pages and year and address are found. Finally, the title field is correctly delimited, thanks to the existing separators.
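The deletion mechanism described above can be sketched as follows. This is an illustration under assumptions: the `Agent` class, its fields and the agent labels are hypothetical, and only the rule itself (same name and same Father node) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str     # kind of agent (hypothetical label)
    father: str   # Concept Network node that launched the agent
    urgency: int


def remove_twins(code_rack, executed):
    """Drop every pending agent sharing the executed agent's name and Father."""
    key = (executed.name, executed.father)
    return [agent for agent in code_rack if (agent.name, agent.father) != key]


rack = [
    Agent("separator-seeker", "month-year", 5),
    Agent("separator-seeker", "month-year", 5),
    Agent("separator-seeker", "title-booktitle", 7),
    Agent("field-seeker", "author", 9),
]

executed = rack[0]
rack = remove_twins(rack, executed)
print([(agent.name, agent.father) for agent in rack])
```

Running one of the two month-year separator seekers removes both of them, which is why the CodeRack can empty much faster than the number of agents actually run would suggest.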
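The proximity invoked here can be made concrete with a small sketch. The text does not say which edit distance BAsCET uses; the classic Levenshtein distance is assumed below, and `edit_distance` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("1988", "1986"))
print(edit_distance("1988", "1987"))
print(edit_distance("iso", "isto"))
```

Each of these pairs differs by a single edit, which explains both the correct detection of the year and the spurious key found inside Bristol.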
Table 5:: BAsCET's state at the end of the fifth cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- --- ---- --- --- ---- ---------
5 0 18 2/19 1/4 4/22 0/0 0/0 43 4,878
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- ------- ------------
100 50 50 ... doc 12
89 70 26 M. Hermann author 0
90 91 8 1988 year 1
1 63 0 isto key 1
89 63 32 Vademecum of divergent term rewriting systems title 0
Temperature fell further (to 18), and 45 agents were run; 43 agents remain in the CodeRack. Table 5 shows that almost 5,000 nodes are activated, which means that many specific nodes are activated.
During the sixth cycle, mostly IS agents are run (which is logical, because most of the activated nodes are specific ones). Some separators are also located (among them, separators between the words of the title).
Table 6:: BAsCET's state at the end of the sixth cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- --- -------- --- --- ----- ---------
6 0 21 3/41 0/5 24/1,436 0/0 0/0 2,211 5,769
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- ------- ------------
100 70 30 ... doc 27
89 77 20 M. Hermann author 1
90 95 8 1988 year 1
1 63 0 isto key 1
100 82 18 Vademecum of divergent term rewriting systems title 11
Concerning the first-level fields discovered, little has changed, but Table 6 shows that the happiness of the objects rose, owing to the discovery of sub-fields or to the rediscovery of objects. For the doc field, this is normal: its number of descriptions rises (to 27 in Table 6). Here, 1,482 agents were run: the active part of the process has begun. 2,211 agents remain in the CodeRack, so BAsCET ran about 40% of the agents. There are even more activated nodes than in the previous cycle (even though the IS agents that found nothing deactivated their Father nodes).
The ZS agent (zone seeker) acts for the first time during the seventh cycle. Here, its action seems futile: it deduced, from a single sub-field (a word of the month field), that a month field existed.
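The figure of about 40% can be checked from the sixth-cycle row of Table 6. In the sketch below, the second number of each x/y cell is read as the number of agents run, and the pool of available agents is assumed to be the agents run plus those left in the CodeRack.

```python
# Number of agents run per type during the sixth cycle (Table 6).
run_by_type = {"SS": 41, "FS": 5, "IS": 1436, "ZS": 0, "SA": 0}
remaining_in_rack = 2211

agents_run = sum(run_by_type.values())
# ASSUMPTION: agents available during the cycle = agents run + agents left over.
available = agents_run + remaining_in_rack
fraction_run = agents_run / available

print(agents_run, f"{fraction_run:.1%}")
```

The ratio agrees with the percentage of agents to run chosen as a parameter for this run.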
Table 7:: BAsCET's state at the end of the seventh cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- ---- -------- --- --- ----- ---------
7 0 17 1/22 1/16 17/2,079 1/3 0/0 1,552 5,842
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- ------- ------------
100 76 24 ... doc 25
90 77 20 M. Hermann author 1
90 95 4 1988 year 1
1 63 0 isto key 1
96 90 9 Vademecum of divergent term rewriting systems title 7
26 73 7 September month 1
The reader can look at Table 7 for more details on the seventh cycle. Note that there are slightly more activated nodes than in the previous cycle.
The ZS agent's action is more visible in the eighth cycle: it fails to detect a booktitle field but, in doing so, eliminates objects that could have prevented this field from being created later. Indeed, IS agents had found very short word or cword (chapter's word) objects, such as BC, -, Te, or B.
Table 8:: BAsCET's state at the end of the eighth cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- ---- -------- --- --- ----- ---------
8 0 14 1/45 1/24 17/1,001 1/4 0/0 116 5,853
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- ------- --------------
100 76 24 ... doc 21
90 84 20 M. Hermann author 1
90 96 3 1988 year 1
1 81 0 isto key 1
88 90 8 Vademecum of divergent term rewriting systems title 8
26 100 0 September month 1
According to Table 8, the number of descriptions of the problem decreases, which is a good sign: BAsCET finds larger and larger descriptions, and thus more and more reliable ones. The happiness of the author field rises thanks to agents that find this field again at the same location, which validates it. Alas, no agent managed to find an address field, in which the key field lies, in time to delete the key before it was confirmed. Its happiness thus rose, making it harder to eradicate. The importance of the title field diminishes, meaning that its activation value in the Concept Network is decreasing too.
The number of IS agents run was halved, which shows that fewer specific nodes are activated in the Concept Network. The CodeRack empties because the agents that were not chosen during the cycle in which they were posted had several occurrences, and when one of them was finally run, not just that one agent but all similar agents (same Father, same name) were deleted from the CodeRack. Temperature keeps falling.
Table 9:: BAsCET's state at the end of the ninth cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- ---- -------- ---- --- ----- ---------
9 0 16 1/10 0/11 8/645 1/21 0/2 996 5,898
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- --------- --------------
100 65 35 ... doc 27
90 84 14 M. Hermann author 1
90 99 0 1988 year 1
98 94 5 Vademecum of divergent term rewriting systems title 9
26 100 0 September month 1
42 53 20 S-FACS booktitle 3
During the ninth cycle, one ZS agent made a "mistake": it deleted a separator that had nevertheless been correctly found. It also deleted the wrong key field. A stop agent was run in spite of its low urgency value, which means either that the total number of agents decreased sharply, or that the number of stop agents in the CodeRack rose enough to form a meta-agent with a much higher virtual urgency. Indeed, from the point of view of a weighted random choice in the CodeRack, one agent with an urgency of 50 is equivalent to ten agents with an urgency of 5.
Table 9 shows that the happiness of the doc field fell because of the disappearance of the title-booktitle separator, which was very large and thus had much influence on the happiness of the field it described. The year field was confirmed many times by instance seekers of years, so its happiness is now 99%. The more descriptions the title has, the more its eminence dips. A booktitle field was found, even if it is still very incomplete.
Only 689 agents were run during this step. Temperature rises a bit because a large piece of information was lost (the title-booktitle separator). Fewer instance seekers were launched, meaning that most of them had already searched the Blackboard for specific nodes and deactivated their Father nodes. The propagation mechanism reactivated the corresponding nodes (the number of activated nodes keeps rising).
Table 10:: BAsCET's state at the end of the tenth cycle.
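The equivalence between one urgent agent and ten weak ones follows directly from a draw weighted by urgency. The sketch below assumes that the selection probability is proportional to the urgency value; the agent names are hypothetical, and the sampling uses Python's standard `random.choices`.

```python
import random
from collections import Counter


def selection_probabilities(urgencies):
    """Probability of each agent under a draw weighted by its urgency."""
    total = sum(urgencies.values())
    return {name: urgency / total for name, urgency in urgencies.items()}


# One urgent agent against ten identical low-urgency stop agents.
urgencies = {"field-seeker": 50}
urgencies.update({f"stop-{i}": 5 for i in range(10)})

probabilities = selection_probabilities(urgencies)
stop_share = sum(p for name, p in probabilities.items() if name.startswith("stop-"))
print(probabilities["field-seeker"], stop_share)

# Empirical check with a seeded generator, grouping the stop agents together.
rng = random.Random(0)
names = list(urgencies)
draws = rng.choices(names, weights=[urgencies[n] for n in names], k=10_000)
counts = Counter("stop" if n.startswith("stop-") else n for n in draws)
print(counts)
```

Taken together, the ten stop agents are as likely to be drawn as the single agent of urgency 50, which is the "meta-agent" effect described above.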
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- ---- -------- ---- --- ----- ---------
10 0 17 1/16 1/18 28/1,552 0/14 0/6 1,804 5,938
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- --------- --------------
100 70 30 ... doc 27
90 84 14 M. Hermann author 1
90 99 0 1988 year 1
97 90 9 Vademecum of divergent term rewriting systems title 8
26 100 0 September month 1
43 66 14 S-FACS booktitle 4
Table 10 shows little that is new, except that the CodeRack begins to fill up again and that more IS agents were run than in the previous cycle.
Table 11:: BAsCET's state at the end of the eleventh cycle.
Cycle Idle Temperature SS FS IS ZS SA Rack Activated
----- ---- ----------- ---- ---- -------- --- --- ----- ---------
11 0 17 3/28 0/20 13/1,377 1/5 1/6 1,718 5,936
Importance Happiness Eminence Content field descriptions
---------- --------- ---------- --------------------------------------------- --------- --------------
100 77 23 ... doc 20
90 84 14 M. Hermann author 1
90 99 0 1988 year 1
98 87 12 Vademecum of divergent term rewriting systems title 9
14 100 0 September month 1
43 66 14 S-FACS booktitle 4
2 60 1 Te chapter 1
In the eleventh cycle, many fields are confirmed, the title-booktitle separator is found again, a (wrong) chapter field is discovered, and a stop agent decides that, with a temperature of 17, it is time to stop the process.
The number of descriptions of the doc field fell sharply (cf. Table 11) thanks to the rediscovery of the large title-booktitle separator.
To evaluate what the system yields, several estimations can be used. One can compute a similarity value between the yielded fields and the expected fields.
Table 12:: Yielded - expected fields comparison.
field      yielded                 expected                   similitude  lengths
=================================================================================
author     M. Hermann              M. Hermann                 100%        10/10
---------------------------------------------------------------------------------
booktitle  S-FACS                  Proceedings BCS-FACS Term  48%         6/44
                                   Rewriting Workshop
---------------------------------------------------------------------------------
chapter    Te                                                 0%          2/0
---------------------------------------------------------------------------------
month      September               September                  100%        9/9
---------------------------------------------------------------------------------
title      Vademecum of divergent  Vademecum of Divergent     99%         45/45
           term rewriting systems  Term Rewriting Systems
---------------------------------------------------------------------------------
year       1988                    1988                       100%        4/4
---------------------------------------------------------------------------------
address                            Bristol (UK)               0%          0/12
Table 12 gives the similarity percentage (the similitude column) between each yielded field and the corresponding expected field. The title field gets a similarity of only 99% because of the reference formatter, BibTeX, which removes the uppercase letters from this field when the reference is of type inproceedings and the bibliographic style is plain.
There are two measures of the exactness of the solution. The first one takes into account the balance of the fields and their similarity to the expected fields (let us call it BM, for Balance Measure); the second one takes into account only their similarity and their number (let us call it NM, for Number Measure). For this example, BM is 71% and NM is 74%. This means that nearly 74% of the fields were recognized, and that 71% of the reference was recognized. Note that these values are close to the happiness of the problem in the Blackboard (77%), so the system's own evaluation of the quality of its "description" of this problem is quite good in this particular case.
From a certain point of view, BAsCET looks for information (the fields) inside a "database" (the problem itself). To evaluate the system better, measures from the information retrieval domain were therefore also used: recall and precision. Kerpedjiev 1991 also used this evaluation. More precisely, according to Smaïl 1994:
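The exact formulas of BM and NM are not given here. The sketch below is one possible reading, stated as an assumption: NM sums the similarities and divides by the number of expected fields, and BM averages the similarities weighted by the expected field lengths. With the figures of Table 12, this reading gives values that truncate to the reported 74% and 71%; the function names are hypothetical.

```python
# (field, similarity in %, expected length in characters) from Table 12.
fields = [
    ("author", 100, 10),
    ("booktitle", 48, 44),
    ("chapter", 0, 0),    # yielded but not expected
    ("month", 100, 9),
    ("title", 99, 45),
    ("year", 100, 4),
    ("address", 0, 12),   # expected but not yielded
]


def number_measure(fields):
    """ASSUMED NM: sum of similarities over the number of expected fields."""
    expected_count = sum(1 for _, _, length in fields if length > 0)
    return sum(similarity for _, similarity, _ in fields) / expected_count


def balance_measure(fields):
    """ASSUMED BM: similarities averaged with expected lengths as weights."""
    total_length = sum(length for _, _, length in fields)
    weighted = sum(similarity * length for _, similarity, length in fields)
    return weighted / total_length


nm = number_measure(fields)
bm = balance_measure(fields)
print(f"NM = {nm:.1f}%, BM = {bm:.1f}%")
```

Under this reading, the long but poorly recognized booktitle field weighs more on BM than on NM, which would explain why BM is the lower of the two.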
- Recall rate
- proportion of pertinent elements actually found, related to the total number of pertinent elements (F/P)
- Precision rate
- proportion of elements found actually pertinent, related to the total number of found (or discovered) elements (F/D)
The higher the recall rate, the less silent the result. The higher the precision, the less noise there is (see Figure 1).
Figure 1:: Silence and noise notions.
D %%%%%%%%%%%%%%%%%%%%
%% found elements %% silence
%%%%%%%%%%%%%%%%%%%% |
%%%%%%%#############*****|********
%%^%%%%##### F #####*****v********
%%|%%%%#############**************
| ***************************
noise ***** pertinent elements **
*************************** P
Therefore, for this reference, recall is 83% (5 fields correctly found among the 6 expected) and precision is 83% too (among the 6 proposed fields, only 5 are pertinent). These figures do not consider the content of the fields, only the names of the fields found. Thus, although the booktitle field is incomplete, it is counted as a pertinent field.
Depending on what the user wants, this answer may or may not be satisfactory. For locating fields (for example, if it is enough to find words inside given fields), the answer is sufficient. For adding the reference to a database, it is far less so. The efficiency of a system is also measured by its speed, which can be measured here by the number of agents run (an actual time measurement would depend heavily on the configuration of the machine running the program). During the 11 cycles of the process, 8,467 agents were run. Some of them represent a heavier load than others. Table 13 shows the distribution of these agents. Be aware that this distribution is valid only for this particular run, even if it gives orders of magnitude. BAsCET Results and Interpretations gives figures based on many more runs, which are thus more reliable.
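These two rates can be recomputed from the field names of Table 12 alone. The sketch below applies the F/P and F/D definitions given above to sets of names; `recall_precision` is a hypothetical helper.

```python
def recall_precision(found, pertinent):
    """Return (recall, precision) computed on sets of field names."""
    correct = found & pertinent               # F: found and pertinent
    recall = len(correct) / len(pertinent)    # F / P
    precision = len(correct) / len(found)     # F / D
    return recall, precision


# Field names taken from Table 12.
found = {"author", "booktitle", "chapter", "month", "title", "year"}
pertinent = {"author", "booktitle", "month", "title", "year", "address"}

recall, precision = recall_precision(found, pertinent)
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
print("silence:", pertinent - found)
print("noise:", found - pertinent)
```

In the terms of Figure 1, the missing address field is the silence and the spurious chapter field is the noise.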
Table 13:: Distribution of the agents run during the processing of the reference hermann88a.
               SS      IS      FS     ZS     SA
=======================================================
Total          194     8,112   101    47     13
Average/cycle  17.6    737.4   9.1    4.2    1.1
Percentage     2.3     95.8    1.2    0.5    0.2
-------------------------------------------------------
Successful     17      111     5      4      1
Success rate   0.08    0.01    0.04   0.08   0.07
The Concept Network contains 6,295 specific nodes, and 8,112 instance seekers (the agents of these nodes) were run. This means that, if every one of these nodes had been activated at least once, each would have run at least one agent (the average is 1.28 agents per node). It also suggests that, to improve the system's speed, it should be sufficient to reduce the number of activated specific nodes, since their agents have the lowest success rate (0.01 versus about 0.07 for the others). But on what criterion should the nodes be selected? How can one ensure that only the pertinent nodes are considered activated? A full study of the influence of the node activation threshold on the system's results remains to be carried out.
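The ratios discussed here can be recomputed from the totals of Table 13. The sketch below uses only figures given in the text; because it does not round in the same way, its output may differ from the table in the last digit.

```python
# Totals and successes per agent type, copied from Table 13.
totals = {"SS": 194, "IS": 8112, "FS": 101, "ZS": 47, "SA": 13}
successes = {"SS": 17, "IS": 111, "FS": 5, "ZS": 4, "SA": 1}
cycles = 11
specific_nodes = 6295

all_run = sum(totals.values())
for kind, total in totals.items():
    share = 100 * total / all_run
    rate = successes[kind] / total
    print(f"{kind}: {total / cycles:6.1f} per cycle, "
          f"{share:5.1f}% of runs, success rate {rate:.3f}")

# Average number of instance seekers run per specific node.
is_per_node = totals["IS"] / specific_nodes
print(f"instance seekers per specific node: {is_per_node:.2f}")
```

Instance seekers account for almost all of the runs while succeeding least often, which is the basis for the suggestion to restrict the activated specific nodes.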
Bibliography
- Parmentier and Belaïd 1997
- F. Parmentier and A. Belaïd. Logical Structure Recognition of Scientific Bibliographic References. In ICDAR'97, volume 2, pages 1072-1076, Ulm, Germany, August 18-20 1997. IEEE. Available at ftp://ftp.loria.fr/pub/loria/read/publications/parmenti-icdar97.ps.
- Kerpedjiev 1991
- S. M. Kerpedjiev. Automatic Extraction of Information Structures from Documents. In First International Conference on Document Analysis and Recognition (ICDAR'91), volume 2, St Malo, France, 1991.
- Smaïl 1994
- M. Smaïl. Raisonnement à base de cas pour une recherche évolutive d'information; Prototype Cabri-n. Vers la définition d'un cadre d'acquisition de connaissances. Thèse de doctorat, Université Henri Poincaré -- Nancy I, 14 octobre 1994.