Building the physical part of a Concept Network representing bibliographic references

As part of the process of Building a Concept Network to represent bibliographic references, one needs to build the physical part of the Concept Network. Its logical part consists only of the inter-field separators. Each node contains the names of the separated fields in their order of appearance (AUTHOR-TITLE is different from TITLE-AUTHOR) and the XML string representing the separator. These separators are extracted automatically, so that the automatic building of the Concept Network can adapt to any bibliographic style. This automatic extraction uses both logical and physical knowledge. The tool starts from the logical version of a reference and from its PostScript version, and ends by producing the XML-formatted separators. To do so, it needs the XML version of the bibliographic database and the name of the style to use (this work sticks to plain; cf. Physical structure of a bibliographic reference).

For each reference in the database (for example, the one in Figure 1), a LaTeX2e file containing only this reference is generated, and a PostScript file is obtained from it via dvips (such as the one in Figure 3, which prints roughly as shown in Figure 2). This PostScript file can be read by Ghostscript, which means that its textual information can be extracted (see Figure 4).
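The generation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the article document class and the file names are hypothetical, and the exact LaTeX preamble used in this work is not given here. The Ghostscript extraction step is left out because the text does not say how it is invoked.

```python
def single_reference_tex(key: str, bib_base: str, style: str = "plain") -> str:
    """Return LaTeX2e source that typesets only the reference `key`.

    `bib_base` is the BibTeX database name without the .bib extension.
    The article class and empty page style are assumptions.
    """
    return "\n".join([
        r"\documentclass{article}",
        r"\pagestyle{empty}",
        r"\begin{document}",
        rf"\nocite{{{key}}}",              # cite the single reference silently
        rf"\bibliographystyle{{{style}}}",  # the style used here is plain
        rf"\bibliography{{{bib_base}}}",
        r"\end{document}",
        "",
    ])


def build_commands(job: str) -> list[list[str]]:
    """Usual command sequence from job.tex to job.ps (not executed here)."""
    return [
        ["latex", job],
        ["bibtex", job],
        ["latex", job],
        ["latex", job],
        ["dvips", "-o", f"{job}.ps", f"{job}.dvi"],
    ]


if __name__ == "__main__":
    print(single_reference_tex("bose94a", "base"))
    for command in build_commands("bose94a"):
        print(" ".join(command))
```

Each command list could be passed to subprocess.run once the .tex file has been written to disk.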

Figure 1: the bose94a reference in BibTeX.

@ARTICLE{bose94a,
 AUTHOR   = {C. B. Bose and S. Kuo},
 JOURNAL  = {Pattern Recognition},
 NUMBER   = {10},
 PAGES    = {1345--1363},
 TITLE    = {Connected and Degraded Text Recognition Using Hidden Markov Model},
 VOLUME   = {27},
 YEAR     = {1994},
 KEYWORDS = {texte, reconnaissance, hmm, segmentation, connexes, caractere}
}

Figure 2: the PostScript version of the bose94a reference.

C. B. Bose and S. Kuo. Connected and Degraded Text Recognition Using Hidden Markov Model. Pattern Recognition, 27(10):1345--1363, 1994.

Figure 3: extract from the PostScript file corresponding to the bose94a reference.

%%Page: 1 1
1 0 bop 220 266 a Fc(Refer)o(ences)220 358 y Fb(1)20
b(C.)11 b(B.)g(Bose)g(and)f(S.)h(K)o(uo.)17 b(Connected)10
b(and)h(De)o(graded)g(T)m(e)o(xt)g(Recognition)e(Using)h(Hidden)289
408 y(Marko)o(v)g(Model.)15 b Fa(P)m(attern)9 b(Recognition)p
Fb(,)g(27\(10\):1345\226136)o(3,)f(1994.)p eop
%%Trailer
end
userdict /end-hook known{end-hook}if
%%EOF

Figure 4: Physical XML version of the bose94a reference.

<Times-Roman>C. B. Bose and S. Kuo. Connected and Degraded Text Recognition 
Using Hidden Markov Model. </Times-Roman><Times-Italic>Pattern Recognition
</Times-Italic><Times-Roman>, 27(10):1345--1363, 1994.</Times-Roman>

Field separators are deduced from the logical and physical XML versions of the reference, by locating the fields. Here one hits a difficulty that is not yet solved, although it did not undermine our approach: the fields must have the same structure logically and physically. This penalizes complex bibliographic styles in which, for example, last name and first name are inverted. The format analysis should then go further during the BibTeX2XML transformation, down to a third level that separates last name from first name.
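A minimal sketch of this deduction, in Python, is given below. It is not the tool itself: the function name and the tuple layout are assumptions, only first-level fields are handled, and the physical string is that of Figure 4 with its line breaks removed. Each field value is searched for in order in the physical string, and whatever lies between two consecutive fields is taken as a separator.

```python
PHYSICAL = (
    "<Times-Roman>C. B. Bose and S. Kuo. Connected and Degraded Text "
    "Recognition Using Hidden Markov Model. </Times-Roman><Times-Italic>"
    "Pattern Recognition</Times-Italic><Times-Roman>, 27(10):1345--1363, "
    "1994.</Times-Roman>"
)

# Logical version: first-level fields in their order of appearance.
FIELDS = [
    ("author", "C. B. Bose and S. Kuo"),
    ("title", "Connected and Degraded Text Recognition Using Hidden Markov Model"),
    ("journal", "Pattern Recognition"),
    ("volume", "27"),
    ("number", "10"),
    ("pages", "1345--1363"),
    ("year", "1994"),
]


def find_separators(physical, fields):
    """Return (string, position, left field, right field) for each separator."""
    separators = []
    pos = 0      # end of the previous field in the physical string
    left = ""    # name of the previous field ("" before the first one)
    for name, text in fields:
        start = physical.find(text, pos)
        if start < 0:
            # Logical and physical structures differ (e.g. inverted names).
            raise ValueError(f"field {name!r} not found after index {pos}")
        separators.append((physical[pos:start], pos, left, name))
        pos = start + len(text)
        left = name
    separators.append((physical[pos:], pos, left, ""))  # trailing separator
    return separators


if __name__ == "__main__":
    for separator in find_separators(PHYSICAL, FIELDS):
        print(separator)
```

On this reference, the positions found are 0, 34, 101, 150, 182, 185, 197 and 203, for a string of 218 characters. The ValueError branch corresponds to the unsolved difficulty mentioned above: a field whose physical form differs from its logical form cannot be located this way.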

Figure 5: Automatically detected separators for the bose94a reference.

<doc>
 <longueur>218</longueur>
 <sep><chaine><Times-Roman></chaine>
      <empl>0</empl><champ1></champ1><champ2>author</champ2><champ3>doc</champ3></sep>
 <sep><chaine> and </chaine>
      <empl>23</empl><champ1>a</champ1><champ2>a</champ2><champ3>author</champ3></sep>
 <sep><chaine>. </chaine>
      <empl>34</empl><champ1>author</champ1><champ2>title</champ2><champ3>doc</champ3></sep>
 <sep><chaine> </chaine>
      <empl>45</empl><champ1>mot</champ1><champ2>mot</champ2><champ3>title</champ3></sep>
 <sep><chaine> </chaine>
      <empl>49</empl><champ1>mot</champ1><champ2>mot</champ2><champ3>title</champ3></sep>
...
 <sep><chaine>. </Times-Roman><Times-Italic></chaine>
      <empl>101</empl><champ1>title</champ1><champ2>journal</champ2><champ3>doc</champ3></sep>
 <sep><chaine> </chaine>
      <empl>138</empl><champ1>jw</champ1><champ2>jw</champ2><champ3>journal</champ3></sep>
 <sep><chaine></Times-Italic><Times-Roman>, </chaine>
      <empl>150</empl><champ1>journal</champ1><champ2>volume</champ2><champ3>doc</champ3></sep>
 <sep><chaine>(</chaine>
      <empl>182</empl><champ1>volume</champ1><champ2>number</champ2><champ3>doc</champ3></sep>
 <sep><chaine>):</chaine>
      <empl>185</empl><champ1>number</champ1><champ2>pages</champ2><champ3>doc</champ3></sep>
 <sep><chaine>, </chaine>
      <empl>197</empl><champ1>pages</champ1><champ2>year</champ2><champ3>doc</champ3></sep>
 <sep><chaine>.</Times-Roman></chaine>
      <empl>203</empl><champ1>year</champ1><champ2></champ2><champ3>doc</champ3></sep>
</doc>

Figure 5 shows the result of the automatic separator detection on the bose94a reference in its physical XML format (Figure 4). longueur (French for length) is the number of characters in the physical XML version of the reference. empl (short for emplacement, French for location) is the index of the first character of the detected separator. champ1 (champ is French for field) is the field to the left of the separator, champ2 is the field to its right, and champ3 is the field containing the separator (all first-level fields are contained in the "root" field, doc).
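To make the record layout concrete, here is a small Python sketch that reads such records. It is an illustration only: the Separator type, the node_label helper and the regular expression are assumptions. A regular expression is used because, as printed in Figure 5, the chaine element holds raw font tags such as <Times-Roman>, so a strict XML parser would reject the text.

```python
import re
from typing import NamedTuple


class Separator(NamedTuple):
    chaine: str   # the separator string, font tags included
    empl: int     # index of its first character
    champ1: str   # left-side field ("" at the start of the reference)
    champ2: str   # right-side field ("" at the end of the reference)
    champ3: str   # field containing the separator

    def node_label(self) -> str:
        """Label of the separator node, e.g. 'author-title'."""
        return f"{self.champ1}-{self.champ2}"


SEP_RE = re.compile(
    r"<sep><chaine>(.*?)</chaine>\s*"
    r"<empl>(\d+)</empl>"
    r"<champ1>(.*?)</champ1>"
    r"<champ2>(.*?)</champ2>"
    r"<champ3>(.*?)</champ3></sep>",
    re.DOTALL,
)


def parse_separators(text: str) -> list[Separator]:
    """Extract every <sep> record from a Figure 5 style document."""
    return [
        Separator(chaine, int(empl), c1, c2, c3)
        for chaine, empl, c1, c2, c3 in SEP_RE.findall(text)
    ]


SAMPLE = """<doc>
 <longueur>218</longueur>
 <sep><chaine><Times-Roman></chaine>
      <empl>0</empl><champ1></champ1><champ2>author</champ2><champ3>doc</champ3></sep>
 <sep><chaine>. </chaine>
      <empl>34</empl><champ1>author</champ1><champ2>title</champ2><champ3>doc</champ3></sep>
</doc>"""

if __name__ == "__main__":
    for sep in parse_separators(SAMPLE):
        print(sep.node_label(), sep)
```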

Figure 6: Part of the Concept Network matching the separators of Figure 5.

                    +---------+
     +-----96%----->| -author |              +-------+
     |              |   672   |   +---33%--->|  a-a  |
     |              +---------+   |          |  299  |
     |                100% |      |          +-------+
     |                     v      |         100%| ^33%
     |               ##########   |             v |
     +~~~~~~~~~~~~~~># author #---+          #########
     |               #  906   #   +~~~~~~~~~>#   a   #
     |               ##########              #  900  #
     |                 79% |                 #########
     |                     v
     |              +------------+
     +-----78%----->|author-title|
     |              |    715     |
     |              +------------+
     |                100% |
     |                     v
     |               #########                +-----------+
     +~~~~~~~~~~~~~~># title #------+--100%-->| word-word |
     |               #  907  #      +         |   5922    |
     |               #########      +         +-----------+
     |                 25% |        +          100%| ^100%
     |                     v        +              v |
     |              +-------------+ +           #########
     +-----25%----->|title-journal| +~~~~~~~~~~># word  #
     |              |    232      |             #  907  #
     |              +-------------+             #########
     |                100% |
     |                     v
     |               ###########
     +~~~~~~~~~~~~~~># journal #
     |               #  287    #
     |               ###########
     |                 74% |
     |                     v                             LEGEND
     |              +--------------+        #########             +----------+
     +-----23%----->|journal-volume|        # field #  contains   | separator|
     |              |    232       |        # refs# #----weight-->|occurrence|
     |              +--------------+        #########             +----------+
#######               100% |                     +                     |precedes
# doc #                    v                     +                     |weight
# 908 #              ##########                  +                     v
#######~~~~~~~~~~~~~># volume #                  +   contains     #############
     |               #  308   #                  +~~~~~~~~~~~~~~~># sub-field #
     |               ##########                                   #  refs #   #
     |                 37% |                                      #############
     |                     v
     |              +-------------+
     +-----12%----->|volume-number|
     |              |    117      |
     |              +-------------+
     |                100% |
     |                     v
     |               ##########
     +~~~~~~~~~~~~~~># number #
     |               #  195   #
     |               ##########
     |                 56% |
     |                     v
     |              +------------+
     +-----12%----->|number-pages|
     |              |    111     |
     |              +------------+
     |                100% |
     |                     v
     |               #########
     +~~~~~~~~~~~~~~># pages #
     |               #  574  #
     |               #########
     |                 38% |
     |                     v
     |              +----------+
     +-----24%----->|pages-year|
     |              |    223   |
     |              +----------+
     |                100% |
     |                     v
     |               #########
     +~~~~~~~~~~~~~~># year  #
     |               #  906  #
     |               #########
     |                 93% |
     |                     v
     |              +--------+
     +-----92%----->|  year- |
                    |    843 |
                    +--------+

Afterwards, the physical part of the Concept Network is built, using the automatically extracted information to link separator nodes to field nodes (cf. Figure 6). The number in a field box is the number of references in the database in which the field appears. The number in a separator box is the number of occurrences of this separator in the database.
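The counting and linking can be sketched in Python as below. This is a sketch under assumptions, not the actual implementation: the function name is hypothetical, separator nodes are keyed by the pair of field names only, a field is counted once per reference in which it appears next to or around a separator, and the links between a field and its sub-fields shown in Figure 6 are left out.

```python
from collections import Counter


def build_physical_part(references):
    """Count nodes and collect links from detected separators.

    `references` holds one list per reference; each list contains
    (champ1, champ2, champ3) triples, one per detected separator.
    """
    field_refs = Counter()       # field -> number of references using it
    sep_occurrences = Counter()  # separator node -> number of occurrences
    links = set()                # (source, kind, target) triples
    for separators in references:
        fields_seen = set()
        for left, right, parent in separators:
            node = f"{left}-{right}"
            sep_occurrences[node] += 1
            links.add((parent, "contains", node))
            if left:
                links.add((left, "precedes", node))
            if right:
                links.add((node, "precedes", right))
            fields_seen.update(f for f in (left, right, parent) if f)
        field_refs.update(fields_seen)  # once per reference
    return field_refs, sep_occurrences, links


if __name__ == "__main__":
    # Two made-up references, for illustration only.
    refs = [
        [("", "author", "doc"), ("author", "title", "doc"), ("title", "", "doc")],
        [("", "author", "doc"), ("author", "year", "doc"), ("year", "", "doc")],
    ]
    fields, seps, links = build_physical_part(refs)
    print(fields)
    print(seps)
    print(sorted(links))
```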

Figure 7: Calculation of the weights of the links entering and leaving a separator.

                      ###########
                      # field 3 #
                      #   C3    #
                      ###########
                           |contains
                           | min(100, 100 x Cs / C3)
                           v
########## precedes  +-----------+  precedes  ###########
# field 1#---------->| separator |-----------># field 2 #
#    C1  # 100xCs/C1 |    Cs     |    100     #    C2   #
##########           +-----------+            ###########

The weights shown in Figure 6 follow the calculations given in Figure 7.
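As a worked example, the formulas of Figure 7 can be written as a short Python function. The function and key names are hypothetical. Displaying whole percentages by truncation is an assumption on my part, chosen because it agrees with the volume-number values of Figure 6; the text does not state how the percentages are rounded.

```python
def link_weights(c1: int, cs: int, c3: int) -> dict[str, float]:
    """Weights (in percent) of the three links around a separator.

    c1: number of references containing the left-side field (C1)
    cs: number of occurrences of the separator (Cs)
    c3: number of references containing the enclosing field (C3)
    """
    return {
        "field1_precedes_separator": 100 * cs / c1,
        "separator_precedes_field2": 100.0,
        # A separator may occur several times per reference, hence the cap.
        "field3_contains_separator": min(100.0, 100 * cs / c3),
    }


if __name__ == "__main__":
    # volume-number separator of Figure 6: volume = 308, separator = 117, doc = 908
    weights = link_weights(c1=308, cs=117, c3=908)
    for name, value in weights.items():
        print(name, int(value))
```

With these counts the volume to volume-number link comes out at 37% and the doc to volume-number link at 12%, as in Figure 6. For word-word inside title (5922 occurrences for 907 references), the cap gives 100%.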

Disclaimer: as I do not speak English fluently, I welcome all suggestions to improve this writeup.
