Automatic recognition of bibliographic references is the main application of BAsCET in François Parmentier's Ph.D. thesis. The aim of this work is to extract the logical structure of the bibliographic references located at the end of scientific papers. These references are available either on paper or as electronic documents that contain only physical data (PostScript, PDF, HTML, and so forth). Reading digitized documents (images) requires OCR, which is hampered by noise. Images are therefore not treated here: electronic documents without logical structure are already hard enough to study.

For availability reasons, BibTeX databases and tools were chosen, and the hierarchical structure of the BibTeX format is therefore used. The data is generated automatically, in PostScript format, from a BibTeX database. To limit the complexity of the problem, the only bibliographic style used is BibTeX's default one: plain.
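
To make the gap between logical structure and physical data concrete, here is a minimal Python sketch. It is not part of BAsCET or of the thesis. The entry, its key, and the helper names `format_authors` and `render_article` are all hypothetical, and the rendering only approximates what the plain style produces for an article entry. The point is that the fields are explicit in the database but become a single flat string once formatted, and the recognition task is to recover them from that string.

```python
# Hypothetical BibTeX-like entry: the logical structure is explicit,
# each piece of information sits in a named field.
entry = {
    "type": "article",
    "key": "doe1998",  # made-up citation key
    "author": "Jane Doe and John Smith",
    "title": "A sample title",
    "journal": "Journal of Examples",
    "volume": "12",
    "number": "3",
    "pages": "45--67",
    "year": "1998",
}


def format_authors(author_field):
    """Join author names the way a reference list usually prints them."""
    names = [name.strip() for name in author_field.split(" and ")]
    if len(names) == 1:
        return names[0]
    if len(names) == 2:
        return names[0] + " and " + names[1]
    return ", ".join(names[:-1]) + ", and " + names[-1]


def render_article(e):
    """Flatten an article entry into one line, roughly in the plain style.

    Assumption: only the fields used below are present, and font changes
    (such as the italic journal name) are ignored.
    """
    pages = e["pages"].replace("--", "\u2013")  # "--" is typeset as an en dash
    return "{}. {}. {}, {}({}):{}, {}.".format(
        format_authors(e["author"]),
        e["title"],
        e["journal"],
        e["volume"],
        e["number"],
        pages,
        e["year"],
    )


if __name__ == "__main__":
    # The printed line carries no field names: only punctuation and
    # position hint at where the author, title, or journal begins.
    print(render_article(entry))
```

In the flattened line, the same full stop separates the authors from the title and the title from the journal, which is one reason the structure cannot be recovered by simply splitting on punctuation.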

This application needs the Concept Network to be built automatically, and it relies on generic agents for recognizing micro-structures. The quality of the results is rather difficult to estimate. However, according to the criteria retained, the global recognition score on a test base of 1117 references is 65.5%. This score could be improved considerably (by creating specific agents, by applying a linguistic treatment to the base beforehand, by tuning the parameters more finely, and so on). Some improvements have also been proposed.
