Caveat - I sometimes use the BNC, and things like it, for my work, but I am not offering or claiming detailed technical knowledge of the process of encoding, tagging and processing the samples involved in its creation.
The British National Corpus (BNC) is a snapshot of the English language in the first half of the 1990s. Over 4,000 sample texts, 90% written and 10% spoken (and converted into text), were gathered, totalling roughly 100 million words. I'm thinking of Borges' "Library of Babel" as I type - it would take up about 10 metres of shelf space. Read aloud quickly for 8 hours per day, it would take roughly 4 years to finish. These statistics and estimates come from the people who made the BNC. The BNC focuses on the English language but may include loanwords, doesn't really show any change over time, and is general rather than genre-specific.
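The read-aloud estimate is easy to sanity-check. This is only a rough sketch: the reading rate of 150 words per minute is my own assumption for "quickly", not a figure taken from this post, and the function name is made up for illustration.

```python
def reading_years(total_words, words_per_minute, hours_per_day):
    """Years needed to read `total_words` aloud, reading every day of the year."""
    words_per_day = words_per_minute * 60 * hours_per_day
    days = total_words / words_per_day
    return days / 365

# 100 million words at an ASSUMED 150 words per minute, 8 hours a day
years = reading_years(100_000_000, 150, 8)
print(f"{years:.1f} years")
```

At that assumed rate the result lands a little under 4 years, which fits the "roughly 4 years" above. A slower reader would take longer.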
What is it for?
I mainly use it with students for putting language into context, but it's handy for a few different things. You can type in any word or phrase, and the sentences within the BNC where your search term occurs will be displayed, giving you examples of how the term is used. The BNC, and other corpora like it, are extremely useful for developing learners' contextual awareness of vocabulary. Learners often focus on the definition of a word and its part of speech, but then use the new language inappropriately. Things like the BNC help them to take specific action to "develop a feel" for language, something that is obviously not easy.
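To give a feel for what a search like this does, here is a minimal sketch of the idea. It is not how the BNC's own search tools work; the `concordance` function and the four sample sentences are invented for illustration and are not BNC data.

```python
import re

def concordance(sentences, term):
    """Return each sentence that contains `term` as a whole word or phrase."""
    # \b keeps "run" from matching inside "runs"; matching ignores case
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Hypothetical mini-corpus (made-up sentences, NOT taken from the BNC)
sample = [
    "She decided to run a small bookshop.",
    "The run on the bank lasted three days.",
    "He runs every morning before work.",
    "They ran out of time.",
]

for hit in concordance(sample, "run"):
    print(hit)
```

Even in this toy sample, the two hits show "run" in two quite different senses, which is exactly the kind of thing learners then have to sort out for themselves.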
The BNC is downloadable, and gives a lot of information about itself and the types of samples it includes. It can be used to run comparative statistical analysis of a fairly big language sample for quantitative linguistic research. You could also, for example, use the data to pull out high frequency words of a certain type and use your findings to supplement a course of language study. A corpus (body or large sample of language) is where the authors of reference books go to get answers. Dr. Johnson had one in his head, of course. Corpora are also extensively used in training AIs and other programs to recognise and process natural language.
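Pulling out high-frequency words of a certain type might look something like the sketch below. It assumes the text has already been split into (word, tag) pairs; the tag labels used here (`NOUN`, `VERB`, `DET`) are placeholders I have chosen, not the BNC's own tagset, and the sample data is invented.

```python
from collections import Counter

def top_words(tagged_tokens, pos, n=3):
    """Return the `n` most frequent words carrying the part-of-speech tag `pos`."""
    counts = Counter(word.lower() for word, tag in tagged_tokens if tag == pos)
    return counts.most_common(n)

# Hypothetical tagged sample (placeholder tags, NOT real BNC data)
tagged = [
    ("The", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("the", "DET"), ("cat", "NOUN"),
    ("The", "DET"), ("cat", "NOUN"), ("saw", "VERB"), ("the", "DET"), ("dog", "NOUN"),
    ("A", "DET"), ("dog", "NOUN"), ("barked", "VERB"),
]

print(top_words(tagged, "NOUN", n=2))
```

With a real corpus the same counting approach would give a ranked list of, say, the most common nouns, which could then feed into a vocabulary list for a course.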
The BNC is far from perfect; it was out of date as soon as it was finished. Search for fugazi, and you get 4 sentences about the band, and nothing about the word itself. When students run a search, they get all the results, so grouping the results into contexts and understanding the different senses of an item is still a challenge, but this process in itself is highly valuable.
Crucially, the BNC is a sample of English as it is actually used. It is compiled from real examples of the language: newspapers, essays, radio phone-ins. As a result, searches are unpredictable as well as useful. I've had some funny moments in class when the search is being displayed on a projector or TV screen.
Using the BNC