Sketch Engine is software for text analysis, database management and corpus management that supports over 90 languages. It was developed by Lexical Computing Limited and released in 2003. It helps linguists, lexicographers, translators, students, and teachers understand how language works.

Sketch Engine lets users instantly identify what is typical of a language, what is rare or unusual, and which usages are becoming more frequent. This is possible because its algorithms can analyse authentic texts containing billions of words. Its functions, which are based on mathematical and statistical computations, enable users to search and filter language corpora precisely.

Word Sketches, one of Sketch Engine’s key features, gave the software its name. They are automatic, corpus-based, one-page summaries of the grammatical and collocational behaviour of a word.

Sketch Engine has been used by Macmillan English Dictionary, Dictionnaires Le Robert, Oxford University Press and Shogakukan, as well as by four of the biggest dictionary publishers in the UK.

In this paper I give a brief overview of Sketch Engine’s core functions, its development, the people who originally had the idea to create it, and the free corpus-based web tool “SkeLL”.1

Lexical Computing Limited is a company founded by the lexicographer and research scientist Adam Kilgarriff in 2003. Sketch Engine was initially released on 23 July 2003.2

Lexical Computing Limited provides large, high-quality word databases, lexical data, word lists, lexicons in many languages and similar language data for use in other software or in lexicographic projects.

Large databases of authentic text, called text corpora, are the source of their data. Their largest corpora contain 40 billion words.

Corpora of this size make it possible to generate databases of up to hundreds of millions of items that remain accurate and reliable. Their customers include software developers, publishers of dictionaries and language teaching materials, and others who need reliable language data.

They also provide services regarding full-text search, terminology extraction, document classification and categorization, data mining and information retrieval.

The biggest and most popular product of Lexical Computing is Sketch Engine.3

Adam Kilgarriff was the founder of Lexical Computing and a central figure in Sketch Engine until November 2014, when he was diagnosed with cancer; he died six months later. He spent his career researching the intersection of corpus linguistics, computational linguistics and lexicography.4

He studied at Cambridge University and graduated with a Bachelor of Arts degree in philosophy and engineering in 1982. In 1987, he started a Master of Science in intelligent knowledge-based systems at the University of Sussex and continued with a Doctor of Philosophy in computational linguistics, completing his thesis “Polysemy” in 1992.

He was diagnosed with stage 4 bowel cancer in November 2014. After the cancer was diagnosed, he started his own blog where he wrote about his experience with the disease, his thoughts on language, corpus linguistics, life and the world in general. Adam succumbed to the cancer in May 2015.

His work on polysemy brought Adam Kilgarriff to corpus linguistics and text corpora, to which he devoted the rest of his career. Kilgarriff invented the concept of word sketches, which form the core of Sketch Engine.5

Pavel Rychly is a computer scientist and the main software architect of Sketch Engine. He is the original author of many of its components, most notably the Manatee corpus indexing system. He is also a researcher in natural language processing; he holds a Doctor of Philosophy on indexing text corpora and has since turned to efficient large-scale text processing.6

Sketch Engine offers many features and consists of three main components. Manatee is a database management system used to index text corpora consisting of billions of words. Bonito is a web interface to Manatee that allows users to search corpora. The third component, Corpus Architect, is a web interface for building and managing corpora.7
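To give a rough idea of what indexing a corpus means, the following Python sketch builds a tiny positional index and uses it to look up a word in context. It is only an illustration under simplifying assumptions: the function names `build_index` and `concordance` are hypothetical, the corpus is a single toy sentence, and nothing here reflects how Manatee or Bonito are actually implemented.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each lower-cased word form to the positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[token.lower()].append(position)
    return index


def concordance(tokens, index, word, window=2):
    """Return one keyword-in-context line per occurrence of `word`."""
    lines = []
    # .get() avoids creating an empty entry for words that never occur
    for pos in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, pos - window):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + window])
        lines.append(f"{left} [{tokens[pos]}] {right}".strip())
    return lines


# Toy corpus (hypothetical data, for illustration only)
tokens = "the cat sat on the mat and the dog sat on the rug".split()
index = build_index(tokens)
hits = concordance(tokens, index, "sat")
for line in hits:
    print(line)
```

Because the index stores positions, a search only has to visit the places where the word occurs instead of scanning the whole text, which is the basic reason indexing matters for corpora with billions of words.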

Word sketches gather information from millions of examples of use and present it as a one-page summary of categorised collocations with links to examples. A single page is therefore enough to see how a word is used.8
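The idea of a summary of categorised collocations can be sketched in a few lines of Python. This is a minimal illustration, not Sketch Engine's method: it assumes the corpus has already been reduced to hypothetical (relation, headword, collocate) triples, it ranks collocates by raw frequency only, and it leaves out the links to examples. The function name `word_sketch` and the relation labels are invented for this example.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group the collocates of `headword` by grammatical relation,
    keeping the `top_n` most frequent collocates per relation."""
    counts = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            counts[relation][collocate] += 1
    return {rel: c.most_common(top_n) for rel, c in counts.items()}


# Hypothetical pre-extracted triples (illustrative data only)
triples = [
    ("modifier", "tea", "green"),
    ("modifier", "tea", "green"),
    ("modifier", "tea", "hot"),
    ("object_of", "tea", "drink"),
    ("object_of", "tea", "drink"),
    ("object_of", "tea", "pour"),
    ("modifier", "coffee", "black"),
]

sketch = word_sketch(triples, "tea")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Each relation becomes one column of the summary, so a reader can see at a glance which modifiers and which verbs typically go with the word. A real system would presumably rank collocates with a statistical association measure rather than raw counts.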

