Overview of commercial and free thesauri

Miss_j, 123RF

Spicy Vocabulary

A deft use of language can give a text the perfect finish. Those who wish to steer clear of commercial thesauri can find world-class free software alternatives.

A simple search for GPS apps for your smartphone these days can turn up results covering everything from baseball players to Moroccan cuisine. Often, Internet search results only make sense when you add terms to the search itself.

The reason lies in the language analysis that is built into the search engine's query processing. This analysis examines the relationships between words according to various criteria and uses them to refine the results. Search engines also appear to understand what you are looking for: They return precise and varied results even when your search terms contain typos.

Although the user interface (e.g., an input screen) often appears quite mundane, the process behind these searches has little in common with a simple phrase or keyword search [1]. In the past, a query found files or documents only if the search terms were chosen and spelled correctly.

However, over the past 20 years, search engines have evolved into sophisticated language processors in which many criteria play a vital role. Search engines weigh more than 50 different factors, including the language of the document, its format and structure, technical terms in the metadata, how often the document is referenced, and how it is linked (e.g., its degree of network integration).

The result is then based on concepts that belong together thematically. In addition to a large dose of statistics, knowledge about the linguistic context of the individual words plays a major role. This knowledge comes from the linguistic components of thesauri (see the "Origins" box).

Origins

The term thesaurus comes from the ancient Greek word thesauros, which means treasure or treasure trove. Unsurprisingly, the word for thesaurus in Latin is also "thesaurus" and refers to a collection ordered thematically by certain criteria, in which objects are related to one another, a "storehouse of knowledge," if you will. A thesaurus is essentially a dictionary based on linguistics and information science that compiles the entire vocabulary of a language.

In the 1950s, the thesaurus became a specialized reference book with a controlled, limited vocabulary that covers individual words as well as their relations to other words. The basis for the vocabulary used in a thesaurus was developed under the authority of the German National Library [2] and the Library of Congress Subject Headings (LCSH) [3]. Synonyms, as well as generic (broader) terms and narrower terms, are primarily used. The relationships between the terms are standardized in accordance with DIN 1463-1 (internationally, ISO 2788) and are known as associations and references.
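
The relationships described above (synonyms, broader terms, and narrower terms) map naturally onto a small data structure. The following Python sketch is only an illustration: The class names `Term` and `Thesaurus`, the method names, and the sample words are assumptions made for this example and are not taken from any of the thesauri mentioned here. The BT/NT/RT/UF abbreviations in the comments are the relationship indicators commonly used with ISO 2788-style thesauri.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One entry in a controlled vocabulary (hypothetical structure)."""
    label: str
    synonyms: set = field(default_factory=set)   # equivalence (USE / UF)
    broader: set = field(default_factory=set)    # generic terms (BT)
    narrower: set = field(default_factory=set)   # narrower terms (NT)
    related: set = field(default_factory=set)    # associations (RT)


class Thesaurus:
    """Keeps terms by label and maintains both directions of each relation."""

    def __init__(self):
        self.terms = {}

    def term(self, label):
        # Create the term on first use so relations can be added in any order.
        return self.terms.setdefault(label, Term(label))

    def add_synonym(self, a, b):
        self.term(a).synonyms.add(b)
        self.term(b).synonyms.add(a)

    def add_broader(self, narrow, broad):
        # A broader-term link implies the inverse narrower-term link.
        self.term(narrow).broader.add(broad)
        self.term(broad).narrower.add(narrow)

    def ancestors(self, label):
        """Return all broader terms reachable from label, nearest first."""
        found, queue = [], [label]
        while queue:
            current = queue.pop(0)
            for parent in sorted(self.term(current).broader):
                if parent != label and parent not in found:
                    found.append(parent)
                    queue.append(parent)
        return found


# Illustrative sample data, invented for this sketch.
demo = Thesaurus()
demo.add_synonym("umpire", "referee")
demo.add_broader("umpire", "official")
demo.add_broader("official", "person")
print(demo.ancestors("umpire"))
```

Storing the inverse link at the same time keeps the hierarchy navigable in both directions, which is what a display such as the one in Figure 1 needs.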

Current examples include the Thesaurus Linguae Latinae (abbreviated ThlL or TLL) [4] for Latin, the Thesaurus Linguae Graecae (TLG) [5] for Greek, and the UNESCO Thesaurus [6].

The latter is more of a collective work that covers education, science, culture, social sciences and humanities, information and communication sciences, political science, law, and economic sciences. All of the entries are in English, French, Spanish, and Russian. The European Thesaurus on International Relations and Area Studies [7] and the Getty Thesaurus of Geographic Names (TGN) [8] are also very useful. These are now available as open data for anyone who is interested (Figure 1).

Figure 1: Detailed search results and hierarchy for the German term "Warnemünde" on TGN.

Linguistic thesauri are based on applied linguistics and use mind maps to organize vocabulary. The main goals are to display the otherwise invisible connections (semantics) between words of different origins and to illustrate the similarity (relations and associations) between words.

Additionally, thesauri serve to explore the history of language and to determine specific meanings and their history. Thesauri assist people in day-to-day life as dictionaries of synonyms. These references can help you create more elegant expressions and may also help you become more skillful in a given language.

In IT, thesauri often come bundled with word processors, as well as with search engines. They often form the basis for spellcheckers and support grammar checking. KThesaurus [9] and OpenThesaurus for LibreOffice [10] are just a couple of examples of the practical application of thesauri.

Projects and Tools

The Swiss website Lexikon.ch [11] offers an introduction to research in German-speaking countries. It provides a special search engine for encyclopedias, thesauri, dictionaries, quotation collections, abbreviations, and rhyming dictionaries. The site catalogs both commercial and free projects.

Woxikon [12], Leo [13], and Beolingus/Dict [14], among others, are offered solely online. Woxikon and Leo offer supplemental information for Slavic, Romance, and Scandinavian languages. Beolingus/Dict focuses on English, Spanish, and Portuguese. Leo and Beolingus/Dict grew out of research at TU Munich and TU Chemnitz, and they work in cooperation with the French National Center for Textual and Lexical Resources (CNRTL) [15] in Nancy (Lorraine) as well as with OpenThesaurus [16] and WordNet [17].

Commercial thesauri were traditionally published in book form (e.g., Oxford English Dictionary, Roget's, German Standard Edition of Duden). Most publishers now integrate their thesauri into their online editions and make them available in a web browser or mobile app. Admittedly, these websites are more oriented toward occasional users.

The publishers provide an interface (API) for unlimited use and integration into your own application. Working with this requires that you register and acquire an API key. You will need to use this key for every query.

The Macmillan Dictionary [18], Merriam-Webster [19], and Cambridge Dictionaries Online [20] provide their results in XML or JavaScript Object Notation (JSON) and comply with current, accepted web standards. Listing 1 shows a query on Merriam-Webster, and Listing 2 displays the corresponding search result.

Listing 1

Merriam-Webster Query

http://www.dictionaryapi.com/api/v1/references/thesaurus/xml/umpire?key=API-Key

Listing 2

Merriam-Webster Result

<entry id="umpire">
 <term>
  <hw>umpire</hw>
 </term>
 <fl>noun</fl>
 <sens>
  <mc>a person who impartially decides or resolves a dispute or controversy</mc>
  <vi>usually acts as <it>umpire</it> in the all-too-frequent squabbles between
  the two other roommates</vi>
  <syn>adjudicator, arbiter, arbitrator, referee, umpire</syn>
  <rel>jurist, justice, magistrate; intermediary, intermediate, mediator,
  mediatrix, moderator, negotiator; conciliator, go-between, peacemaker,
  reconciler, troubleshooter; decider</rel>
 </sens>
</entry>
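
As a rough illustration of how a program might work with Listings 1 and 2, the following Python sketch builds a query URL in the form shown in Listing 1 and extracts the synonyms and related words from the XML shown in Listing 2. It relies only on the Python standard library. The function names are assumptions made for this example, the URL pattern is taken at face value from Listing 1 (the live service may have changed since), and the sketch parses a local copy of the result rather than contacting the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

# URL pattern copied from Listing 1; assumed, not verified against the live API.
BASE_URL = "http://www.dictionaryapi.com/api/v1/references/thesaurus/xml/"

# Local copy of the result from Listing 2.
SAMPLE_RESULT = """<entry id="umpire">
 <term>
  <hw>umpire</hw>
 </term>
 <fl>noun</fl>
 <sens>
  <mc>a person who impartially decides or resolves a dispute or controversy</mc>
  <vi>usually acts as <it>umpire</it> in the all-too-frequent squabbles between
  the two other roommates</vi>
  <syn>adjudicator, arbiter, arbitrator, referee, umpire</syn>
  <rel>jurist, justice, magistrate; intermediary, intermediate, mediator,
  mediatrix, moderator, negotiator; conciliator, go-between, peacemaker,
  reconciler, troubleshooter; decider</rel>
 </sens>
</entry>"""


def build_query_url(word, api_key):
    """Return a query URL of the form shown in Listing 1."""
    return f"{BASE_URL}{quote(word)}?{urlencode({'key': api_key})}"


def split_terms(text):
    """Split a comma- and semicolon-separated word list into clean terms."""
    if not text:
        return []
    flat = " ".join(text.split())  # collapse line breaks and indentation
    return [term.strip()
            for group in flat.split(";")
            for term in group.split(",")
            if term.strip()]


def parse_entry(xml_text):
    """Extract headword, part of speech, and word lists from one <entry>."""
    entry = ET.fromstring(xml_text)
    senses = []
    for sens in entry.findall("sens"):
        senses.append({
            "meaning": " ".join((sens.findtext("mc") or "").split()),
            "synonyms": split_terms(sens.findtext("syn")),
            "related": split_terms(sens.findtext("rel")),
        })
    return {
        "headword": entry.findtext("term/hw"),
        "part_of_speech": entry.findtext("fl"),
        "senses": senses,
    }


if __name__ == "__main__":
    print(build_query_url("umpire", "API-Key"))
    result = parse_entry(SAMPLE_RESULT)
    print(result["headword"], result["part_of_speech"])
    print(result["senses"][0]["synonyms"])
```

To query the service itself, you would fetch the generated URL with your own API key and pass the response text to `parse_entry()`, assuming the response has the same structure as Listing 2.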

The competing dictionary publishers Pons and Langenscheidt take two slightly different approaches. Pons provides a connection to an in-house database as a standalone service [21]. Langenscheidt, on the other hand, focuses on print books and dedicated apps for different mobile devices.

Visual Thesaurus [22] allows users to take a look at the facets of individual words by way of graphs. These graphs show the connections between words visually as individual nodes and edges in a web browser (Figure 2).

Figure 2: The Visual Thesaurus website maps out the connections between words on a tree, using the word "help" in this example.

The image is created with JavaScript and can be rotated in any direction by clicking on the desired node. If you don't have an API key, you may make only a limited number of queries. Enough queries are generally provided to give the user an impression of the service.
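
The node-and-edge view described above can be approximated with a plain adjacency structure. The following Python sketch is a generic illustration and does not reflect how Visual Thesaurus is implemented: The function names and the sample word groups are invented for this example. Each word is a node, each shared group creates edges, and a breadth-first search finds the words within a given number of hops of a starting word.

```python
from collections import defaultdict, deque


def build_graph(word_groups):
    """Connect every pair of words that appear in the same group."""
    graph = defaultdict(set)
    for group in word_groups:
        for word in group:
            graph[word].update(other for other in group if other != word)
    return graph


def neighbors_within(graph, start, max_hops):
    """Return {word: distance} for words at most max_hops edges from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if distance[word] == max_hops:
            continue  # do not expand beyond the requested radius
        for nxt in sorted(graph.get(word, ())):
            if nxt not in distance:
                distance[nxt] = distance[word] + 1
                queue.append(nxt)
    del distance[start]
    return distance


# Illustrative groups, invented for this sketch.
groups = [
    ["help", "aid", "assist"],
    ["aid", "support"],
    ["support", "back"],
]
graph = build_graph(groups)
print(neighbors_within(graph, "help", 2))
```

Limiting the search radius mirrors what an interactive display does when it shows only the immediate surroundings of the selected node.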

Wordnik

The commercial product Wordnik [23] is a kind of enhanced dictionary for the English language that focuses particularly on mobile devices. The range of functions includes descriptions, explanations of the meanings of the word, and a large number of examples. Wordnik combines results from many different sources simultaneously (e.g., from Wiktionary or WordNet). See Figure 3.

Figure 3: The commercial product Wordnik presents clear and well-structured search results.

All modules are available under the Apache license. The source code can be found in a GitHub repository. You can connect to the service via various modules and interfaces for Python, Ruby, JavaScript, Java, and PHP. You must register with the vendor before you receive the corresponding API key.
