A simple search for GPS apps for your smartphone these days can turn up results covering everything from baseball players to Moroccan cuisine. Often, Internet search results only make sense when you add terms to the search itself.
The reason lies in the analysis of the language, which is integrated into the search engine's utilization process. This analysis looks at the relationships between the words based on various criteria, which are then used to refine the results. Computers also appear to understand what you are looking for by providing you with exact and varied results even if typos are present within your search terms.
Although the user interface (e.g., an input screen) often appears quite mundane, the process behind these searches has little in common with a simple phrase or keyword search [1]. In the past, a query didn't find files or documents unless the correctly chosen (and correctly spelled) search terms were used.
However, search engines have evolved during the past 20 years into sophisticated language processors in which many criteria play a vital role. Search engines use more than 50 different processes including the language of the document, the format and structure, technical terms within the metadata that register how often a document is referenced, or a link itself (e.g., degree of network integration).
The result then is based on concepts that belong together thematically. In addition to a large dose of statistics, knowledge about the linguistic context of the individual words plays a major role. This originates from the linguistic components of the thesauri (see the "Origins" box).
Origins
The term thesaurus comes from the ancient Greek word thesauros , which means treasure or treasure trove . Unsurprisingly, the word for thesaurus in Latin is also "thesaurus" and refers to a thematically ordered collection of certain criteria in which objects are related one to another – a "storehouse of knowledge" if you will. A thesaurus is essentially a dictionary based on linguistics and information science, that compiles the entire vocabulary of a language.
In the 1950s, the thesaurus became a specialized reference book with controlled, limited vocabulary pertaining to individual words as well as their relations to other words. The basis for the vocabulary used in a thesaurus was developed under the authority of the German National Library [2] and the Library of Congress Subject Headings (LCSH) [3]. Synonyms, as well as generic terms and narrower terms, are primarily used. The relationships between the terms are standardized in accordance with DIN 1463-1 (more specifically ISO 2788) and are known as associations and references.
Current examples include the thesaurus Linguae Latinae (abbreviated ThlL or TLL) [4] for Latin, Linguae Graecae (TLG) [5] or the UNESCO Thesaurus [6].
The latter is more of a collective work in the area of education, science, culture, social sciences and humanities, information and communication sciences, political science, law, and economic sciences. All of the entries are in English, French, Spanish, and Russian. The European Thesaurus on International Relations and Area Studies [7] and the Getty Thesaurus of Geographic Names (TGN) [8] are also very useful. These are now available as open data for anyone who is interested (Figure 1).
Linguistic thesauri are based on applied linguistics and use mindmaps to organize vocabulary. The main goals for this are to display the invisible connections (semantics) between words of different origins and to illustrate the similarity (relations and associations) between each word.
Additionally, thesauri serve to explore the history of language and to determine specific meanings and their history. Thesauri assist people in day-to-day life as dictionaries of synonyms. These references can help create you more elegant expressions and may also help you become more skillful in a given language.
In IT, thesauri often come bundled with word processors, as well as with search engines. They often form the basis for spellcheckers and reinforce assistance with proper grammar. KThesaurus [9] and OpenThesaurus for LibreOffice [10] are just a couple examples of the practical application of thesauri.
The Swiss website Lexikon.ch [11] offers an introduction to research in German-speaking countries. This offers a special search engine for encyclopedias, thesauri, dictionaries, quotation collections, abbreviations, and rhyming dictionaries. The site catalogs both commercial and free projects.
Woxikon [12], Leo [13], and Beolingus/Dict [14], among others, are offered solely online. Woxikon and Leo offer supplemental information for Slavic, Romance, and Scandinavian languages. Beolingus/Dict focuses on English, Spanish, and Portuguese. Leo and Beolingus/Dict were developed through the scientific endeavors of TU Munich and TU Chemnitz, and they work in cooperation with the French National Center for Textual and Lexical Resources (CNRTL) [15] in Nancy (Lothringen) as well as with OpenThesaurus [16] and WordNet [17].
Commercial thesauri as resources were traditionally presented in book form (e.g., Oxford English Dictionary, Roget's, German Standard Edition of Duden). Most publishers integrate their thesauri into their online editions and make it possible to use on a web browser or mobile app. Admittedly, these websites are more oriented toward occasional users.
The publishers provide an interface (API) for unlimited use and integration into your own application. Working with this requires that you register and acquire an API key. You will need to use this key for every query.
The Macmillan Dictionary [18], Merriam-Webster [19], and Cambridge Dictionaries Online [20] provide their results in XML Data or Javascript Object Notation (JSON) and comply with current, accepted web standards. Listing 1 shows a query on Merriam-Webster and Listing 2 displays the subsequent search result.
Listing 1
Merriam-Webster Query
http://www.dictionaryapi.com/api/v1/references/thesaurus/xml/umpire?key=API-Key
Listing 2
Merriam-Webster Result
<entry id="umpire"> <term> <hw>umpire</hw> </term> <fl>noun</fl> <sens> <mc>a person who impartially decides or resolves a dispute or controversy</mc> <vi>usually acts as <it>umpire</it> in the all-too-frequent squabbles between the two other roommates</vi> <syn>adjudicator, arbiter, arbitrator, referee, umpire</syn> <rel>jurist, justice, magistrate; intermediary, intermediate, mediator, mediatrix, moderator, negotiator; conciliator, go-between, peacemaker, reconciler, troubleshooter; decider</rel> </sens> </entry>
The competing dictionary publishers Pons and Langenscheidt take two slightly different approaches. Pons provides a connection to an in-house database as a standalone service [21]. Langenscheidt, on the other hand, focuses their offerings in the form of print books and specific apps for different mobile devices.Visual Thesaurus [22] allows users to take a look at the facets of individual words by the way of graphs. These graphs show the connections between words visually as individual nodes and edges in a web browser (Figure 2).
The image is created with Javascript and can be rotated in any direction, by clicking on the desired node. If you don't have an API key, you may make only a limited number of queries. Enough queries are generally provided to give the user an impression of the service.
The commercial product Wordnik [23] is a kind of enhanced dictionary for the English language that focuses particularly on the issue of mobile devices. The range of functions includes descriptions, explanations of the meanings of the word, as well as a large number of examples. Wordnik combines results from many different sources simultaneously (i.e., from Wiktionary or WordNet). See Figure 3.
All modules are available under the Apache license. The source code can be found in a GitHub repository. The connection is possible via various modules and interfaces for Python, Ruby, JavaScript, Java, and PHP. The application requires registration by the manufacturer before you receive the corresponding API key.
Many thesaurus projects are available in free software. Although most are not as well known as their proprietary counterparts, they often manage to be as feature-rich. The aforementioned WordNet is the result of a research project by the same name from Princeton University. The institution has been working on an English lexicographical database for several decades.
The database groups nouns, verbs, adjectives, and adverbs at the semantic and lexical level. The project forms the basis for comparative linguistics and natural language processing, and is therefore the basis for several of the programs presented here.
In addition to a web-based interface, the current state of research for various platforms is also available (as an Ubuntu package [24], among others). This includes both a wn command-line program, as well as a graphical application called WordNet Browser.
With the query in Listing 3, you can gain insight into the synonyms for and meaning of the substantive "fair." The parameter -synsn stands for and selects synonym, whereas n does the same for substantives (English nouns). Using the command wnb , you can start the GUI program and type in the search box on the top left.
Listing 3
Synonyms for "fair"
01 $ wn fair -synsn 02 03 Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun fair 04 05 4 senses of fair 06 07 Sense 1 08 carnival, fair, funfair 09 => show 10 11 Sense 2 12 fair 13 => gathering, assemblage 14 15 Sense 3 16 fair 17 => exhibition, exposition, expo 18 19 Sense 4 20 bazaar, fair 21 => sale, cut-rate sale, sales event
Below the input box, four buttons appear that display the respective available word form. To restrict the list on synonyms for nouns, click the "Noun" button and select "Synonyms, ordered by estimated frequency" from the list. The result (Figure 4) is identical to the output on the command line.
Several implementations exist for WordNet and are listed on the project website. To use Perl, it is best to use the WordNet-QueryData module [25], which is available as an Ubuntu package libwordnet-querydata-perl .
For Python, the Python Natural Language Toolkit (NLTK) is a good choice [26]. The latter provides a suitable parsing class for WordNet.
Kthesaurus (Figure 5) provides similar functions for Calligra-Suite (formally KOffice) as OpenThesaurus does for LibreOffice.
The lexical information gets extracted from the WordNet databank. Because of this, Kthesaurus is only available in English. To use the software, install the package for your distribution.
In the box in the top left, first enter the word you want and scroll over the Search button to search within the database. Then, under the Thesaurus tab, you will see three columns filled with synonyms (column 1), hypernyms (column 2), and hyponyms (column 3).
The Replace button replaces the word in the text with the selections (note that this is only possible if you have activated Kthesaurus from within Calligra-Office). You can change the search vocabulary by selecting one of the entries from the columns with a double click. By using tabs, you can switch between the original search and the entry from the WordNet databank.
Figure 6 displays the overview for the word "help" and is sorted according to the average frequency. Use the drop-down box to get more information in accordance with the way these are stored in the WordNet databank (i.e., by compound words, synonyms, antonyms, and everyday words).
GoldenDict [27] combines different sources under a unified user interface. The desktop application is based on Qt and Webkit-Framework [28]. The Ubuntu program package can be found under goldendict .
This program serves as an interface to various dictionaries and data sources including Wikipedia, Wiktionary, and WordNet in order to derive and compile the information. As a user, you can specify which program channels to evaluate in the settings. In the input box at the top left, you must first enter a search item. Then, similar items from which you can select will appear in the left column. In the middle column, you will see the search results, and in the right column you will see the sources the app used (Figure 7).
The term "fair" was used in the example query for Wikipedia in both German and English, as well as on the WordNet databank. The latter is shown as an edited search result here.
To use a thesaurus based on WordNet, you will need the goldendict-wordnet package from the package sources, which will let you connect the two projects together.
Wiktionary [29] is one of the more impressive Wikipedia offshoots with its now more than 370,000 entries in more than 200 languages.
Although Wiktionary is primarily used as a dictionary, it records hyphenation rules, pronunciation, meaning, origin, synonyms, and antonyms (Figure 8). Additionally, Wiktionary is used mainly from a web browser, although there is an API to integrate it within your own projects online [30].
This article provides an overview of different thesauri, which are available either commercially or for free. Whether you are doing a quick search in your web browser or via the command line, there is a thesaurus for everybody out there.
Infos