Each stored canonical form is associated with the following attributes: the grammatical category, the inflection, the number of basic inflections (gender and number) that it has, the number of syllables, the position of the stressed syllable, the number of etymologies, the total number of meanings, the number of meanings for the grammatical category, the frequency of occurrence in the CREA and the seniority, among other data related to the meanings.
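The attribute list above can be pictured as one record per stored canonical form. The following is only a minimal sketch: the class name `LexiconEntry`, the field names and the types are assumptions made for illustration, not the actual storage format of the Lexicon TIP. Numeric fields default to `None` so that no value is invented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """Hypothetical record for one stored canonical form."""
    canonical_form: str
    category: str                             # grammatical category
    inflection: Optional[str] = None          # inflection label (encoding is an assumption)
    basic_inflections: Optional[int] = None   # number of basic inflections (gender and number)
    syllables: Optional[int] = None           # number of syllables
    stressed_syllable: Optional[int] = None   # position of the stressed syllable
    etymologies: Optional[int] = None         # number of etymologies
    total_meanings: Optional[int] = None      # meanings across all categories
    category_meanings: Optional[int] = None   # meanings for this grammatical category
    crea_frequency: Optional[float] = None    # frequency of occurrence in the CREA
    seniority: Optional[str] = None           # representation is an assumption


# Only fields that can be stated with certainty are filled in.
entry = LexiconEntry(canonical_form="coca", category="feminine noun", syllables=2)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch: it lets entries be used as dictionary keys or set members when several entries share the same canonical form.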
The grammatical categories of words are a key element in natural language processing. Therefore, the same canonical form has been stored once for each grammatical category that the word can take. Etymology is also a factor in storing the same canonical form more than once, given that it affects the lexico-genetic relations that words have with each other, and words of different origin should not be merged into the same entry. For example, the nouns considered are: noun, toponymic noun, patronymic noun, anthroponymic noun, proper noun, abbreviation used as noun, acronym used as noun, symbol used as noun, foreign noun, cardinal numeral noun, ordinal numeral noun and fractional numeral noun. The adjectives have been classified into 14 groups, the adverbs into 21 groups, the pronouns into 14 groups and the conjunctions into 16 groups. Articles, prepositions, contractions, interjections, onomatopoeias, expressions and locutions are also stored. This classification has led, for example, to the following entries in the Lexicon TIP:
coca: Six feminine nouns corresponding to six different etymologies, a toponymic noun and a patronymic noun. All entries have the corresponding attributes and inflections.
cuando: A noun, a preposition, two adverbs and five conjunctions. All entries have the corresponding attributes and inflections.
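The two examples above can be used to sketch how one canonical form yields several entries, and why the total number of entries exceeds the number of unique canonical forms. The tuple layout `(form, category, variant)` and the helper `build_entries` are hypothetical; the `variant` index simply tells apart entries that share a form and a category, whatever the real distinguishing attribute (etymology or subgroup) may be.

```python
from collections import Counter


def build_entries(form, counts):
    """Expand {category: how many entries} into (form, category, variant) tuples."""
    return [
        (form, category, variant)
        for category, n in counts.items()
        for variant in range(1, n + 1)
    ]


# Entry counts taken from the two examples in the text.
coca = build_entries("coca", {"feminine noun": 6, "toponymic noun": 1, "patronymic noun": 1})
cuando = build_entries("cuando", {"noun": 1, "preposition": 1, "adverb": 2, "conjunction": 5})

entries = coca + cuando
total_forms = len(entries)                               # every stored entry
unique_forms = len({form for form, _, _ in entries})     # distinct spellings only
per_form = Counter(form for form, _, _ in entries)

print(total_forms, unique_forms, dict(per_form))
```

On this toy data the script reports 17 entries for 2 unique canonical forms (8 for coca and 9 for cuando), which mirrors, at a very small scale, the gap between the two totals reported for the whole lexicon.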
Under this criterion the Lexicon TIP consists of:
|Canonical forms||259 399|
|Unique canonical forms||226 104|
And the distribution by grammatical categories is:
|Grammatical category||Canonical form|
All the corresponding inflections are stored for each canonical form, regardless of their frequency of use in Spanish. The inflections considered are: gender, number, neuter, superlative, diminutive, augmentative and pejorative. For each of them, three levels are distinguished in the corpus: regular inflection, irregular inflection and very irregular inflection. In addition, words that are common or ambiguous in terms of gender or number have been tagged. For each inflected form, its frequency in the CREA is stored. As a result, the corpus contains many identical words that are related to different canonical forms. According to this criterion, the Lexicon TIP consists of:
|Words||6 334 405|
|Unique words||4 361 506|
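The difference between the two figures follows from the same inflected form being stored under several canonical forms. A minimal sketch of that effect, assuming the six feminine-noun entries for coca mentioned earlier each store a singular and a plural form; the identifiers `coca#1` to `coca#6` and the index layout are hypothetical, and no frequencies are included because none are given in the text:

```python
from collections import defaultdict

# Hypothetical identifiers for the six feminine-noun entries of "coca".
canonical_entries = [f"coca#{i}" for i in range(1, 7)]

# Each entry stores its own singular and plural, so spellings repeat across entries.
inflected = [
    (form, entry)
    for entry in canonical_entries
    for form in ("coca", "cocas")
]

# Map each spelling to every canonical entry it belongs to.
index = defaultdict(list)
for form, entry in inflected:
    index[form].append(entry)

words = len(inflected)        # stored inflected forms, counted per entry
unique_words = len(index)     # distinct spellings

print(words, unique_words)
```

Here 12 stored forms collapse to 2 unique spellings, because each spelling is linked to six different canonical entries.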