Free/open KACSTAC and its processing tools: Lexical resources for Arabic lexicogrammatical microstructures based on collocational indicators
Almujaiwel, Sultan. 2016
Open/free Arabic corpora and Arabic corpus tools are new types of Arabic resources for both corpus-based studies and the evaluation of Arabic language resources (ALRs). This paper endeavours to apply corpus linguistics (CL) to the information organisation of entries in Arabic lexicography using collocational indicators (systematic collocational, colligational and semantic set indicators). This requires large-scale corpora in which the quantity of tokens and types is large enough to provide empirical evidence and statistical probabilities. The rationale behind applying the large-scale KACST Arabic Corpus (KACSTAC) is that investigating the linguistic behaviours and contextual consequences of Arabic lexical entries requires an examination of a wide range of word occurrences. The methodology proposes that the information on one entry, its derivational/rich-morphological forms, and its syntactic and semantic sets in the Arabic Lexicon Corpus (ALC) is processed and retrieved in Ghawwas, a standalone Arabic corpus processing tool. The resulting findings are compared with the real contextualisation of such information as it appears in the 700M KACSTAC. The aim is to show the gap between the Arabic words recorded in existing lexicographical works and real standard Arabic usage, since that corpus contains the classical and standard varieties of Arabic. Although the information from such sets in lexicographical works cannot represent real use in natural language, this method shows the extent of the derivations and reflects on, or extends, how a single root is compatible with further dimensions of the contextual behaviours of lexeme sets.
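The comparison between lexicon information and corpus evidence described above can be illustrated with a minimal sketch. It is not the procedure implemented in Ghawwas, and it does not use the KACSTAC or ALC data. It assumes a pre-tokenised corpus held as a plain list of strings, and the function names `collocates` and `missing_from_lexicon`, the `window` and `min_freq` parameters, and the placeholder tokens are all hypothetical. The sketch counts the words that occur near a node word and lists those that recur in the corpus but are absent from a lexicon entry's recorded collocates.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node occurrence itself
                counts[tokens[j]] += 1
    return counts


def missing_from_lexicon(corpus_counts, lexicon_collocates, min_freq=2):
    """Collocates seen at least `min_freq` times in the corpus but not in the lexicon entry."""
    return sorted(
        word
        for word, freq in corpus_counts.items()
        if freq >= min_freq and word not in lexicon_collocates
    )


# Placeholder tokens standing in for a tokenised corpus (assumption, not real data).
toy_tokens = ["a", "node", "b", "c", "node", "b"]
toy_counts = collocates(toy_tokens, "node", window=1)
gap = missing_from_lexicon(toy_counts, {"a"}, min_freq=2)
print(gap)
```

Raw co-occurrence counts are used here only for brevity. A study of this kind would normally rely on an association measure and on morphological analysis of Arabic word forms, neither of which this sketch attempts.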