Language Support
Currently, Label Sleuth can work with text data in the following languages:
ENGLISH (default)
ITALIAN
ROMANIAN
HEBREW
ARABIC
To start up the system with your chosen language, use the following command:
python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE>
Note that not every machine learning model is compatible with every language. For model-language compatibility, see here.
Adding support for a new language
The system can easily be extended to support additional languages, and we encourage developers who are fluent in additional languages to contribute them to Label Sleuth.
To support a new language, follow the steps below:
Find your desired language on the page for FastText word vectors
Assuming your language is listed on this webpage, you will need to find the 2- or 3-letter language code associated with it.
This can be done by looking at the download links that appear next to your desired language. The language code can be found within the template cc.{XX}.300 in the download link. For example, the download link for Nepali is https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz, indicating that the appropriate code is ne. This code is the fasttext_language_id.
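If you prefer to extract the code programmatically, the lookup can be sketched as below. This is a minimal illustration, not part of Label Sleuth: the helper name extract_fasttext_language_id is hypothetical, and the pattern assumes the download link follows the cc.{XX}.300 template with a lowercase 2- or 3-letter code.

```python
import re
from typing import Optional

# Matches the cc.{XX}.300 template, where XX is assumed to be a
# lowercase 2- or 3-letter language code.
_FASTTEXT_CODE = re.compile(r"cc\.([a-z]{2,3})\.300\.")


def extract_fasttext_language_id(download_link: str) -> Optional[str]:
    """Return the language code embedded in a FastText download link, or None."""
    match = _FASTTEXT_CODE.search(download_link)
    return match.group(1) if match else None


nepali_link = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ne.300.bin.gz"
print(extract_fasttext_language_id(nepali_link))
```

For the Nepali link from the example above, the helper returns the code that appears between "cc." and ".300".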
Compile or find a list of stop-words for the language
Stop-words are words that are considered less “meaningful”. In English, for instance, a word like “such” is considered a stop-word, as it carries very little semantic meaning compared to words like “kitchen” or “celebrate”. For this reason, stop-words are often ignored by automated language systems, including some components of Label Sleuth.
Bear in mind that the specific set of stop-words you choose is not so crucial; if in doubt, you can go for a very short list of words that come to mind.
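If no ready-made list is available, one possible starting point is to take the most frequent words from some sample text in the language and prune the result by hand. The sketch below illustrates that idea only; it is not how Label Sleuth builds its own lists, the function name candidate_stop_words is hypothetical, and the whitespace tokenization is a deliberate simplification.

```python
from collections import Counter
from typing import Iterable, List


def candidate_stop_words(texts: Iterable[str], top_n: int = 20) -> List[str]:
    """Return the top_n most frequent whitespace-separated tokens in texts."""
    counts: Counter = Counter()
    for text in texts:
        # Crude tokenization: lowercase and split on whitespace only.
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(top_n)]


sample_texts = [
    "the cat sat on the mat",
    "the dog and the cat",
    "a dog on the mat",
]
print(candidate_stop_words(sample_texts, top_n=1))
```

Frequent content words (such as "cat" in this toy sample) will also rank highly, so the output should be reviewed by someone fluent in the language before it is used.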
Create a Language object
Each language is defined by a Language instance in languages.py. Create a new object for your language, filling in the name of the language as well as the information from steps 1-2. For example, this is the language object for Arabic:
Arabic = Language(name='Arabic', stop_words=["التى", "التي", "الذى", "الذي", "الذين", "ذلك", "هذا", "هذه", "هؤلاء", "قد", "وقد", "حيث", "ان", "إن", "انه", "وان", "فان", "فإن", "بان", "اي", "أي", "ايضا", "أيضا", "إياه"], fasttext_language_id='ar', right_to_left=True)
Note that in this particular case an additional parameter, right_to_left, is specified; this is only necessary for languages that use a right-to-left writing direction.
Add the object to the Languages class
languages.py also contains a Languages class, which holds all the languages supported by the system. Simply add your newly created language object to Languages.
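Creating the object and registering it can be sketched as follows. This is a self-contained illustration under stated assumptions: the Language dataclass here is a stand-in whose fields mirror the Arabic example (the real class in languages.py may define more), the layout of Languages as one class attribute per language is assumed, and Spanish with a handful of stop-words is a hypothetical addition chosen only for the example.

```python
from dataclasses import dataclass
from typing import List


# Stand-in for the Language class in languages.py. Field names follow the
# Arabic example; the real definition may differ or include more fields.
@dataclass
class Language:
    name: str
    stop_words: List[str]
    fasttext_language_id: str
    right_to_left: bool = False  # assumed default for left-to-right languages


# Hypothetical new language: "es" is the code in the FastText link cc.es.300.
Spanish = Language(name='Spanish',
                   stop_words=["el", "la", "de", "que", "y", "en"],
                   fasttext_language_id='es')


# Assumed layout: one class attribute per supported language.
class Languages:
    # ... existing languages would be listed here ...
    SPANISH = Spanish


print(Languages.SPANISH.name, Languages.SPANISH.fasttext_language_id)
```

Because Spanish is written left to right, right_to_left is simply left at its (assumed) default here.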
Try it out
All done! As specified at the top of this page, you can now start up Label Sleuth with your chosen language:
python -m label_sleuth.start_label_sleuth --language <YOUR_LANGUAGE>
(It may take a little longer to start up the system for the first time, as the system downloads the necessary files for the newly added language.)
Once Label Sleuth has started up using the chosen language, you can load documents in this language and work with the system as usual. Try out the system in the new language for a few model training iterations to make sure that the language extension works and the system can learn a model in the new language.
Be sure to open a pull request so that fellow speakers of the language can use it!