Metadata-Version: 1.0
Name: text-hr
Version: 0.13
Summary: Morphological/Inflection Engine for Croatian language, POS tagger, stopwords
Home-page: http://bitbucket.org/trebor74hr/text-hr/
Author: Robert Lujo
Author-email: trebor74hr@gmail.com
License: UNKNOWN
Description: Morphological/Inflection Engine for Croatian language
        =====================================================
        "text-hr" is a morphological/inflection engine for the Croatian language,
        written in Python. It includes a stopword list and a part-of-speech (POS)
        tagging engine that detects word types with an inverse inflection algorithm.
        
        The API is not yet frozen, so this project is still in alpha.
        
        TAGS
        ----
        Croatian language, python, natural language processing (NLP),
        Part-of-speech (POS) tagging, stopwords, inverse inflection,
        morphological lexicon
        
        
        OZNAKE
        ------
        Hrvatski jezik, Python biblioteka, morfologija, infleksija, obrnuta
        infleksija, prepoznavanje vrsta riječi, računalna obrada govornog jezika,
        zaustavne riječi, morfološki leksikon
        
        AUTHOR
        ======
        Robert Lujo, Zagreb, Croatia (the e-mail address is in the LICENCE file)
        
        
        FEATURES
        ========
        To name the most important:
        - inflection system - for producing all forms of one word
        - detection of word types (POS tagging) - from existing list of word forms
        - list of stopwords
        
        The system works with unicode strings; the default codepage for converting
        to and from byte strings is cp1250.
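        
        As a rough illustration of what that conversion means, the sketch below
        round-trips a Croatian word through cp1250 using only the Python standard
        library. It does not call text-hr itself; the helper names ``to_bytes`` and
        ``to_unicode`` are made up for this example, and "cp1250" is the name of the
        standard Python codec.
        
        ```python
        from __future__ import print_function
        
        CODEPAGE = "cp1250"  # Python codec name for the codepage named above
        
        
        def to_bytes(text, codepage=CODEPAGE):
            """Encode a unicode string into a byte string (hypothetical helper)."""
            return text.encode(codepage)
        
        
        def to_unicode(raw, codepage=CODEPAGE):
            """Decode a byte string back into a unicode string (hypothetical helper)."""
            return raw.decode(codepage)
        
        
        word = u"broje\u0107i"          # the word form "brojeci" with c-acute
        raw = to_bytes(word)            # one byte per character in cp1250
        assert raw == b"broje\xe6i"     # c-acute is byte 0xE6 in cp1250
        assert to_unicode(raw) == word  # the round trip is lossless
        print(repr(raw))
        ```
        
        The byte 0xE6 is the same one that shows up as ``\xe6`` in the detection
        output in `Getting started`_.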
        
        Check `Getting started`_.
        
        INSTALLATION
        ============
        If you have pip (http://pypi.python.org/pypi/pip) installed::
        
            pip install text-hr
        
        Otherwise, install it the old-fashioned way:
        - download the zip from http://pypi.python.org/pypi/text-hr/
        - unzip it
        - open a shell
        - go to the distribution directory
        - run python setup.py install
        
        
        GETTING STARTED
        ===============
        There are three important parts that this project provides:
        - `Inflection system`_ - for producing all forms of one word
        - `Detection of word types (POS tagging)`_ - from existing list of word forms
        - `List of stopwords`_
        
        Inflection system
        -----------------
        Usage example (start a Python shell)::
        
        >>> from text_hr import Verb
        >>> v = Verb("platiti")
        >>> for k in sorted(v.forms.keys()):
        ...     print k, v.forms[k]
        ...
        AOR/P/1 [u'platismo']
        AOR/P/2 [u'platiste']
        AOR/P/3 [u'plati\u0161e']
        AOR/S/1 [u'platih']
        AOR/S/2 [u'plati']
        AOR/S/3 [u'plati']
        IMP/P/1 [u'platasmo', u'pla\u0107asmo', u'platijasmo']
        IMP/P/2 [u'plataste', u'pla\u0107aste', u'platijaste']
        IMP/P/3 [u'platahu', u'pla\u0107ahu', u'platijahu']
        ...
        VA_PA//P_O+S+V+N [u'pla\u0107eno']
        X_INF// [u'platiti']
        X_VAD_PAS// [u'plativ\u0161i']
        X_VAD_PRE// [u'plate\u0107i']
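        
        The output above suggests that ``forms`` behaves like a dict mapping a key
        such as ``AOR/S/2`` to a list of word forms. Under that assumption, the
        sketch below builds a reverse index from a word form back to the keys that
        produce it. It works on a plain dict with a few entries copied from the
        output above, so it runs without text-hr; ``build_reverse_index`` is a
        name invented for this example, not part of the library.
        
        ```python
        from __future__ import print_function
        
        # A few entries copied from the output above; the real dict has more keys.
        forms = {
            "AOR/S/1": [u"platih"],
            "AOR/S/2": [u"plati"],
            "AOR/S/3": [u"plati"],
            "X_INF//": [u"platiti"],
        }
        
        
        def build_reverse_index(forms):
            """Map each word form to the sorted list of keys that produce it."""
            index = {}
            for key, word_forms in forms.items():
                for word_form in word_forms:
                    index.setdefault(word_form, []).append(key)
            for keys in index.values():
                keys.sort()
            return index
        
        
        index = build_reverse_index(forms)
        print(index[u"plati"])  # ['AOR/S/2', 'AOR/S/3']
        ```
        
        With a real ``Verb`` instance, passing ``v.forms`` instead of the sample
        dict should work the same way, provided it really is a plain mapping.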
        
        Detection of word types (POS tagging)
        -------------------------------------
        TODO: this section is not written yet. See test_detect.txt for samples and
        detect.py for the logic.
        
        First example in test_detect.txt::
        
        >>> from text_hr.detect import WordTypeRecognizerExample
        >>> def test_it(word_list, wt_filter=None, level=2):
        ...     wdh = WordTypeRecognizerExample(word_list, silent=True)
        ...     if wt_filter is not None:
        ...         wdh.detect(wt_filter=wt_filter, level=level)  # e.g. wt_filter=["N"]
        ...     else:
        ...         wdh.detect(level=level)  # all word types
        ...     lines_file = LinesFile()
        ...     wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
        ...     print "\n".join(lines_file.lines)
        ...     return wdh
        
        >>> class LinesFile(object):
        ...     def __init__(self):
        ...         self.lines = []
        ...     def write(self, s):
        ...         self.lines.append(repr(s.rstrip()))
        
        >>> word_list = [
        ...   "Broj    84"
        ... , "broji   34"
        ... , "Brojila  28"
        ... , "broje   23"
        ... , "brojeći 22"
        ... , "brojim   7"
        ... , "brojimo  5"
        ... , "brojiš   4"
        ... , "brojahu  2"
        ... , "brojaše  1"
        ... , "brojite  1"
        ... , "-brijestovu 1"
        ... , "brijestovi 1"   #the only one checked with endswith, but all other will be checked with get_freq
        ... , "-brijestove 1"
        ... , "-brijestova 1"
        ... ]
        
        Lowest quality, but fastest::
        
        >>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS
        " 10/  183 -> brojati              (u'V-XX_-_JATI-je\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\xe6i,7/brojim,5/brojimo,4/broji\x9a,2/brojahu,1/brojite,1/broja\x9ae"
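        
        Each ``word_list`` entry above is a word form followed by whitespace and a
        count. The sketch below shows one way such a list could be produced from
        raw text with the standard library. It is an assumption that the counts
        are plain occurrence frequencies, and the sketch does not settle whether
        the recognizer expects unicode or cp1250 byte strings; the function name
        ``build_word_list`` is made up for this example.
        
        ```python
        from __future__ import print_function
        
        import re
        from collections import Counter
        
        
        def build_word_list(text):
            """Return 'word count' strings, most frequent first, ties alphabetical."""
            words = re.findall(r"\w+", text, re.UNICODE)
            counts = Counter(words)
            ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
            return [u"%s %d" % (word, count) for word, count in ordered]
        
        
        sample = u"broji broje broji brojim broji broje"
        word_list = build_word_list(sample)
        for entry in word_list:
            print(entry)
        ```
        
        The resulting list has the same shape as the hand-written one above, apart
        from the column alignment and the entries prefixed with ``-``, whose meaning
        is not covered here.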
        
        List of stopwords
        -----------------
        The list is located in std_words.txt; you can read it directly here:
        
        http://bitbucket.org/trebor74hr/text-hr/src/tip/text_hr/std_words.txt
        
        The list can be updated like this::
        
        >>> import text_hr
        >>> text_hr.dump_all_std_words()
        Totaly 2904 word forms dumped to r:\hg-clones\python\text-hr\text_hr\std_words.txt in codepage utf8
        
        Iteration over all words goes like this::
        
        from text_hr import get_all_std_words
        
        for word_base, l_key, cnt, _suff_id, wform_key, wform in get_all_std_words():
            print word_base, l_key, cnt, _suff_id, wform_key, wform
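        
        A typical use of such a list is filtering tokens. The sketch below assumes
        only the 6-tuple shape shown in the loop above, with the word form as the
        last item. The rows are hypothetical placeholders, not real library output,
        and lowercasing tokens before lookup is an assumption about how the forms
        are stored. The helper names are invented for this example.
        
        ```python
        from __future__ import print_function
        
        
        def stopword_forms(rows):
            """Collect the wform field (last item of each 6-tuple) into a set."""
            return set(row[5] for row in rows)
        
        
        def remove_stopwords(tokens, stopwords):
            """Keep only the tokens whose lowercase form is not a stopword."""
            return [token for token in tokens if token.lower() not in stopwords]
        
        
        # Hypothetical placeholder rows: only the tuple shape
        # (word_base, l_key, cnt, _suff_id, wform_key, wform) follows the text.
        rows = [
            (u"i", None, 1, None, None, u"i"),
            (u"ali", None, 1, None, None, u"ali"),
        ]
        
        stopwords = stopword_forms(rows)
        print(remove_stopwords([u"platiti", u"i", u"brojati"], stopwords))
        ```
        
        With text-hr installed, ``stopword_forms(get_all_std_words())`` would be the
        natural replacement for the placeholder rows.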
        
        
        Further
        -------
        Since there is currently no good documentation, the best sources of
        further information are the tests inside the modules and the tests in
        the tests directory (dev version). More information in `Running tests`_.
        You can always read the source.
        
        
        DOCUMENTATION
        =============
        There is currently no documentation; it is in progress.
        
        
        SUPPORT
        =======
        Since this project is limited by my free time, support is limited.
        
        
        REPORT BUG OR REQUEST FEATURE
        -----------------------------
        If you encounter a bug, the best way to report it is through the Bitbucket
        page http://bitbucket.org/trebor74hr/text-hr.
        
        If there is interest in development for other inflection-rich
        languages, I'd be glad to decouple the language-specific code and create a
        new project capable of dealing with multiple languages.
        
        The best way to contact me is by mail (the address is in LICENCE).
        
        The TODO list is in readme.txt (dev version).
        
        
        CONTRIBUTION
        ============
        Since the API of this project is not yet stable, contributions
        should wait for a while.
        
        
        RUNNING TESTS
        =============
        All tests are doctests (not unittests). There are three types of tests in the
        package:
        
        1. doctests in each module - e.g. in verbs.py
        2. doctests in tests/test_*.txt - only in the development version
        3. tests whose output is not compared automatically - i.e. in a special call
           mode detect.py can produce an output file that needs to be compared
           manually with an existing file. Such tests are very slow. This needs
           to be changed to be automatic.
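        
        For readers unfamiliar with doctests, the sketch below shows the general
        mechanism behind test type 1 using only the standard ``doctest`` module. The
        function ``count_vowels`` and the helper ``run_doctests`` are made up for
        this example and are not part of text-hr; how the package's own modules
        invoke doctest is not shown here.
        
        ```python
        from __future__ import print_function
        
        import doctest
        
        
        def count_vowels(word):
            """Count the vowels in a word.
        
            >>> count_vowels("platiti")
            3
            >>> count_vowels("broj")
            1
            """
            return sum(1 for char in word if char in "aeiou")
        
        
        def run_doctests(func):
            """Run the doctests in func's docstring; return (failed, attempted)."""
            finder = doctest.DocTestFinder()
            runner = doctest.DocTestRunner(verbose=False)
            # module=False skips module lookup, so globs must be passed explicitly.
            tests = finder.find(func, name=func.__name__, module=False,
                                globs={func.__name__: func})
            for test in tests:
                runner.run(test)
            return runner.summarize(verbose=False)
        
        
        results = run_doctests(count_vowels)
        print(results.failed, results.attempted)
        ```
        
        Tests of type 2 keep the same ``>>>`` sessions in a separate text file, which
        the standard library can run with ``doctest.testfile``.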
        
        Running a module directly runs 1., and also 2. when running from the
        development version. To get the development version
        (http://bitbucket.org/trebor74hr/text-hr)::
        
        hg clone https://trebor74hr@bitbucket.org/trebor74hr/text-hr
        
        
        Then create text_hr.pth in the Python site-packages directory, containing
        the path to text-hr, e.g.::
        
        r:\hg-clones\python\text-hr
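        
        The same file can also be written from Python. The sketch below is a minimal
        helper, assuming you pass in the site-packages directory and the checkout
        directory yourself; ``write_pth`` is a name invented for this example.
        
        ```python
        from __future__ import print_function
        
        import os
        
        
        def write_pth(site_packages_dir, checkout_dir, name="text_hr.pth"):
            """Write a .pth file that adds checkout_dir to the Python path."""
            pth_path = os.path.join(site_packages_dir, name)
            with open(pth_path, "w") as handle:
                # A .pth file lists one directory per line.
                handle.write(os.path.abspath(checkout_dir) + "\n")
            return pth_path
        
        
        # Example call (both paths are placeholders to adjust):
        # write_pth(r"C:\Python27\Lib\site-packages", r"r:\hg-clones\python\text-hr")
        ```
        
        On most installations ``site.getsitepackages()`` lists the candidate
        site-packages directories.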
        
        To run all tests:
        - go to the tests directory
        - run tests.py like this (sample output included)::
        
        > python tests.py
        testing module   __init__
        testing module   adjectives
        ...
        testing textfile R:\hg-clones\python\text-hr\tests\test_adj.txt
        ...
        testing textfile R:\hg-clones\python\text-hr\tests\test_verbs_type.txt
        
        To run tests for just one module:
        - go to the text_hr directory
        - run the tests by running the module, e.g.::
        
        > py pronouns.py
        __main__: running doctests
        ..\tests\test_pronouns.txt: running doctests
        
        - if you're not running from the dev version, you'll get output like
        this::
        
        > py pronouns.py
        __main__: running doctests
        ..\tests\test_pronouns.txt: Not found, skipping
        
        TODO
        ====
        Various things; see readme.txt for details.
        
        CHANGES
        =======
        0.13
        ----
        ulr1 100610:
        - text_hr package reorganized (__init__.py with __all__ and imports ...)
        - word_types.py removed
        - std_words.txt
        
        0.12
        ----
        ulr1 100608 :
        - README
        - enabled tests from tests.py for all
        - enabled tests directly from each module
        
        0.11
        ----
        ulr1 100607:
        - recreated repo at bitbucket
        - no .suff_registry.pickle and testing_*.out put in zip
        
        0.10
        ----
        ulr1 100605:
        - first installable release
        
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Natural Language :: English
Classifier: Natural Language :: Croatian
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Indexing
