
===================================================================
tokenizer, putzer, htmlEnt2Char -- three tools for corpus processing
===================================================================

Preface:

	I tried to write a fast, rule-based, and also to some extent robust
	tokenizer and sentence segmenter.

	Currently supported languages are:
		* German (see also file LIESMICH)
		* English (thanks also to Michaela Geierhos)
		* Russian
	For each language, the corresponding ISO and MS-Windows codepages are
	supported, as well as rudimentary UTF-8.

Install:

	Compile and install (see ./tokenizer-1.0/INSTALL):
		$> cd ./tokenizer-1.0/
		$> ./configure
		$> make
		#> make install
			OR
		$> sudo make install

Usage:

	===========================================================================

	tokenizer    -- a tokenizer with end-of-sentence detection.

		tokenizer OPTIONS [FILES]

		options:
			-o <file>    output filename
			-L <lang>    language in a specific charset, currently supported:
							de     german (iso-8859-1)
							de-win german-win-cp1252 (cp1252)
							de-u8  german-utf8 (rudimentary support for utf-8)
							en     english (iso-8859-1)
							en-win english-win-cp1252 (cp1252)
							en-u8  english-utf8 (rudimentary support for utf-8)
							ru     russian (iso-8859-5)
							ru-win russian-win-cp1251 (cp1251)
							ru-u8  russian-utf8 (rudimentary support for utf-8)
			-S           enable end-of-sentence detection
			-E <mark>    specify EOS-mark (default: "")
			-n           treat a new line as EOS
			-N           treat two or more new lines (paragraph break) as EOS
			-c           combine continuation I: hyphenated words on line breaks
							will be put together. The hyphen is skipped.
			-C           combine continuation II: same as above, but the hyphen is
							preserved. This may be a good option if you know that
							there are no hyphenated words, but `bindestrichwoerter'
							(like end-of-sentence) in your text.
			-W           detect www-addresses and treat them as one token
			-i | -l      convert all tokens to lowercase
							(according to language settings, not for utf-8)
			-s           single line mode: each token on a separate line
			-X <sep>     use <sep> as separator in single line mode instead of
							a newline. Original newlines are preserved, because
							putting the whole input on one line isn't a good idea
			-p           paragraph mode:
							two or more newlines are interpreted as a paragraph
							break, a single newline is not. All lines
							of one paragraph are collected in one line
			-P           prints each sentence on a separate line
			-x           print spaces:
							In single line mode horizontal spaces will be printed
							as one space on a single line, vertical spaces as
							two line breaks.
							In paragraph mode (including combination with -P)
							an additional newline is inserted between paragraphs
			-h | -?      print this help and exit

		Other arguments will be read as input filenames.
		If no input files are given, input is read from stdin.
		If no output file is given, the tokenized text is written to stdout

		WARNING: When the input contains very long words or many consecutive
		newlines, tokenizer stops with "input buffer overflow". To avoid this,
		use putzer (included in this package) with option -m <n> as a filter!

		tokenizer, v1.0, Sebastian Nagel (wastl@cis.uni-muenchen.de)
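	The combine-continuation options -c and -C can be illustrated with a
	small re-implementation sketch in Python. This is not the tool's actual
	flex code, and the exception list below is a made-up placeholder for
	the real one:

	```python
	import re

	# Hypothetical stand-in for the tool's small exception list of
	# words that legitimately end in a hyphen and must not be joined.
	EXCEPTIONS = {"und-", "oder-"}

	def combine_continuations(text, keep_hyphen=False):
	    """Rejoin words split by a hyphen at a line break (tokenizer -c / -C)."""
	    def join(match):
	        if match.group(1) + "-" in EXCEPTIONS:
	            return match.group(0)            # leave the line break as-is
	        hyphen = "-" if keep_hyphen else ""  # -C keeps the hyphen, -c drops it
	        return match.group(1) + hyphen + match.group(2)
	    return re.sub(r"(\w+)-\n(\w+)", join, text)

	print(combine_continuations("a hyphen-\nated word"))     # -c behaviour
	print(combine_continuations("end-\nof-sentence", True))  # -C behaviour
	```

	With -c the first example becomes "a hyphenated word"; with -C the
	second stays "end-of-sentence", which is why -C suits texts containing
	`bindestrichwoerter' but no real hyphenation.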

	=============================================================================


	putzer		--  clean files: remove double spaces, spaces at the beginning
					or end of a line, double newlines, and more (see options)

		putzer [OPTIONS] [FILES]

		options:
			-i <file>    input filename
			-o <file>    output filename
			-c           combine continuation: hyphenated words on line breaks
							will be put together
			-l           convert to lowercase (latin-1)
			-m <n>       maximal word length in chars:
							longer words will be stripped
			-q           quiet: don't report errors
			-h | -?      print this help and exit
		Other arguments will be read as input filenames.
		If no input files are given, input is read from stdin.
		If no output file is given, the cleaned text is written to stdout

		putzer, Sebastian Nagel (wastl@cis.uni-muenchen.de)
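	The -m <n> filter, which tokenizer's buffer-overflow warning recommends,
	can be approximated in a few lines of Python. This is a simplified
	sketch, not putzer's actual implementation (the real tool also squeezes
	spaces and newlines):

	```python
	def strip_long_words(line, max_len):
	    """Drop words longer than max_len chars, as putzer -m <n> does."""
	    return " ".join(w for w in line.split() if len(w) <= max_len)

	print(strip_long_words("short veryveryverylongword ok", 10))  # "short ok"
	```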

	===========================================================================

	htmlEnt2Char -- converts HTML-entities into characters

		htmlEnt2Char [OPTIONS] [FILES]
		options:
			-C <enc>     output encoding, currently supported:
								l1 lat1 latin1 iso-8859-1
								u8 utf-8 (default)
			-o <file>    output filename
			-f           force: skip misspelled entities or
						entities not printable in the given charset
						(see also -r or -R)
			-r <char>    replace unrecognized/unprintable entities
						by <char>
			-R <num>     replace unrecognized/unprintable entities
						by a character given as <num>, a Unicode code point.
						Interpretation of <num> follows the C convention:
						0x.... for hexadecimal numbers
						0....  for octal numbers
						....   for decimal numbers
			-q           quiet: don't report errors, misspelled
							entities etc.
			-h | -?      print this help and exit
		Other arguments will be read as input filenames.
		If no input files are given, input is read from stdin.
		If no output file is given, the text with replacements
		is written to stdout.

		htmlEnt2Char, Sebastian Nagel (wastl@cis.uni-muenchen.de)
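	The effect of htmlEnt2Char, and the C-style interpretation of <num>
	for -R, can be mimicked in Python. html.unescape and the hand-rolled
	number parser below are stand-ins, not the tool's code:

	```python
	import html

	def entities_to_chars(text):
	    """Convert HTML entities to characters, as htmlEnt2Char does."""
	    return html.unescape(text)

	def parse_replacement(num):
	    """Parse a -R argument with C conventions: 0x.. hex, 0.. octal, else decimal."""
	    if num.startswith(("0x", "0X")):
	        return chr(int(num, 16))
	    if num.startswith("0") and len(num) > 1:
	        return chr(int(num, 8))
	    return chr(int(num, 10))

	print(entities_to_chars("K&ouml;nig &amp; Co."))  # "König & Co."
	print(parse_replacement("0x2022"))                # U+2022 BULLET
	```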

	===========================================================================

Features:

	1. customizable through options
		- language and codepage
		- try to undo hyphenation
		- semantics of line breaks (paragraph separator or not)
		- etc.

	2. problems and strategies for tokenization
		- hyphenated words are considered one token
		- option -c concatenates words with a hyphen at end-of-line.
			This may cause errors, although a small exception list is defined

	3. end-of-sentence detection:
		- positive:
			* end-of-sentence marker followed by a blank and an uppercase letter
		- negative:
			* abbreviations (except for, e.g., "etc.", which often occurs at EOS)
			* dates
		- positive:
			* a negative followed by a word usually used exclusively at BOS
				* capitalized determiners, conjunctions, etc.
		- try to handle additional punctuation symbols following the full stop
			correctly (brackets, apostrophes etc.)
		- tests on the Brown corpus indicate an error rate of about 3%
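	The interplay of the positive and negative rules above can be sketched
	in Python. The abbreviation and BOS-word lists here are tiny
	placeholders, not the tool's real ones, and the real flex rules are
	considerably more refined:

	```python
	import re

	ABBREVIATIONS = {"St.", "Dr.", "z.B."}         # tiny placeholder list
	BOS_WORDS = {"A", "The", "Der", "Die", "Das"}  # words typically sentence-initial

	def split_sentences(text):
	    """EOS sketch: split after '.' + blank + uppercase letter, unless the
	    preceding token is an abbreviation -- unless, in turn, the following
	    word usually starts a sentence (the positive override)."""
	    sentences, start = [], 0
	    for m in re.finditer(r"(\S+)\s+(?=[A-Z])", text):
	        token = m.group(1)
	        if not token.endswith("."):
	            continue
	        next_word = text[m.end():].split()[0].rstrip(".,;")
	        if token in ABBREVIATIONS and next_word not in BOS_WORDS:
	            continue                  # negative rule wins: no EOS here
	        sentences.append(text[start:m.end()].strip())
	        start = m.end()
	    if text[start:].strip():
	        sentences.append(text[start:].strip())
	    return sentences

	wsj = ("The firm said it plans to sublease its current headquarters "
	       "at 55 Water St. A spokesman declined to elaborate.")
	print(split_sentences(wsj))
	```

	On the Wall Street Journal example from the version history, "St." is
	negative as an abbreviation, but the capitalized "A" triggers the
	positive override, so the text splits into two sentences.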

	===========================================================================

Version history:

	0.1 -- package with tokenizer, putzer, htmlEnt2Char

	0.2 -- bug reported by js: when the input contains long words or many consecutive
		newlines, tokenizer stops with "input buffer overflow". To avoid this, use
		putzer as a filter with the newly introduced option -m!

	0.3 -- optimization (inlines & macros): now about 10% faster

	0.4 -- corrected some details in German EOS-detection; changed behaviour with option -sx:
		when a newline is recognized, a space is printed on a separate line
		instead of an empty line.

	0.5 -- ':' is no longer considered an EOS-mark. Additions to the German abbreviation list.

	0.6 -- Added more German abbreviations, Roman numerals with point.
		Added rudimentary support for utf-8 in German.

	0.7 -- Better EOS for English, thanks to Michaela Geierhos;
		rudimentary support for utf-8 in English

	0.8 -- fixed a bug in the Russian part that made the tokenizer hang

	0.9 -- changes to German abbreviations
		rudimentary support for utf-8 in Russian

	0.10 -- fixed a bug raising a segfault for the German language option;
			short sequences in parentheses are excluded from containing an end-of-sentence;
			additions to German abbreviations

	0.11 -- fixed a bug with options -C and -c.
			Introduced positive rules for German EOS: i.e. if a capitalized article, conjunction,
			or preposition follows an abbreviation or date, there should be an EOS.

	0.12 -- better documentation (in English).
			Positive rules also for English: the text "The firm said it
			plans to sublease its current headquarters at 55 Water St. A
			spokesman declined to elaborate." (Wall Street Journal) is now
			correctly split into two sentences

	1.0  -- (almost) no changes
			GPL licensed now

 
