A Summary of Classic NLP Concepts

Text preprocessing: the task of converting a raw text file, essentially a sequence of digital bits, into a well-defined sequence of linguistically meaningful units.

Text preprocessing is a fundamental step in NLP. Its main job is to identify characters, words, and sentences. It can be divided into two stages: document triage and text segmentation.

Document triage converts a file into well-defined text. It consists of the following three steps:

Step 1: Character encoding identification

Step 2: Language identification

Step 3: Text sectioning: identify the useful body of the text and remove unneeded elements such as figures, tables, links, and HTML tags.
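The three triage steps can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the candidate encoding list, the script-based language guess, and the names `detect_encoding`, `guess_language`, `BodyTextExtractor`, and `triage` are all hypothetical, and a real system would use statistical detectors rather than these heuristics.

```python
from html.parser import HTMLParser

# Candidate encodings to try, in order (an assumption made for this sketch).
CANDIDATE_ENCODINGS = ["utf-8", "big5", "gb18030"]


def detect_encoding(raw: bytes) -> str:
    """Step 1: return the first candidate encoding that decodes without error."""
    for name in CANDIDATE_ENCODINGS:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


def guess_language(text: str) -> str:
    """Step 2: crude script-based guess, 'zh' if any CJK ideograph appears."""
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    return "zh" if has_cjk else "en"


class BodyTextExtractor(HTMLParser):
    """Step 3: keep visible text, drop tags and script/style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def triage(raw: bytes):
    """Run the three steps and return (encoding, language, body text)."""
    encoding = detect_encoding(raw)
    extractor = BodyTextExtractor()
    extractor.feed(raw.decode(encoding))
    extractor.close()
    body = " ".join(extractor.parts)
    return encoding, guess_language(body), body
```

Trying encodings in a fixed order only works when the first candidates are strict, as UTF-8 is; a permissive encoding placed first would accept almost any byte sequence.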

Text segmentation converts text into words and sentences. It consists of the following parts:

1) Word segmentation, also called tokenization.

2) Text normalization, for example reducing "Mr.", "Mr", "mister", and "Mister" to a single form.

3) Sentence segmentation, that is, splitting the text into sentences.
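A minimal sketch of the three segmentation parts for English text follows. The normalization table, the abbreviation list, and the function names are assumptions chosen for illustration; production tokenizers handle many more cases (numbers, URLs, quotations, languages without whitespace).

```python
import re

# Hypothetical normalization table: variant spellings mapped to one form.
NORMALIZATION = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}

# Abbreviations whose trailing period does not end a sentence (illustrative).
ABBREVIATIONS = {"mr.", "mrs.", "dr."}


def tokenize(text):
    """Word segmentation: split on whitespace, then peel off punctuation."""
    tokens = []
    for piece in text.split():
        if piece.lower() in ABBREVIATIONS:
            tokens.append(piece)  # keep "Mr." as a single token
            continue
        # Runs of word characters, or single punctuation marks.
        tokens.extend(re.findall(r"\w+|[^\w\s]", piece))
    return tokens


def normalize(tokens):
    """Text normalization: map known variants to their canonical form."""
    return [NORMALIZATION.get(tok.lower(), tok) for tok in tokens]


def split_sentences(tokens):
    """Sentence segmentation: close a sentence at '.', '!' or '?' tokens."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

Because the tokenizer keeps "Mr." as one token, the sentence splitter never sees its period as a standalone "." and so does not break the sentence there.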

A basic task of lexical analysis is to relate morphological variants to their lemma that lies in a lemma dictionary bundled up with its invariant semantic and syntactic information.

A basic task of lexical analysis is lemmatization against a lemma dictionary: for example, the variants {delivers, deliver, delivering, delivered} all reduce to the lemma deliver.
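Dictionary-based lemmatization can be sketched as two lookups: variant to lemma, then lemma to its bundled information. The dictionary contents and the `lemmatize` helper below are hypothetical and cover only the example word.

```python
# Hypothetical lemma dictionary: each lemma carries its invariant
# syntactic and semantic information.
LEMMA_DICTIONARY = {
    "deliver": {"pos": "verb", "sense": "bring something to a recipient"},
}

# Morphological variants mapped to their lemma (illustrative entries only).
VARIANT_TO_LEMMA = {
    "deliver": "deliver",
    "delivers": "deliver",
    "delivering": "deliver",
    "delivered": "deliver",
}


def lemmatize(word):
    """Return (lemma, entry); unknown words map to themselves with no entry."""
    lemma = VARIANT_TO_LEMMA.get(word.lower())
    if lemma is None:
        return word, None
    return lemma, LEMMA_DICTIONARY[lemma]
```

A full system would replace the explicit variant table with morphological rules or a finite-state analyzer, since listing every inflected form does not scale.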

Part-of-speech tagging is another important application of lexical analysis; its output is commonly used as the input to subsequent syntactic parsing.
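As a rough illustration of what a tagger produces, here is a toy lexicon-lookup tagger using Penn Treebank style tags. The lexicon and the fallback rules are assumptions made for this sketch; real taggers are statistical and use context to resolve ambiguous words.

```python
# Hypothetical tag lexicon with Penn Treebank style tags.
TAG_LEXICON = {"delivers": "VBZ", "toy": "NN", "to": "TO", "the": "DT"}


def tag(tokens):
    """Tag each token by lexicon lookup, with simple fallbacks for unknowns."""
    tagged = []
    for tok in tokens:
        if tok.lower() in TAG_LEXICON:
            label = TAG_LEXICON[tok.lower()]
        elif tok[0].isupper():
            label = "NNP"  # unknown capitalized word: guess proper noun
        else:
            label = "NN"   # unknown lowercase word: guess common noun
        tagged.append((tok, label))
    return tagged
```

The resulting list of (word, tag) pairs is the kind of input a downstream parser or chunker would consume.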

Syntactic parsing: basic techniques for grammar-driven natural language parsing, that is, analyzing a string of words (typically a sentence) to determine its structural description according to a formal grammar.

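Syntactic parsing, a grammar-driven analysis of sentences, comprises two tasks: phrase structure parsing and dependency parsing.

Phrase structure parsing aims to divide a sentence into its structural units (constituents).

Dependency parsing aims to uncover the grammatical dependency relations between words, such as subject and predicate.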

The figure below shows the difference between the two tasks.
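The contrast between the two outputs can also be sketched as data structures. Both analyses below are hand-written for the illustrative sentence "Santa delivers toys" (they are not produced by a parser), and the relation labels `nsubj` and `obj` follow Universal Dependencies naming as an assumption.

```python
# Phrase structure: nested (label, children...) tuples; leaves are words.
PHRASE_TREE = (
    "S",
    ("NP", "Santa"),
    ("VP", ("V", "delivers"), ("NP", "toys")),
)

# Dependencies: (head, relation, dependent) triples between words.
DEPENDENCIES = [
    ("delivers", "nsubj", "Santa"),
    ("delivers", "obj", "toys"),
]


def leaves(tree):
    """Return the words under a phrase-structure node, left to right."""
    _label, *children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def constituents(tree):
    """List (label, text) for every node in the tree, top-down."""
    label, *children = tree
    found = [(label, " ".join(leaves(tree)))]
    for child in children:
        if not isinstance(child, str):
            found.extend(constituents(child))
    return found


def dependents_of(head):
    """Map each relation of a head word to its dependent."""
    return {rel: dep for h, rel, dep in DEPENDENCIES if h == head}
```

The tree groups words into nested units such as NP and VP, while the dependency list has no phrase nodes at all, only labeled links between individual words.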

Shallow syntactic parsing analyzes the constituents of a sentence, such as subject, predicate, and object.

A chunker is a sentence-partitioning method based on dependency parsing.

For example, the sentence "Santa Claus delivers toy to Child." can be partitioned as follows.

Action: delivers toy to Child

Initiating Actor: Santa Claus

Business Entity: toy

Responding Actor: Child
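The partition above can be reproduced with a small rule-based sketch. Note the assumptions: it starts from hand-assigned part-of-speech tags and uses positional rules around the verb instead of a real dependency parser, and the names `chunk` and `assign_roles` are hypothetical.

```python
# Hand-assigned Penn Treebank style tags for the example sentence.
TAGGED = [
    ("Santa", "NNP"), ("Claus", "NNP"), ("delivers", "VBZ"),
    ("toy", "NN"), ("to", "TO"), ("Child", "NNP"),
]


def chunk(tagged):
    """Merge consecutive noun tokens into NP chunks; others stay single."""
    chunks = []
    for word, tag in tagged:
        kind = "NP" if tag.startswith("NN") else tag
        if kind == "NP" and chunks and chunks[-1][0] == "NP":
            chunks[-1][1].append(word)
        else:
            chunks.append((kind, [word]))
    return chunks


def assign_roles(chunks):
    """Assign the four roles using the position of each NP around the verb."""
    verb_index = next(i for i, (kind, _) in enumerate(chunks)
                      if kind.startswith("VB"))
    roles = {
        # The action spans from the verb to the end of the sentence.
        "Action": " ".join(w for _, words in chunks[verb_index:] for w in words),
        "Initiating Actor": None,
        "Business Entity": None,
        "Responding Actor": None,
    }
    subjects = [c for c in chunks[:verb_index] if c[0] == "NP"]
    if subjects:
        roles["Initiating Actor"] = " ".join(subjects[-1][1])
    after = chunks[verb_index + 1:]
    for i, (kind, words) in enumerate(after):
        if kind != "NP":
            continue
        # An NP introduced by "to" is the recipient; otherwise the object.
        follows_to = i > 0 and after[i - 1][0] == "TO"
        key = "Responding Actor" if follows_to else "Business Entity"
        if roles[key] is None:
            roles[key] = " ".join(words)
    return roles
```

Calling `assign_roles(chunk(TAGGED))` yields the same four segments as the listing above; the rules are tuned to this one sentence pattern and would need a proper parser to generalize.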

In the first edition of the Handbook of Natural Language Processing (2000), Poesio gave the following definition of semantic analysis: The ultimate goal, for humans as well as natural language-processing (NLP) systems, is to understand the utterance—which, depending on the circumstances, may mean incorporating information provided by the utterance into one’s own knowledge base or, more in general performing some action in response to it. ‘Understanding’ an utterance is a complex process, that depends on the results of parsing, as well as on lexical information, context, and commonsense reasoning. . .

To be continued...