20.3 The Challenge of Human Language#
Understanding human language is much harder than processing numbers or images. Why? Because language is messy.
Key Challenges in NLP#
Despite advancements, NLP faces several challenges due to the complexity of human language:
1. Ambiguity#
| Type | Description | Example |
|---|---|---|
| Lexical Ambiguity | Words with multiple meanings. | “Bank” (financial institution vs. river edge) |
| Syntactic Ambiguity | Sentence structures with multiple interpretations. | “I saw the man with the telescope” (who had the telescope?) |
| Semantic Ambiguity | Phrases requiring context for disambiguation. | “Time flies like an arrow” (time passes quickly vs. insects called “time flies” are fond of an arrow?) |
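One classic way to approach lexical ambiguity is to pick the sense whose dictionary gloss shares the most words with the surrounding sentence (a simplified Lesk-style overlap). The sketch below is illustrative only: the sense inventory `SENSES`, its glosses, and the stopword list are hand-written assumptions, not taken from a real lexicon.

```python
import re

# Hypothetical, hand-written sense inventory for the word "bank".
SENSES = {
    "financial institution": "an institution that accepts deposits and lends money to customers",
    "river edge": "the sloping land along the side of a river or stream of water",
}

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "that", "i", "my", "at", "on", "in"}

def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most words with the sentence."""
    context = tokens(sentence)
    # On a tie, max() keeps the first sense in the dictionary.
    return max(senses, key=lambda s: len(context & tokens(senses[s])))

print(disambiguate("I need to deposit money at the bank"))
print(disambiguate("We sat on the bank of the river and watched the water"))
```

Word overlap is brittle: "deposit" does not match "deposits" here, and a sentence with no overlapping words falls back to an arbitrary sense, which is part of why ambiguity remains hard.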
2. Context Understanding#
The same word can change meaning with the situation. E.g., “cold” is a good thing for ice cream but usually a complaint for soup.
Human language depends on tone, sarcasm, and cultural references, which are hard for machines to interpret.
Example:
Sincere: “That was a great movie!” (positive)
Sarcastic: “That was a great movie…” (negative, with ironic tone)
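To see why tone is hard for machines, consider a naive word-counting sentiment scorer. This is a minimal sketch: the `LEXICON` below is a tiny hand-written assumption, and `lexicon_score` is a hypothetical helper, not a real library function.

```python
import re

# Tiny, hand-written polarity lexicon (an assumption for illustration).
LEXICON = {"great": 1, "good": 1, "love": 1, "bad": -1, "terrible": -1, "boring": -1}

def lexicon_score(text):
    """Sum the polarity of every known word; tone and punctuation are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

sincere = "That was a great movie!"
sarcastic = "That was a great movie…"

# Both sentences contain the same words, so they get the same score.
print(lexicon_score(sincere), lexicon_score(sarcastic))
```

Both sentences score as positive because the scorer sees only the words, not the ironic tone that a human reader picks up.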
3. Slang & Informal Text#
Informal text, such as chat messages and social media posts, is full of slang, emojis, and abbreviations (e.g., “BRB,” “LOL”) that models must handle.
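A common first step is to normalize informal tokens before further processing. The sketch below assumes two small hand-written lookup tables (`ABBREVIATIONS` and `EMOJI`); a real system would need much larger tables that are kept up to date.

```python
# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"brb": "be right back", "lol": "laughing out loud"}
EMOJI = {"😂": "face with tears of joy", "👍": "thumbs up"}

def normalize(text):
    """Replace known emojis and abbreviations with plain-word descriptions."""
    for symbol, description in EMOJI.items():
        text = text.replace(symbol, f" {description} ")
    # Whitespace split: an abbreviation with attached punctuation ("LOL!") is not matched.
    words = [ABBREVIATIONS.get(w.lower(), w) for w in text.split()]
    return " ".join(words)

print(normalize("BRB 😂 that was great LOL"))
```

Dictionary lookup only covers what is already in the tables, which is exactly the difficulty: slang changes quickly and the same abbreviation can mean different things in different communities.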
4. Data Sparsity & Rare Words#
Problem: Many words appear infrequently in training data, making it hard for models to learn their meanings.
Solutions:
Subword Tokenization: Break words into smaller units (e.g., Byte Pair Encoding)
Pre-trained Embeddings: Use models like Word2Vec or GloVe
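Byte Pair Encoding can be sketched in a few lines: start from characters and repeatedly merge the most frequent adjacent pair of symbols. The code below is a simplified illustration under stated assumptions: the toy word counts, the `</w>` end-of-word marker, and the function names are chosen for this example, and production tokenizers differ in many details.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} dict."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges

def segment(word, merges):
    """Split a (possibly unseen) word by applying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Toy corpus (made-up frequencies for illustration).
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))
```

With this toy corpus, the word "lowest" never appears in training, yet it is split into the pieces `low` and `est</w>`, both learned from other words. This is how subword units let a model handle rare or unseen words.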
5. Multilingual & Low-Resource Languages#
Problem: Most NLP models are trained on high-resource languages (e.g., English, Chinese), which biases them toward those languages and leaves many others underrepresented.
Solutions:
Multilingual Models: mBERT, XLM-R
Transfer Learning: Adapt pre-trained models
6. Morphological Complexity#
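Different languages have different grammar rules and build words in very different ways, so methods designed around English do not always carry over.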
| Language | Challenge | Example (Analysis) |
|---|---|---|
| Finnish | Agglutinative morphology | “Talossanikin” = “talo-ssa-ni-kin” (house-in-my-too) |
| Arabic | Root-and-pattern system | “كتب” (k-t-b) → “kataba” (wrote), “kitāb” (book) |
| Turkish | Vowel harmony | “evlerimizde” = “ev-ler-imiz-de” (house-plural-our-in); the suffix vowels match the front vowel of “ev” |
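Vowel harmony can be illustrated with the Turkish plural suffix, which surfaces as -ler after a front vowel and -lar after a back vowel. The function below is a minimal sketch that assumes lowercase input and ignores exceptions (for example, some loanwords); `plural` is a hypothetical helper, not part of any library.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")

def plural(noun):
    """Attach -ler or -lar depending on the last vowel of a lowercase Turkish noun."""
    for ch in reversed(noun):
        if ch in FRONT_VOWELS:
            return noun + "ler"
        if ch in BACK_VOWELS:
            return noun + "lar"
    raise ValueError(f"no vowel found in {noun!r}")

for noun in ["ev", "kitap", "göz", "okul"]:
    print(noun, "->", plural(noun))
```

The same stem-dependent choice applies to the other suffixes in “evlerimizde”, so a system that treats each surface form as a separate word sees many variants of what is really one suffix.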