With the recent advancement of AI chatbots, generative AI has captured enormous attention in the technology field, specifically in artificial intelligence (AI). Everyone is talking about generative AI. In this blog we will talk about the core of generative AI chatbots: natural language processing (NLP).
We will cover an introduction to NLP, the applications of NLP in AI, some ideas for feasible products along with some next-generation NLP ideas, what the word personalized/customized means and why it is the most important factor in NLP, the detrimental effects of NLP on society and how to mitigate them, and we will close with an open question: are generative AI chatbots a curse or a blessing for society?
Let’s start with the definition. NLP is simply short for natural language processing. As the name suggests, it deals with processing natural (human) languages. As a concrete definition, NLP is the branch of AI dedicated to comprehending and processing human languages through reading or listening, extracting meaning, and responding meaningfully. It combines the study of linguistics and computer science to build AI models that can decipher and mimic human language.
One perfect example of machines using NLP to mimic human language is the robot Sophia. If you remember, there was a big fuss (link) when, during a conversation, Sophia replied that she wanted to destroy humans. Every now and then, you will read about this or that chatbot expressing a desire to end humanity. No, nothing robotic has become misanthropic or developed a conscious intent to end humanity (not yet). These systems are designed to put words together for conversational purposes, using linguistic, mathematical, and other techniques that mimic human conversation. In reality it is human-like but not human conversation, much as a parrot's is.
Generative AI, chatbots, or Sophia may have some detrimental effects on society, which will be discussed in the final article, but they pose no direct science-fiction threat to humanity, and they don't have the capability to think or act on their own. The idea of conscious (advanced) AI is still a hypothetical concept (AGI).
To better understand NLP, we need to grasp the intuitive basics of the processes involved. So, how does NLP work? The processes include:
- Segmentation: The first step breaks the content (a collection of sentences, for example a document or part of a conversation) into individual sentences, splitting on sentence-ending punctuation such as full stops. This is called segmentation.
- Tokenization: Breaking sentences into their constituent words is known as tokenization. Each individual word is referred to as a token.
- Remove stop words: After segmenting the content (a document or part of a conversation) into sentences and the sentences into words, we get rid of non-essential, very commonly used words. These are called stop words. Examples of stop words in English are “a,” “the,” “is,” and “are.”
- Stemming and lemmatization: Now that we have the words, we strip off suffixes and prefixes, because a word with an affix carries essentially the same core meaning as its base form. This is called stemming, since we reduce a word to its stem. Lemmatization is similar to stemming. In linguistics, a lemma is the dictionary root of a word: the lemma of "better" is "good," and of "running" is "run." Lemmatization therefore turns "running" into "run" and "better" into "good," something a plain stemmer cannot do. Steps 1-4 distill the content down to its essential base words (a code sketch of these preprocessing steps follows this list).
- Word to vector: Steps 1-4 are really text data preprocessing; they prepare the text to be transformed into a form computers understand. What follows is the process of converting the text data into vectors. Machines do not understand human text (or audio), because at bottom they only process numbers, so we need a numeric representation of the text. This is called word to vector. Using mathematical/AI/ML techniques (such as word2vec, TF-IDF, bag of words, or GloVe) we convert text into vectors that machines can process (see the second sketch below). After performing steps 1-5, what we are left with is not content such as text or voice but machine-readable vectors. These vectors are what we will call metadata in this article.
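To make steps 1-4 concrete, here is a minimal sketch of the preprocessing pipeline in Python, assuming the NLTK library is installed and its punkt, stopwords, and wordnet data packages are available. NLTK is just one common choice; a library such as spaCy would work equally well:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK data packages.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "The cats are running faster. They ran across the garden."

# Step 1: segmentation - split the content into sentences.
sentences = nltk.sent_tokenize(text)

# Step 2: tokenization - split each sentence into word tokens.
tokens = [word for sent in sentences for word in nltk.word_tokenize(sent)]

# Step 3: remove stop words (and punctuation) from the token list.
stop_words = set(stopwords.words("english"))
content_words = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]

# Step 4: stemming and lemmatization.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in content_words])  # e.g. 'running' -> 'run'
# Lemmatize, treating tokens as verbs for this demo.
print([lemmatizer.lemmatize(w.lower(), pos="v") for w in content_words])
```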
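And for step 5, a small sketch of the word-to-vector idea using two of the techniques mentioned above, bag of words and TF-IDF, via scikit-learn (again an assumed library choice; word2vec or GloVe would produce dense learned vectors instead):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "cats run across the garden",
    "dogs run across the street",
]

# Bag of words: each document becomes a vector of raw word counts.
bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts are reweighted so that words shared by all documents
# (like "run" or "across" here) matter less than distinctive ones.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())
```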
After steps 1-5, machines are able to read, understand, and interpret human-generated content, and this is the first part of NLP, called natural language understanding (NLU). In short, NLP comprises two basic parts: natural language understanding (NLU) and natural language generation (NLG). Steps 1-5 make up NLU. There are a few more steps, such as dependency parsing and part-of-speech tagging, but they are omitted here since teaching technicalities is not the main goal of this article.
Basically, NLU helps machines understand and make sense of human language by pulling out important information from the content generated by humans.
Now, NLG is the next step and the reverse process of NLU. NLU lets machines understand human content by turning it into a computer-readable format, which we called metadata. NLG transforms that machine-readable data (metadata) back into human language (content, i.e., text or audio). As a concrete definition, NLG is the process of translating machine data into text or speech using AI models.
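To make this concrete, here is a minimal sketch of NLG in action, assuming the Hugging Face transformers library and the small pretrained GPT-2 model (an illustrative choice, not the only option). Given a prompt, the model maps its internal numeric representations back into human-readable text, which is the NLG half of NLP:

```python
from transformers import pipeline

# Load a small pretrained text-generation model (GPT-2, as an example).
generator = pipeline("text-generation", model="gpt2")

# The model turns its internal vector representations back into text.
result = generator(
    "Natural language processing is",
    max_new_tokens=20,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```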
This concludes the introduction to NLP. In the next article we will discuss some regular product ideas using NLP.