- CodeCraft by Dr. Christine Lee
Extract Keywords Like a Pro
Build Your Own Keyword Extractor with Python!

Welcome back, coding enthusiasts!
In our previous posts, we've explored various NLP techniques, from sentiment analysis to language translation. Today, we're diving into the world of keyword extraction.
This project will guide you through the process of building your own keyword extractor using Python and the RAKE (Rapid Automatic Keyword Extraction) algorithm.
Let's get started!
Introduction
Ever wanted to pull the most important keywords out of a text? Keyword extraction identifies the words or phrases that best capture what a text is about.
With Python and the RAKE algorithm, you can build a simple yet powerful keyword extractor. This project walks you through the steps to create one using the rake-nltk library.
Step-by-Step Guide
1. Install the Required Libraries
First, ensure you have the nltk and rake-nltk libraries installed. Open your terminal or command prompt and run the following commands:
pip install nltk
pip install rake-nltk
2. Import the Libraries
Next, import the nltk module and the Rake class from the rake_nltk package.
import nltk
from rake_nltk import Rake
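3. Download the Required NLTK Resources
Download the NLTK data that RAKE relies on: punkt, used to split text into individual sentences, and stopwords, which provides the default English stopword list.
nltk.download('punkt')
nltk.download('stopwords')
Note: recent NLTK releases look for punkt_tab instead of punkt. If you see a LookupError that mentions it, run nltk.download('punkt_tab') as well.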
4. Initialize RAKE with Stopwords for English
Initialize the RAKE object. RAKE uses stopwords to ignore common words and focus on significant keywords.
rake = Rake()
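To make the role of stopwords concrete, here is a minimal pure-Python sketch of the idea: stopwords and punctuation act as delimiters, and the runs of words left between them become candidate phrases. The tiny STOPWORDS set and the candidate_phrases helper are assumptions made for this illustration; they are not part of rake-nltk, which relies on NLTK's full English stopword list and tokenizers.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only.
STOPWORDS = {"a", "an", "and", "are", "be", "can", "in", "of",
             "that", "the", "these", "to", "with"}


def candidate_phrases(sentence):
    """Split a sentence into candidate phrases at stopwords and punctuation."""
    # Lowercase, then keep words and punctuation marks as separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    phrases, current = [], []
    for token in tokens:
        is_word = re.match(r"\w", token) is not None
        if is_word and token not in STOPWORDS:
            current.append(token)
        elif current:
            # A stopword or punctuation mark closes the current phrase.
            phrases.append(tuple(current))
            current = []
    if current:
        phrases.append(tuple(current))
    return phrases


print(candidate_phrases("These machines can be trained to perform specific tasks."))
```

With this toy stopword set, the sentence yields three candidate phrases: "machines", "trained", and "perform specific tasks". A real stopword list would split the text at many more words.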
5. Define the Text to Extract Keywords From
Now, define the text you want to extract keywords from.
text = """
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. These machines can be trained to perform specific tasks, recognize patterns, and make decisions with minimal human intervention.
"""
6. Extract Keywords
Use the RAKE object to extract keywords from the text.
rake.extract_keywords_from_text(text)
7. Get Ranked Keywords
Get the keywords ranked by their relevance.
keywords = rake.get_ranked_phrases()
print("Keywords:", keywords)
Complete Code
Here's the complete code for your keyword extractor:
import nltk
from rake_nltk import Rake
# Download resources: sentence tokenizer data and the stopword list
nltk.download('punkt')
nltk.download('stopwords')
# Initialize RAKE with stopwords for English
rake = Rake()
# Define the text
text = """
Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think and learn like humans. These machines can be trained to perform specific tasks, recognize patterns, and make decisions with minimal human intervention.
"""
# Extract keywords
rake.extract_keywords_from_text(text)
keywords = rake.get_ranked_phrases()
print("Keywords:", keywords)
Explanation
Step 1: Install the Required Libraries
This step ensures that the nltk and rake-nltk libraries are installed on your system.
Step 2: Import the Libraries
Here, you import the nltk module and the Rake class from the rake-nltk library.
Step 3: Download the Required NLTK Resources
The punkt resource is used to split the text into sentences. RAKE's default stopword list also comes from NLTK's stopwords corpus, so that corpus must be available as well (nltk.download('stopwords') fetches it).
Step 4: Initialise RAKE with Stopwords for English
RAKE is initialised with the default stopwords for the English language. Stopwords are common words that are typically ignored in text analysis.
Step 5: Define the Text to Extract Keywords From
In this step, you define the text from which you want to extract keywords. The text can be any block of text you want to analyse.
Step 6: Extract Keywords
The extract_keywords_from_text method processes the text and identifies keywords.
Step 7: Get Ranked Keywords
The get_ranked_phrases method retrieves the keywords and ranks them based on their relevance. These keywords are then printed.
How the Methods Work
Here's a breakdown of how extract_keywords_from_text() and get_ranked_phrases() work within the rake_nltk library:
1. extract_keywords_from_text(text):
Purpose: This method takes a block of text as input and processes it to identify and score candidate keywords, following the RAKE (Rapid Automatic Keyword Extraction) algorithm.
Steps:
Sentence Segmentation: The input text is first divided into individual sentences.
Word Tokenization: Each sentence is broken down into individual lowercase tokens (words and punctuation marks).
Stop Word Splitting: Common words that don't carry much meaning (like "the," "a," "an") are called "stop words." They are not simply deleted: together with punctuation, they mark the boundaries between phrases.
Phrase Creation: Each unbroken run of words between two boundaries becomes a candidate phrase. RAKE does not analyse grammar; it relies only on where the stop words and punctuation fall.
Phrase Weight Calculation: Each word gets a score based on its frequency (how often it appears) and its degree (how many words it appears alongside in candidate phrases, itself included). With the default metric, the word score is degree divided by frequency, and a phrase's "weight" is the sum of its word scores, so longer phrases made of co-occurring words tend to rank higher.
2. get_ranked_phrases():
Purpose: This method retrieves the ranked list of keywords (phrases) identified by extract_keywords_from_text().
How it Works:
It retrieves the phrases and their calculated weights that were stored internally during the extract_keywords_from_text() process.
It sorts these phrases in descending order based on their weights.
It returns the sorted list of phrases, with the highest-weighted phrases appearing first. This represents the most important keywords extracted from the text.
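As a rough illustration of the scoring and sorting described above, the sketch below scores a hand-written list of candidate phrases with the degree-to-frequency ratio from the RAKE algorithm, which, as far as we know, is also the default metric in rake-nltk. The score_phrases helper and the sample phrases are assumptions made for this example; this is not the library's actual code.

```python
from collections import defaultdict


def score_phrases(phrases):
    """Return (score, phrase) pairs sorted from highest to lowest score."""
    frequency = defaultdict(int)  # how many times each word appears
    degree = defaultdict(int)     # co-occurrences within phrases, the word itself included
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)

    ranked = []
    for phrase in set(phrases):  # score each distinct phrase once
        score = sum(degree[word] / frequency[word] for word in phrase)
        ranked.append((score, " ".join(phrase)))
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked


# Hypothetical candidate phrases, as tuples of lowercase words.
phrases = [
    ("human", "intelligence"),
    ("machines",),
    ("minimal", "human", "intervention"),
    ("machines",),
]
for score, phrase in score_phrases(phrases):
    print(f"{score:.1f}  {phrase}")
```

Here "minimal human intervention" scores 8.5, "human intelligence" 4.5, and "machines" 1.0. The single word "machines" appears twice but never alongside other words, so its degree equals its frequency and its score stays at 1.0.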
In essence, the process can be summarised as follows:
Break down the text: Divide the text into sentences and then words.
Identify candidate phrases: Split the word sequence at stop words and punctuation; the runs of words in between are the phrases.
Score the phrases: Score each word (by default, its degree divided by its frequency) and sum the word scores within each phrase.
Rank the phrases: Sort the phrases by their scores, with the most important phrases at the top.
Output
Running the complete code prints the label "Keywords:" followed by a Python list of the extracted phrases in lowercase, ordered from highest to lowest score.

Summary
In this project, you've learned how to extract keywords from a block of text using the RAKE algorithm. We started by installing the necessary libraries, then moved on to importing them and downloading the required NLTK resource. You learned how to initialise the RAKE object, define the text for keyword extraction, and finally, extract and rank the keywords. This process highlights the power and simplicity of using Python and NLP libraries to perform keyword extraction, making it easier to identify the most important parts of any text.
Coding with a Smile
Tuple Trouble: Tuples are great until you try to change one of their values. Then they become that stubborn friend who refuses to change their mind no matter what you say. Immutable means immutable, deal with it!
Keep learning, keep coding, and keep discovering new possibilities!
Enjoy your journey into artificial intelligence, machine learning, data analytics, data science and more with Python!
Stay tuned for our next exciting project in the following edition!
Happy coding!