Get all the news articles on CAA in a few lines of code

Anusree S
2 min read · Feb 10, 2020

Whenever we hear about a social issue for the first time (with that feeling of having lived under a rock), we wish we could see every side of the story by reading all the news articles about it.

A few lines of code will do the trick. There is a very interesting Python package called newspaper that can extract online news articles. You can install it with:

pip install newspaper3k

Now let’s import all the necessary libraries.

import newspaper
from newspaper import Article
import nltk
import re

There are many online news sources, and we can name the one we want. To read from several sources, build one source object per site. When “memoize_articles” is True (the default), newspaper remembers the articles it has already returned and skips them on later runs. Here it is set to False, so every run returns the full list.

source = newspaper.build('https://www.thequint.com/', memoize_articles=False)
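Reading from several sites could look like the sketch below. It is only an illustration: build_sources and all_articles are hypothetical helper names, https://www.example.com/ is a placeholder for a second site, and the builder parameter is an assumption added so the helper can be tried without network access.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def build_sources(site_urls: Iterable[str],
                  builder: Optional[Callable] = None,
                  memoize: bool = False) -> List:
    """Build one source object per site URL (hypothetical helper)."""
    if builder is None:
        # Imported lazily so the helper can also be exercised with a stand-in builder.
        import newspaper
        builder = newspaper.build
    return [builder(site, memoize_articles=memoize) for site in site_urls]


def all_articles(sources: Iterable) -> Iterator:
    """Yield every article from every source, one source after another."""
    for source in sources:
        yield from source.articles


# Usage (needs network access; the second URL is a placeholder):
# sources = build_sources(['https://www.thequint.com/', 'https://www.example.com/'])
# for article in all_articles(sources):
#     print(article.url)
```

With this shape, the loops that follow can iterate over all_articles(sources) instead of a single source.articles.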

Each news article has several tags that identify its topic. To check whether an article we fetch is related to our topic, we create a list of the relevant tags.

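# Lowercase only: the tags are compared against lowercased tokens,
# so a capitalised entry such as 'Jamia' would never match
key = ['caa', 'nrc', 'jamia', 'npr']
url = []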

Access each article in the feed through “source.articles”. An article's tags can be inferred from its title, which appears in the URL: lowercase the URL, split it on its special characters, and treat the pieces as tags. Convert both lists to sets and compare them with the intersection function. The URL of each article is available as “article.url”. Other attributes such as authors, title, publish date, and text are accessed the same way once the article has been downloaded and parsed. Check the documentation for the full list of attributes.

This method extracts the URLs without downloading the articles, so it is quite fast.

for article in source.articles:
    # Split the lowercased URL on '/', '-' and '?' to get word-like tokens
    a_url = re.split(r'[/\-?]', article.url.lower())
    if set(key).intersection(a_url):
        url.append(article.url)
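To see what the split and the set comparison actually do, here is a small standalone sketch that needs no network access. The helper names url_tokens and url_matches and both example.com URLs are made up for illustration.

```python
import re
from typing import Iterable, List


def url_tokens(url: str) -> List[str]:
    """Split a lowercased URL on '/', '-' and '?' into word-like tokens."""
    return re.split(r'[/\-?]', url.lower())


def url_matches(url: str, keywords: Iterable[str]) -> bool:
    """Return True when at least one keyword appears as a whole URL token."""
    wanted = {k.lower() for k in keywords}
    return not wanted.isdisjoint(url_tokens(url))


# Made-up URLs, used only to show the behaviour.
sample = 'https://www.example.com/news/india/CAA-protest-delhi'
print(url_tokens(sample))
# ['https:', '', 'www.example.com', 'news', 'india', 'caa', 'protest', 'delhi']
print(url_matches(sample, ['caa', 'nrc']))  # True
print(url_matches('https://www.example.com/sports/cricket-score', ['caa', 'nrc']))  # False
```

Because whole tokens are compared, a tag only matches when it stands alone between separators, so 'caa' does not match a longer word that merely contains it.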

To perform NLP operations such as summarizing an article, the article first needs to be downloaded and parsed. The nlp() step relies on NLTK's punkt tokenizer, which you may need to fetch once with nltk.download('punkt').

for article in source.articles:
    article.download()
    article.parse()
    article.nlp()  # extracts the keywords and the summary
    a_key = article.keywords
    if set(key).intersection(a_key):
        url.append(article.url)
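Downloading many articles in a row can hit pages that fail to load, and a URL may already be in the list from an earlier pass. The sketch below shows one way to handle both. The function name collect_by_keywords is hypothetical, and skipping any article that raises an error is an assumption about what you want, not something the library prescribes.

```python
from typing import Iterable, List


def collect_by_keywords(articles: Iterable, keywords: Iterable[str]) -> List[str]:
    """Return URLs of articles whose extracted keywords overlap `keywords`.

    `articles` is assumed to hold objects that behave like newspaper's Article
    (download(), parse(), nlp(), .keywords, .url).
    """
    wanted = {k.lower() for k in keywords}
    matched: List[str] = []
    for article in articles:
        try:
            article.download()
            article.parse()
            article.nlp()
        except Exception:
            # Assumption: a failed article is skipped instead of stopping the run.
            continue
        found = {k.lower() for k in article.keywords}
        if wanted & found and article.url not in matched:
            matched.append(article.url)
    return matched


# Usage (needs network access):
# url = collect_by_keywords(source.articles, key)
```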

Once nlp() has run, article.keywords gives you the associated tags.

# print the full text
print(article.text)
# print the summary
print(article.summary)
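The other attributes mentioned earlier can be read in the same way. As a rough sketch, the hypothetical helper describe below gathers a few of them into a dictionary. It assumes the article has already been downloaded and parsed, and the choice of fields is only an example.

```python
from typing import Any, Dict


def describe(article: Any) -> Dict[str, Any]:
    """Collect a few attributes of an already parsed article into a dict."""
    return {
        'title': article.title,
        'authors': article.authors,          # list of author names
        'published': article.publish_date,   # may be empty when no date is found
        'url': article.url,
        'words': len(article.text.split()),  # rough word count of the body text
    }


# Usage, after article.download() and article.parse():
# print(describe(article))
```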

And there you go!
