Build a Content Aggregator in Python

This project-based lesson teaches you how to use Python and the popular Django framework to create a content aggregator from scratch.

So much content is released online every day that visiting multiple websites and sources to keep up with your favourite subjects can take a lot of time. Content aggregators are popular and effective because they let you view all the most recent news and content in one location.

This lesson will be helpful if you're looking for a project for your portfolio or a way to extend future projects beyond basic CRUD features.

You will discover how to:

Work with RSS feeds
Create a new Django management command
Run your custom command automatically on a schedule
Verify the functionality of your Django app using unit tests

What You'll Create in a Demo

By following this guide, you'll create your own podcast content aggregator in Python, called pyCasts!

The application will be a single web page that lists the most recent shows from The Only Python Podcast and the Talk Python to Me podcast. After finishing this tutorial, you can put what you've learned into practice by adding more podcast feeds to the application.

Project Overview

You'll take the following steps to build the project and display the content to the user:

Set up the project
Build a podcast model
Create the homepage view
Parse a podcast RSS feed
Create a custom Django command
Add more feeds
Schedule jobs with django-apscheduler

You'll walk through each of these steps over the course of this tutorial. First, take a look at the frameworks and libraries you'll use for them.

You'll use the feedparser library to fetch the podcast RSS feeds and parse them in your application. You'll then use the Django ORM to marshal the feed data into a Show model and save the most recent shows to the database.

You could add this code to a script and run it manually at regular intervals, but that would defeat the purpose of using an aggregator to save time. Instead, you'll learn how to use a Django feature called a custom management command, which lets Django run your code to parse and save the data.

The django-apscheduler package will help you set a schedule for your function calls, which are also known as jobs. You can then use the Django admin interface to see which jobs ran and when. This way, your feeds are fetched and parsed automatically, without any intervention from an admin.

Prerequisites

You need to be familiar with the following ideas and tools to get the most out of this tutorial:

- Python fundamentals
- Setting up and using virtual environments
- Basic HTML and CSS knowledge
- The foundational concepts of Django, including its folder structure, URL routing, migrations, and how to build projects and apps

You might learn more if you just get started. If you run into trouble, you may pause and reread the materials above.

Step 1: Setting Up Your Project

You will have finished setting up your environment, installing your dependencies, and setting up Django by the time you finish this step.

Create your project directory first, then change into it:

$ mkdir pycasts
$ cd pycasts

Now that you're inside the project directory, create and activate a virtual environment. Use whichever tool you prefer. This example uses venv:

$ python3 -m venv .venv
$ source .venv/bin/activate
(.venv) $ python -m pip install --upgrade pip
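
If you'd like to confirm from inside Python that the virtual environment is active, one possible check is sketched below. It uses only the standard library. The helper name in_virtualenv is made up for this illustration and isn't part of the tutorial's project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Virtual environment active:", in_virtualenv())
```

Running this with the environment activated should report True; running it with your system interpreter should report False.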

With your environment activated and pip upgraded, you need to install the dependencies for this project. The tutorial's downloadable source code includes a requirements.txt file.

The file is in the source_code_setup/ subdirectory of the source code. Install the pinned requirements, replacing <path_to_requirements.txt> with the actual path to the downloaded file:

(.venv) $ python -m pip install -r <path_to_requirements.txt>

Django, feedparser, django-apscheduler, and their dependencies should now be installed.

Now that you have everything you need, set up Django and start building. Complete the following four tasks to finish this step:

Create your Django project in the current pycasts/ working directory
Create a podcasts Django app
Run the initial migrations
Create a superuser

Since you're already familiar with Django, this tutorial won't go into detail about these steps. Run the following commands:

(.venv) $ django-admin startproject content_aggregator .
(.venv) $ python manage.py startapp podcasts
(.venv) $ python manage.py makemigrations && python manage.py migrate
(.venv) $ python manage.py createsuperuser

After you follow Django's prompts and finish creating your superuser account, there's one more change to make before you test the app. Although the application will run without it, don't forget to add your podcasts app to INSTALLED_APPS in settings.py:

# content_aggregator/settings.py
# ...
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    # My Apps
    "podcasts.apps.PodcastsConfig",
]

It's time to test out your brand-new Django project. Start the Django server:

(.venv) $ python manage.py runserver

In your browser, go to localhost:8000 to view Django's default success page.

Step 2: Building Your Podcast Model

Your environment should be set up, your dependencies should be installed, and Django should run smoothly at this point. You will have defined, tested, and migrated a model for Podcast shows to the database by the end of this step.

There should be more to your Show model than just the information you want to capture as a developer. It should also reflect the information the user wants to see. It may be a mistake to dive into the code and begin writing your model immediately. If you do that, you may soon lose sight of your user's perspective. After all, applications are intended for users, including developers like you.

It can help to write down your project's requirements first, with pen and paper or whatever works for you.

When writing database models, this can be a useful strategy because it can save you from adding fields later and running unnecessary migrations.

List the requirements for your project from both the user's and the developer's perspectives:

As a user, I would like to:

Know the title of each show
Read a description of each show
Know when each show was published
Have a clickable URL so I can listen to the show
See a picture of the podcast so I can scroll to look for my favourite podcasts
See the podcast's name

As a developer, I would like to:

Have a unique attribute for each show so I can avoid storing duplicate shows in the database

This final aspect will be discussed in more detail in step 4 of this guide.
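
To make the developer requirement concrete, here is a small Django-free sketch of how a unique identifier can prevent duplicates. The FeedItem class, the only_new helper, and all sample values are hypothetical stand-ins for this illustration; the tutorial itself does this check against the database later on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedItem:
    """Hypothetical stand-in for one entry of a parsed feed."""

    guid: str
    title: str


def only_new(items, known_guids):
    """Return the items whose GUID hasn't been stored yet."""
    seen = set(known_guids)
    new_items = []
    for item in items:
        if item.guid not in seen:
            new_items.append(item)
            seen.add(item.guid)  # also skip duplicates within the same batch
    return new_items


stored = {"guid-1"}
batch = [
    FeedItem("guid-1", "Old show"),
    FeedItem("guid-2", "New show"),
    FeedItem("guid-2", "New show"),
]
print([item.title for item in only_new(batch, stored)])
```

Only the show with the unseen GUID survives the filter, no matter how often the same feed is fetched.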

Based on these requirements, the Show model in your podcasts app should look like this:

# podcasts/models.py
from django.db import models


class Show(models.Model):
    titles = models.CharField(max_length=200)
    desc = models.TextField()
    pub_dates = models.DateTimeField()
    link = models.URLField()
    image = models.URLField()
    podcast_names = models.CharField(max_length=100)
    guids = models.CharField(max_length=50)

    def __str__(self) -> str:
        return f"{self.podcast_names}: {self.titles}"

The built-in admin area is one of Django's most powerful features. Being able to interact with the shows in the admin area is as important as storing them in the database. Register the model in your podcasts/admin.py file as follows:

# podcasts/admin.py
from django.contrib import admin

from .models import Show


@admin.register(Show)
class ShowAdmin(admin.ModelAdmin):
    # Columns shown in the admin change list
    list_display = ("podcast_names", "titles", "pub_dates")

Before you migrate the model to the database, there's one more thing to do. As of Django 3.2, you can customize the type of the automatically created primary key. The new default is BigAutoField, as opposed to the integer AutoField default in earlier Django versions. If you ran the migrations right now, you'd see the following warning:

(models.W042) Auto-created primary key used when not defining a primary key type, by default 'django.db.models.AutoField'.
HINT: Configure the DEFAULT_AUTO_FIELD setting or the
PodcastsConfig.default_auto_field attribute to point to a subclass
of AutoField, e.g. 'django.db.models.BigAutoField'.

You can avoid this warning by adding an extra line to the PodcastsConfig class in the apps.py file:

# podcasts/apps.py
from django.apps import AppConfig


class PodcastsConfig(AppConfig):
    default_auto_field = "django.db.models.AutoField"
    name = "podcasts"
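
The hint in the message also mentions a project-wide setting. As a sketch of that alternative, assuming you'd rather configure the primary key type once for every app, you could add the following line to settings.py instead of the app-level attribute:

```python
# content_aggregator/settings.py (alternative, project-wide configuration)

# Applies to every app that doesn't set default_auto_field on its AppConfig.
DEFAULT_AUTO_FIELD = "django.db.models.AutoField"
```

Either approach should silence the message; the app-level attribute simply takes precedence for that one app.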

Your app is now configured to add a primary key to each model automatically. You also have a picture of what your data should look like and a model that represents it. You can now run the Django migrations to add your Show table to the database:

(.venv) $ python manage.py makemigrations
(.venv) $ python manage.py migrate

Now that the changes have been migrated, it's time to test them!

Because this tutorial covers a lot of ground, you'll use Django's built-in testing framework for your unit tests. After finishing the project, feel free to rewrite the unit tests with pytest or another testing framework if you prefer.

Add your tests to the podcasts/tests.py file:

# podcasts/tests.py
from django.test import TestCase
from django.utils import timezone

from .models import Show


class PodCastsTests(TestCase):
    def setUp(self):
        # Sample show that every test method can use
        self.show = Show.objects.create(
            titles="My Good Podcast Show",
            desc="Look dad, I made it!",
            pub_dates=timezone.now(),
            link="https://myawesomeshows.com",
            image="https://image.myawesomeshows.com",
            podcast_names="My Python Podcast",
            guids="de194720-7b4c-49e2-a05f-432436d3fetr",
        )

    def test_show_contents(self):
        self.assertEqual(self.show.desc, "Look dad, I made it!")
        self.assertEqual(self.show.link, "https://myawesomeshows.com")
        self.assertEqual(
            self.show.guids, "de194720-7b4c-49e2-a05f-432436d3fetr"
        )

    def test_show_str_representation(self):
        self.assertEqual(
            str(self.show), "My Python Podcast: My Good Podcast Show"
        )

In the code above, you use .setUp() to define a sample Show object.

To make sure the model behaves as expected, you test a few Show attributes. It's always a good idea to test the string representation of your models, which you defined in Show.__str__(). The string representation is what you'll see when debugging your code, so debugging is easier if it presents the data you expect.
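
If you want to experiment with the idea of testing a string representation without a database, the sketch below does so with the standard library's unittest module. PlainShow is a hypothetical, Django-free stand-in for the Show model, so this isn't the project's actual test code.

```python
import unittest
from dataclasses import dataclass


@dataclass
class PlainShow:
    """Hypothetical, Django-free stand-in for the Show model."""

    podcast_names: str
    titles: str

    def __str__(self) -> str:
        return f"{self.podcast_names}: {self.titles}"


class PlainShowTests(unittest.TestCase):
    def test_str_representation(self):
        show = PlainShow("My Python Podcast", "My Good Podcast Show")
        self.assertEqual(str(show), "My Python Podcast: My Good Podcast Show")


# Run the test case programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PlainShowTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```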

You can now run your tests:

(.venv) $ python manage.py test

Congratulations if your tests run successfully! Now that you have a solid foundation for your content aggregator and a clear data model, you can move on. Time for step three.

Step 3: Creating Your Homepage View

By this point, your Show model should be implemented in a Django application that runs smoothly and passes its unit tests. In this step, you'll set up the homepage's HTML template, add the necessary CSS and assets, add the homepage view to your views.py file, and test that the page renders correctly.

The downloaded source_code_setup/ folder contains static/ and templates/ folders. Copy these folders into your project root folder, pycasts/. Replace <source_code_setup_path> with the path where you saved the source code on your local machine, and don't forget the dot (.), which copies the folders into the current working directory:

(.venv) $ cp -r <source_code_setup_path>/static .
(.venv) $ cp -r <source_code_setup_path>/templates .

It's time to connect everything, so Django knows that the directories containing the HTML layouts and static files exist.

This tutorial uses Django 3, which relies on pathlib for file paths. Open settings.py in your main content_aggregator package, scroll down to the TEMPLATES section, and add the templates/ folder you just copied to the DIRS list:

# content_aggregator/settings.py
# ...
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [
            BASE_DIR / "templates",
        ],
        "APP_DIRS": True,
        "OPTIONS": {
            "context_processors": [
                "django.template.context_processors.debug",
                "django.template.context_processors.request",
                "django.contrib.auth.context_processors.auth",
                "django.contrib.messages.context_processors.messages",
            ],
        },
    },
]
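
If the slash between BASE_DIR and the folder name looks unusual, the sketch below shows how pathlib's / operator joins path segments. The project root used here is a made-up example path; in a real settings.py, Django computes BASE_DIR for you.

```python
from pathlib import PurePosixPath

# Hypothetical project root, used only for this illustration.
BASE_DIR = PurePosixPath("/home/user/pycasts")

# The / operator appends a segment and returns a new path object.
templates_dir = BASE_DIR / "templates"
static_dir = BASE_DIR / "static"

print(templates_dir)
print(static_dir)
```

The same operator would apply if you list the static/ folder in a setting such as STATICFILES_DIRS.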

Although Django is now aware of the existence of your static assets and template, you still need more. You still have a few items to cross off the list before you can connect everything you've done thus far:

Create a homepage view in views.py
Set up the URL routes
Expand your unit tests

The order in which you create your homepage view and URL routes doesn't matter. Both must be in place for the application to work properly, but you can start with the view class.

Open the views.py file in your podcasts app and replace the existing code with the following:

# podcasts/views.py
from django.views.generic import ListView

from .models import Show


class HomePageView(ListView):
    template_name = "homepage.html"
    model = Show

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # Pass only the ten most recent shows to the template
        context["shows"] = Show.objects.order_by("-pub_dates")[:10]
        return context

Django supports class-based views in addition to its better-known function-based views.

In the code above, you use a class-based view to send the podcast shows to the homepage:

HomePageView inherits from the ListView class so that you can iterate over the shows.
get_context_data() overrides the context data and limits it to the ten most recent shows, as determined by pub_dates, the date each show was published. You filter here because otherwise hundreds of shows could be passed to the homepage.
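
The ordering and slicing can be tried out without a database. The sketch below mimics the "newest first, top ten" logic on plain dictionaries. The sample records and the latest helper are invented for this illustration and only approximate what the ORM query does.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory records standing in for rows of the Show table.
start = datetime(2021, 1, 1)
shows = [
    {"titles": f"Show {n}", "pub_dates": start + timedelta(days=n)}
    for n in range(25)
]


def latest(shows, limit=10):
    """Return the `limit` most recently published shows, newest first."""
    ordered = sorted(shows, key=lambda show: show["pub_dates"], reverse=True)
    return ordered[:limit]


recent = latest(shows)
print(len(recent), recent[0]["titles"], recent[-1]["titles"])
```

Of the 25 sample records, only the ten with the latest publication dates reach the result, just as only ten shows reach the template.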

Now give your homepage a URL. First, create a urls.py file in your podcasts app:

# podcasts/urls.py
from django.urls import path

from .views import HomePageView

urlpatterns = [
    path("", HomePageView.as_view(), name="homepage"),
]

Then include the podcasts URLs in your project-level URL configuration:

# content_aggregator/urls.py
from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path("admin/", admin.site.urls),
    path("", include("podcasts.urls")),
]

Congrats on making it this far! You should now be able to start your app and see the homepage. As before, start your application with python manage.py runserver and go to localhost:8000.

The homepage appears to work, but it has no content yet. Even without content, you can use unit tests to verify that it will render correctly.

In step 2, you wrote unit tests for the model. You also created a Show object to test against in .setUp(). You can use the same test data to check that your homepage template works as expected.

In addition to checking that shows appear on the homepage, it's good practice to test that the correct template is used and that navigating to its URL returns a valid HTTP status code.

With a single-page application like this one, that might seem like overkill, and it probably is. However, as any application grows, you want to make sure that future changes don't break existing code. Also, if you're using this project as part of your portfolio, it's good to demonstrate that you understand testing best practices.

The code below shows the new test methods to add to your podcasts/tests.py file:

# podcasts/tests.py
from django.test import TestCase
from django.urls import reverse
from django.utils import timezone

from .models import Show


class PodCastsTests(TestCase):
    def setUp(self):
        # Sample show that every test method can use
        self.show = Show.objects.create(
            titles="My Good Podcast Show",
            desc="Look dad, I made it!",
            pub_dates=timezone.now(),
            link="https://myawesomeshows.com",
            image="https://image.myawesomeshows.com",
            podcast_names="My Python Podcast",
            guids="de194720-7b4c-49e2-a05f-432436d3fetr",
        )

    def test_show_contents(self):
        self.assertEqual(self.show.desc, "Look dad, I made it!")
        self.assertEqual(self.show.link, "https://myawesomeshows.com")
        self.assertEqual(
            self.show.guids, "de194720-7b4c-49e2-a05f-432436d3fetr"
        )

    def test_show_str_representation(self):
        self.assertEqual(
            str(self.show), "My Python Podcast: My Good Podcast Show"
        )

    def test_home_pages_status_code(self):
        response = self.client.get("/")
        self.assertEqual(response.status_code, 200)

    def test_home_pages_uses_correct_template(self):
        response = self.client.get(reverse("homepage"))
        self.assertTemplateUsed(response, "homepage.html")

    def test_homepage_list_contents(self):
        response = self.client.get(reverse("homepage"))
        self.assertContains(response, "My Good Podcast Show")
As before, you can run your unit tests with python manage.py test. Congratulations if all of your tests pass at this point!

In this step, you set up your HTML template and assets, built the view class, and connected your URL routing. You also added more passing unit tests. You're now ready to move on to the next step.

Step 4: Parsing Podcast RSS Feeds

Your application should be looking pretty good at this point! You already have everything you need to start adding content. By the end of this step, you should feel confident using the feedparser library to parse an RSS feed and extract the information you need.

Before you start parsing one, what exactly is an RSS feed? It's an XML document that a site publishes so that other programs can retrieve its latest items, such as podcast episodes, in a standard format.

To parse a feed with feedparser, you can use parse():

>>> import feedparser
>>> feed = feedparser.parse("https://realpythons.com/podcasts/rpp/feed")
>>>
>>> podcast_titles = feed.channel.title
>>> podcast_titles
'The Only Python Podcast'
>>>
>>> podcast_image = feed.channel.image["href"]
>>> podcast_image
'https://files.realpythons.com/media/real-python-logo-square.28456774fda9228.png'
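
To see the kind of structure that feedparser reads, the sketch below parses a tiny hand-written feed with the standard library's XML parser. Every value in SAMPLE_RSS is made up, and using xml.etree.ElementTree here is a stand-in for illustration only: the tutorial's project uses feedparser, which copes with many more feed variations.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed; every value here is made up for illustration.
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <guid>example-guid-1</guid>
      <pubDate>Mon, 01 Mar 2021 12:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Episode 2</title>
      <guid>example-guid-2</guid>
      <pubDate>Mon, 08 Mar 2021 12:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_RSS.encode("utf-8"))
channel = root.find("channel")

# The channel describes the podcast; each item describes one show.
print(channel.findtext("title"))
for item in channel.findall("item"):
    print(item.findtext("guid"), item.findtext("title"))
```

The channel-level title and the per-item guid, title, and pubDate elements correspond to the values that the tutorial reads from the parsed feed.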

A feed parsed with feedparser also gives you access to a special attribute called .entries. It lets you iterate over each of the feed's items. Once you've added podcast shows to your database, you can use .entries to check the GUID of each show against the ones you've already stored.

Note: Don't put the following code snippet into practice yet. Just read it through to understand how to use feedparser. You'll write similar code in the next step, when you create a Django custom command and add it to your project.

The feed provides each show's publication date as a string, so you should convert it into a datetime object before saving it. The example below uses dateutil's parser for this:

# Example
import feedparser
from dateutil import parser

from podcasts.models import Show

feed = feedparser.parse("https://realpythons.com/podcasts/rpp/feed")
podcast_titles = feed.channel.title
podcast_image = feed.channel.image["href"]

for item in feed.entries:
    # Only save shows whose GUID isn't in the database yet
    if not Show.objects.filter(guids=item.guid).exists():
        show = Show(
            titles=item.title,
            desc=item.description,
            pub_dates=parser.parse(item.published),
            link=item.link,
            image=podcast_image,
            podcast_names=podcast_titles,
            guids=item.guid,
        )
        show.save()
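
The date conversion is easy to try in isolation. The sketch below uses the standard library's email.utils.parsedate_to_datetime instead of dateutil, which is a swapped-in alternative for illustration and not what the tutorial's code uses. The date string is a made-up example in the RFC 822 style that RSS feeds commonly use.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Made-up publication date in the style commonly found in RSS feeds.
published = "Mon, 01 Mar 2021 12:00:00 +0000"

# The result is a timezone-aware datetime object.
pub_date = parsedate_to_datetime(published)
print(pub_date.isoformat())
```

A timezone-aware datetime like this one is the kind of value a DateTimeField expects.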

You aren't yet placing this code into a file since you don't have a reliable mechanism to execute it inside of Django. Explore using a custom command to run your parsing function now that you are familiar with using feedparser.

Step 5: Constructing a Custom Django Command

Although you learned how to utilise feedparser in the previous step, you still lacked a practical means of executing code that communicates with the Django ORM.

Custom commands use the manage.py file to run your code.
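
Django looks for custom commands in a management/commands/ folder inside an app, and the file name becomes the command name. As a sketch of that layout, the snippet below creates the folders and an empty startjobs.py in a throwaway directory. The create_command_file helper is hypothetical; in your project you could just as well create these folders and files by hand.

```python
import tempfile
from pathlib import Path


def create_command_file(app_dir: Path, command_name: str) -> Path:
    """Create management/commands/<command_name>.py inside an app folder."""
    commands_dir = app_dir / "management" / "commands"
    commands_dir.mkdir(parents=True, exist_ok=True)
    # __init__.py files mark both folders as regular Python packages.
    for package_dir in (commands_dir.parent, commands_dir):
        (package_dir / "__init__.py").touch()
    command_file = commands_dir / f"{command_name}.py"
    command_file.touch()
    return command_file


# Demonstrated in a temporary directory rather than in your real project.
with tempfile.TemporaryDirectory() as tmp:
    created = create_command_file(Path(tmp) / "podcasts", "startjobs")
    relative = created.relative_to(tmp).as_posix()
    print(relative)
```

With a file at that location, the command can be run as python manage.py startjobs.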

Now add the RSS parsing code from the previous step to your custom command and try adding some entries to your database. Update your startjobs.py file with the following code:

# podcasts/management/commands/startjobs.py
from django.core.management.base import BaseCommand

import feedparser
from dateutil import parser

from podcasts.models import Show


class Command(BaseCommand):
    def handle(self, *args, **options):
        feed = feedparser.parse("https://realpythons.com/podcasts/rpp/feed")
        podcast_titles = feed.channel.title
        podcast_image = feed.channel.image["href"]

        for item in feed.entries:
            # Only save shows whose GUID isn't in the database yet
            if not Show.objects.filter(guids=item.guid).exists():
                show = Show(
                    titles=item.title,
                    desc=item.description,
                    pub_dates=parser.parse(item.published),
                    link=item.link,
                    image=podcast_image,
                    podcast_names=podcast_titles,
                    guids=item.guid,
                )
                show.save()

This time, nothing prints to the screen when you run the custom command with python manage.py startjobs, but you should now see shows from The Only Python Podcast on your homepage. Give it a try. If the shows appear, congratulations: it worked!

Now that you've explored custom commands and set up your first feed, you'll learn how to add more feeds in the next step.

Step 6: Adding Additional Feeds to Your Content Aggregator

At this point, your custom command for parsing The Only Python Podcast feed should be operational. You will have learned how to add more feeds to your custom command at the end of this step.

You might be tempted to repeat the same code for each podcast feed now that you have a single feed that your custom command can successfully parse. But that wouldn't be a good way to code. You want code that is DRY and easier to maintain.

You could keep a list of feed URLs and run the parsing code on each one by iterating over the list. In most cases, that would work. However, because of the way django-apscheduler works, there's a better alternative. More on that in the following section.

Instead, you'll refactor your code into one parsing function plus a separate function for each feed you need to parse. For now, you'll call these functions one after another.

Note: As mentioned at the beginning of this tutorial, you're concentrating on just two feeds for now. Once you've finished the tutorial and understand how to add more, you can practise by choosing other RSS feeds to add.

Michael Kennedy of the Talk Python to Me podcast kindly gave his permission to use his podcast feed in this guide. Thank you, Michael!

Here's how this looks in your code:

# podcasts/management/commands/startjobs.py
from django.core.management.base import BaseCommand

import feedparser
from dateutil import parser

from podcasts.models import Show


def save_new_shows(feed):
    """Saves new shows to the database.

    Checks the show GUIDs against the shows currently stored in the
    database. If not found, then a new `Show` is added to the database.

    Args:
        feed: requires a feedparser object
    """
    podcast_titles = feed.channel.title
    podcast_image = feed.channel.image["href"]

    for item in feed.entries:
        if not Show.objects.filter(guids=item.guid).exists():
            show = Show(
                titles=item.title,
                desc=item.description,
                pub_dates=parser.parse(item.published),
                link=item.link,
                image=podcast_image,
                podcast_names=podcast_titles,
                guids=item.guid,
            )
            show.save()


def fetch_realpythons1_shows():
    """Fetches new shows from RSS for The Only Python Podcast."""
    _feed = feedparser.parse("https://realpythons.com/podcasts/rpp/feed")
    save_new_shows(_feed)


def fetch_talkpython1_shows():
    """Fetches new shows from RSS for the Talk Python to Me Podcast."""
    _feed = feedparser.parse("https://talkpython.fm/shows/rss")
    save_new_shows(_feed)


class Command(BaseCommand):
    def handle(self, *args, **options):
        fetch_realpythons1_shows()
        fetch_talkpython1_shows()

As mentioned, you made the parsing code reusable by separating it from the individual feeds. For each additional feed, you add a new top-level function. In this example, you've done this for The Only Python Podcast and the Talk Python to Me podcast with fetch_realpythons1_shows() and fetch_talkpython1_shows(), respectively.

You can proceed to the following phase, where you'll examine how to automate the custom command and set a schedule for running it now that you understand how to add more feeds to your program.

Step 7: Scheduling Tasks With django-apscheduler

By now, you should have at least two RSS feeds ready to be parsed each time you run your new custom command.

The django-apscheduler package is a Django implementation of the APScheduler library.

Note: See the official APScheduler docs for more information about APScheduler and all the parameters you can use. On the project's GitHub page, you can get additional information about django-apscheduler.

django-apscheduler is already installed in your virtual environment. To enable it in your application, you also need to add it to INSTALLED_APPS in your settings.py file:

# content_aggregator/settings.py
# ...
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    # My Apps
    "podcasts.apps.PodcastsConfig",
    # Third Party Apps
    "django_apscheduler",
]

Because django-apscheduler stores its data in the database, run python manage.py migrate again to create its tables. Now that it's installed in your application, here's a brief look at how it works. Please refer to the official docs for more in-depth explanations.

A job is a function call that you want to run on a schedule from your custom command. Your application will have three jobs in total: one for each podcast feed you want to parse and a third for deleting old job executions from the database.

The django-apscheduler package stores your jobs in the database, along with a history of all successful and unsuccessful job runs. As the site administrator or developer, this history is valuable because it lets you see whether any jobs have failed. However, if these records aren't cleaned up regularly, the table will keep growing, so it's good practice to delete old history from the database. This cleanup will also run on a schedule.
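
The idea behind the cleanup is an age cutoff: anything older than a chosen maximum age gets deleted. The sketch below shows that rule in plain Python. The one-week retention period, the is_expired helper, and the sample timestamps are assumptions made for this illustration, not values taken from django-apscheduler.

```python
from datetime import datetime, timedelta

MAX_AGE_SECONDS = 7 * 24 * 60 * 60  # one week; an assumed retention period


def is_expired(run_time, now, max_age=MAX_AGE_SECONDS):
    """Return True when a job run happened more than `max_age` seconds ago."""
    return now - run_time > timedelta(seconds=max_age)


# Made-up timestamps standing in for stored job executions.
now = datetime(2021, 3, 15, 0, 0)
runs = [datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 14, 9, 0)]

kept = [run for run in runs if not is_expired(run, now)]
print(kept)
```

Only the run from the previous day is kept; the two-week-old run falls outside the retention period.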

Even though the job history will be stored in the database, logging any errors for debugging is a good idea. This code can be added to settings.py to add some basic logging settings to your application:

# content_aggregator/settings.py
# ...
LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
        },
    },
    "root": {
        "handlers": ["console"],
        "level": "INFO",
    },
}
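
Django hands the LOGGING dictionary to the standard library's logging.config.dictConfig by default, so you can try the same configuration outside Django. In the sketch below, the logger name podcasts.demo and the log messages are invented for illustration.

```python
import logging
import logging.config

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        "console": {
            "class": "logging.StreamHandler",
        },
    },
    "root": {
        "handlers": ["console"],
        "level": "INFO",
    },
}

# Load the same dictionary that Django would load from settings.py.
logging.config.dictConfig(LOGGING)

logger = logging.getLogger("podcasts.demo")  # hypothetical logger name
logger.info("Added job: example")  # emitted, because INFO meets the root level
logger.debug("Not shown")  # dropped, because DEBUG is below INFO
```

Because the root logger is set to INFO, every logger in your project inherits that threshold unless it's configured otherwise.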

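Your custom command file should now look like this. The new imports, the logger, and delete_old_job_executions() are shown because the command class relies on them:

# podcasts/management/commands/startjobs.py
# Standard library
import logging

# Django
from django.conf import settings
from django.core.management.base import BaseCommand

# Third party
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.triggers.cron import CronTrigger
from django_apscheduler.jobstores import DjangoJobStore
from django_apscheduler.models import DjangoJobExecution

logger = logging.getLogger(__name__)

# ... (the feedparser, dateutil, and Show imports, save_new_shows(),
# and the two fetch functions from step 6 stay unchanged)


def delete_old_job_executions(max_age=604_800):
    """Deletes job execution logs older than `max_age` seconds (one week)."""
    DjangoJobExecution.objects.delete_old_job_executions(max_age)


class Command(BaseCommand):
    help = "Runs apscheduler."

    def handle(self, *args, **options):
        scheduler = BlockingScheduler(timezone=settings.TIME_ZONE)
        scheduler.add_jobstore(DjangoJobStore(), "default")

        scheduler.add_job(
            fetch_realpythons1_shows,
            trigger="interval",
            minutes=2,
            id="The Only Python Podcast",
            max_instances=1,
            replace_existing=True,
        )
        logger.info("Added job: The Only Python Podcast.")

        scheduler.add_job(
            fetch_talkpython1_shows,
            trigger="interval",
            minutes=2,
            id="Talk Python lang Feed",
            max_instances=1,
            replace_existing=True,
        )
        logger.info("Added job: Talk Python lang Feed.")

        scheduler.add_job(
            delete_old_job_executions,
            trigger=CronTrigger(
                day_of_week="mon", hour="00", minute="00"
            ),  # Midnight on Monday, before the start of the next work week.
            id="Delete Old Job Executions",
            max_instances=1,
            replace_existing=True,
        )
        logger.info("Added weekly job: Delete Old Job Executions.")

        try:
            logger.info("Starting scheduler...")
            scheduler.start()
        except KeyboardInterrupt:
            logger.info("Stopping scheduler...")
            scheduler.shutdown()
            logger.info("Scheduler shut down successfully!")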
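
To get a feel for when the weekly cleanup job fires, the sketch below computes the next "Monday at midnight" after a given moment in plain Python. It only approximates what the cron trigger in the listing describes and is not APScheduler's implementation; the function name and sample dates are invented for this illustration.

```python
from datetime import datetime, timedelta


def next_monday_midnight(now):
    """Return the first Monday 00:00 that lies strictly after `now`."""
    midnight_today = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_ahead = (7 - midnight_today.weekday()) % 7  # Monday is weekday 0
    candidate = midnight_today + timedelta(days=days_ahead)
    if candidate <= now:
        # Today is Monday and midnight has already passed, so wait a week.
        candidate += timedelta(days=7)
    return candidate


# A Wednesday afternoon and a Monday morning as sample inputs.
print(next_monday_midnight(datetime(2021, 3, 3, 15, 30)))
print(next_monday_midnight(datetime(2021, 3, 1, 10, 0)))
```

Both sample inputs lead to the same run time, Monday 8 March 2021 at midnight. The two feed jobs use an interval trigger instead, so they simply repeat every two minutes after the scheduler starts.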

As before, you can run the custom command in one terminal with python manage.py startjobs. In a second terminal process, start your Django server. You can now browse to your admin dashboard and see that your jobs have been registered, along with their execution history.

There was a lot going on in this final step, but you finished, and now your app works! You've learned how to use django-apscheduler to run a custom command automatically on a schedule. That's no easy task. Good work!

It's never quick or easy to create a project such as a content aggregator from scratch, so you should feel proud of yourself for seeing it through to completion. No matter how senior you get, trying something new and a little challenging will only help you grow as a developer.