How to Download Files from URLs using Python

In this tutorial, we will discuss how to download files from the web in Python using both the standard library and a third-party library. We will also explore streaming data to download large files in manageable chunks and implementing parallel downloads using a pool of threads.

Python provides a comprehensive array of tools and libraries that are valuable across a range of applications, spanning from web parsing to script automation and analysis of the acquired data. Among them is the ability to fetch files from a URL, an important task when acquiring the information a program needs.

Simplifying File Downloads Using Python

Apart from programming languages, we can also download files using command-line tools. However, Python provides several libraries that facilitate file retrieval, and it offers various advantages over command-line tools. Flexibility is a key benefit, attributed to Python's diverse range of libraries. These resources encompass effective techniques for managing various file formats, protocols, and authentication approaches. This allows us to select the Python tools that suit our objectives, whether we're retrieving content from a basic CSV file or an intricate binary file.

Another point is portability, that is, how well your code can move around. Sometimes you might work on applications that need to run on different types of machines. In these cases, Python is a good option because it is cross-platform: your Python code will work the same way on Windows, Linux, and macOS.

Using Python also allows us to automate tasks that we would otherwise do manually, which saves time and energy. For example, we can set up Python to retry a download if it fails the first time, fetch many files from the internet and save them automatically, and organize the data by putting it where we want it.

Downloading a File From a URL in Python

In this section, we will get started by downloading a ZIP file that contains information about the money a country makes, known as the Gross Domestic Product (GDP). You will use two tools in Python, urllib and requests, to fetch this GDP data for different countries. Although Python already has a built-in package called urllib, it has some limitations, so we'll also discover how to use a third-party package called requests. This one is quite popular and offers more features when you want to fetch things from the internet.

Using urllib From the Standard Library

Python comes with a library called "urllib" that helps you easily work with web resources. It's designed to be easy to use, which is great when you're starting out or working on smaller projects. This library lets you do various jobs related to the web, like parsing website addresses, getting information from websites, saving files, and dealing with problems that might happen while you're working on the web.

Since urllib is part of Python's standard library, you don't need to install additional packages to use it. This makes it really easy to work with, and you can use it both for development and in production. It also works the same way on different operating systems, so you don't have to change anything if you switch between Windows, Linux, or macOS.

On top of that, you can make the urllib toolbox even more powerful by combining it with third-party helpers such as requests, BeautifulSoup, and Scrapy. This allows you to do more advanced things like collecting information from websites and talking to web services.

To download a file from a URL using the urllib package, you can call urlretrieve() from the urllib.request module. This function fetches a web resource from the specified URL and then saves the response to a local file. To start, import urlretrieve() from urllib.request.

After that, we need to define the URL from which we want to retrieve the data and also specify where we want to save it. Otherwise, the data will be saved to a temporary file. Since we know that we'll be getting a ZIP file from that website, we can choose where it should go by providing a path for the file.

After that, you can download the file using urlretrieve(). You just need to provide it the web address and the path of the file on your computer.

The function returns a tuple of two things: the path where your file was saved and an HTTP message object. When you don't define a path for your file, it might show you a path to a temporary file like /tmp/tmps7qjl1tj. The message object represents the HTTP headers returned by the web server. It carries details about the content type, content length, and other metadata.

Let's see the complete code -

Example -
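Here is a minimal sketch of the complete download. The World Bank indicator URL and the local file name are assumptions chosen to match the file named in the response headers; adjust them to your own resource.

from urllib.request import urlretrieve

# URL of the World Bank GDP indicator download (assumed for illustration)
url = (
    "https://api.worldbank.org/v2/en/indicator/"
    "NY.GDP.MKTP.CD?downloadformat=csv"
)

# Local path where the downloaded ZIP file should be stored
filename = "gdp_by_country.zip"

# Fetch the resource and unpack the (path, headers) tuple it returns
path, headers = urlretrieve(url, filename)

# Print the HTTP headers sent back by the server
for name, value in headers.items():
    print(name, value)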

Output:

Date Thu, 19 Aug 2023 00:00:00 GMT
Server Apache
X-Powered-By PHP/5.6.33
Content-Encoding gzip
Content-Disposition attachment; filename=API_NY.GDP.MKTP.CD_DS2_en_csv_v2_2760797.csv
Content-Type application/zip
Connection close
Transfer-Encoding chunked

This info can be handy when you're not sure what type of file you've downloaded and how to interpret its contents. Here, it's a ZIP file that's around 128 kilobytes in size.

Now that you've learned how to get a file from a web address using Python's urllib library, let's try doing the same thing using a third-party library. This will help you figure out which method is easier.

Using the Third-Party Requests Library

Although urllib is a good tool that comes with Python, there might be times when you want to use a third-party package, especially for trickier jobs like making customized web requests or handling certain types of authentication. One such package is the requests library. It's widely used, easy to use, and follows Python's style, and it manages the complex parts of communicating with the internet for you.

The requests library is known for being adaptable and giving you control over how things get downloaded. This means you can change things to match exactly what your project needs. For example, you can decide how the request looks, manage cookies, access pages that need a login, stream data bit by bit, and do even more things like that.

Furthermore, this library is built to work efficiently, with many useful abilities that improve how things are downloaded. It's good at dealing with things like keeping connections open for reuse, which reduces network overhead and makes transfers faster. This way, things get done without wasting time.

We can install this library using the pip command.
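For example, you can run the following command in a terminal (the exact invocation may vary with your setup):

python -m pip install requests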

After installing it, we can import it into our script. As we know, when we make HTTP requests to web services, there are two common methods to use -

  • HTTP GET
  • HTTP POST

You'll make use of the GET method to retrieve information by fetching a copy of a resource from the remote server, without changing anything on that server. This method is commonly used for getting things like images, web pages, or raw data. You'll use the GET method later on.

On the other hand, the POST method lets you send data that the server will use or process to create or change something. With POST, the data is usually sent in the body of the request, so you won't see it in the web address. You can use POST for tasks that change things on the server, such as creating or updating resources.

First import the requests library and define the URL of the file that you want to download. To include additional query parameters in the URL, you'll pass in a dictionary of strings as key-value pairs.
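A minimal sketch of that request is shown below; the World Bank indicator URL is an assumption carried over from the earlier example.

import requests

# Base URL of the GDP indicator (assumed for illustration)
url = "https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD"

# Extra query parameters, passed as a dictionary of key-value pairs
query_parameters = {"downloadformat": "csv"}

# Send an HTTP GET request; requests appends the parameters to the URL
response = requests.get(url, params=query_parameters)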

In the previous instance, you set up the URL like before, but you also add the "downloadformat=csv" part by passing a Python dictionary of key-value pairs. When you give this dictionary to `requests.get()`, the library appends these extra parameters to the web address automatically.

Using a GET request, the library fetches the information from the web address you defined, including any extra query parameters, and returns the server's reply as an HTTP response object. If you want to look at the whole web address it used, along with the extra parameters, you can find it in the `.url` attribute of the response.

The response you get also has some other useful information you can look into. For instance, you can use the `.ok` and `.status_code` attributes to figure out whether the request was successful and which HTTP status code the web server returned.
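For example, continuing from the request above:

# Full URL that was actually requested, including the query parameters
print(response.url)

# True if the status code indicates success (for example, 200)
print(response.ok)

# Numeric HTTP status code returned by the server
print(response.status_code)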

Storing Downloaded Content to File

Now that you've fetched content from the internet using the requests library, you can save it to your computer. When writing to a file with Python, it's a really good idea to use the `with` statement. This way, Python manages the file for you and closes it when you're done.
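A minimal sketch, assuming the response from the previous request and a local file name of your choosing:

# Write the downloaded bytes to a local ZIP file.
# The with block guarantees the file is closed afterwards.
with open("gdp_by_country.zip", mode="wb") as file:
    file.write(response.content)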

Downloading a Large File in a Streaming Fashion

You've learned how to download a single ZIP file using both the built-in urllib library and the third-party requests library. But if you need to download a large file, you might run into problems with the approach above because the whole download won't fit in your computer's memory.

To solve this problem, you can fetch big files in a special way called "streaming". This means you don't try to load the whole file at once, but instead take small parts at a time. Streams let you work with these small parts, so you don't use up your entire computer's memory and things go faster.

Streaming data has benefits in other situations too, like when you're downloading files in Python. Following are some of the advantages -

  • Download and process a file in small chunks - This can be really useful when a web resource's rules say you can't take too much at once. With streaming, you can work within these rules and fetch and process a file part by part, each part being a smaller bit.
  • Work with the data in real-time - When you do this, you can use the information from the files you're getting right away. This way, you can start using what you have while the rest of the data keeps coming.
  • Stop and start the downloading - This lets you get some of the file, then stop if you need to, and then start again from where you stopped. You don't have to begin from the very beginning each time.

When you want to get a really big file bit by bit, you make the request but don't take everything at once. Instead, you start by getting just the information about what's coming and keep the connection open. This is done by passing a special setting to the `requests.get()` function.

We will try downloading a big file (about 72 megabytes) that contains the World Development Indicators from World Bank Open Data.

When you use `stream=True`, the requests library asks for the file in a special way that lets you get it bit by bit. It's like asking for just the front part of the file first: only the response headers are fetched initially, and you can look at them through the `.headers` attribute of the object you got back.

Example -
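A minimal sketch, assuming a direct download URL for the World Development Indicators ZIP bundle (the exact URL is an assumption):

import requests

# Direct link to the World Development Indicators CSV bundle (assumed)
url = "https://databank.worldbank.org/data/download/WDI_CSV.zip"

# stream=True fetches only the headers now and keeps the connection open
response = requests.get(url, stream=True)

# Inspect the headers before downloading the body
print(response.headers)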

Output:

{'Date': 'Wed, 28 Jun 2023 12:53:58 GMT',
 'Content-Type': 'application/x-zip-compressed',
 'Content-Length': '71855385',
 'Connection': 'keep-alive',
 'Last-Modified': 'Thu, 11 May 2023 14:56:30 GMT',
 'ETag': '0x8DB522FE768EA66',
 'x-ms-request-id': '8490ea74-101e-002f-73bf-a9210b000000',
 'x-ms-version': '2009-09-19',
 'x-ms-lease-status': 'unlocked',
 'x-ms-blob-type': 'BlockBlob',
 'Cache-Control': 'public, max-age=3600',
 'x-azure-ref': '20230628T125357Z-99z2qrefc90b99ypt8spyt0dn40000...8dfa',
 'X-Cache': 'TCP_MISS',
 'Accept-Ranges': 'bytes'}

You might have noticed one of the headers ('Connection': 'keep-alive') mentioning that the server is keeping the connection open. This is called an "HTTP persistent connection". It's like keeping a phone line open for talking instead of dialing and hanging up each time. Without it, every request would need a fresh connection, which takes more time and effort.

Another benefit of using streaming mode in the requests library is that you can get the data in chunks, even though you send only a single request. To do this, you can use the `.iter_content()` method of the response object. This method lets you go through the data in small, easy-to-handle chunks. Also, you can choose how big these chunks are with the `chunk_size` parameter, which tells the program how many bytes to read at once.
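A minimal sketch of a chunked download, continuing from the streamed response above (the local file name is an assumption):

# Save the large file to disk one chunk at a time
with open("WDI_CSV.zip", mode="wb") as file:
    # Read roughly 10 kilobytes of the body per iteration
    for chunk in response.iter_content(chunk_size=10 * 1024):
        file.write(chunk)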

The requests library provides a wide range of capabilities that can be useful in various download situations. While we haven't gone deeply into these tasks, the library includes functions for dealing with authentication, managing redirects, controlling sessions, and more. These capabilities offer greater control and flexibility, especially for more complex tasks.

Conclusion

Python can be a powerful tool for automating file downloads and gaining precise control and adaptability in the process. Python provides a range of solutions for downloading files from URLs. This includes downloading large files, handling multiple downloads, and retrieving files from websites that require special access permissions.

In this tutorial, you've grasped the process of downloading files using Python. You've also learned how to:

  • Use both native and third-party libraries to download files from the internet.
  • Implement data streaming to efficiently download large files in smaller, more manageable segments.





