Crop Yield Prediction Using Machine Learning

Crop yield prediction is an important aspect of agriculture that helps farmers make informed decisions about their crops. It involves estimating the amount of a crop that will be produced in a given area based on factors such as soil type, weather conditions, and crop management practices. In recent years, machine learning (ML) has emerged as a powerful tool for predicting crop yields.

Machine learning is a branch of artificial intelligence (AI) that allows computers to learn from data without being explicitly programmed. This makes it ideal for crop yield prediction because it can identify patterns and relationships in large amounts of data and make predictions based on these relationships.

There are various types of machine learning algorithms that can be used for crop yield prediction, including regression, decision trees, and artificial neural networks.

Regression algorithms are commonly used for predicting crop yields because they are simple to understand and easy to implement. These algorithms use a set of inputs (such as weather data, soil data, and management practices) to predict the output (crop yield).

Decision tree algorithms are also used for crop yield prediction. They use a tree-like structure to model decisions and their potential consequences. The algorithm starts by making a decision based on the most important input factor and then continues to make additional decisions based on subsequent inputs. The final result of the algorithm is a prediction of crop yield.

Artificial neural networks are a more complex type of machine learning algorithm that is modelled after the structure and function of the human brain. They are particularly well suited for crop yield prediction because they can handle large amounts of data and identify complex patterns and relationships.

To implement machine learning for crop yield prediction, a large dataset of crop yield data is required. This data should include information about the crop, such as the type of crop, the location, and the date of planting. Additionally, data on weather conditions and soil characteristics should also be collected. The machine learning algorithm is then trained on this data to learn the relationships between the inputs and outputs.

Once the machine learning algorithm has been trained, it can be used to make predictions about crop yields in new areas. This is done by inputting the necessary data (such as weather conditions and soil characteristics) and allowing the algorithm to make a prediction.
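The train-then-predict workflow described above can be sketched as follows. This is a minimal illustration, not the model built later in this article: the two input columns, the synthetic data, and the linear relationship are all made-up assumptions chosen so the example is deterministic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, made-up inputs: columns are [rainfall_mm, avg_temp_c].
rng = np.random.default_rng(0)
X_train = rng.uniform([300.0, 10.0], [2000.0, 30.0], size=(50, 2))

# Hypothetical linear relationship, used only to generate toy labels.
y_train = 20.0 * X_train[:, 0] + 500.0 * X_train[:, 1] + 1000.0

# Train the model on the known inputs and outputs.
model = LinearRegression().fit(X_train, y_train)

# Predict the yield for a new, unseen area.
X_new = np.array([[1200.0, 22.0]])
predicted_yield = model.predict(X_new)[0]
print(round(predicted_yield, 1))
```

Because the toy labels are exactly linear in the inputs, the fitted model recovers the relationship and predicts roughly 36000 for the new area. Real yield data is noisy, so no such exact recovery should be expected.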

In this article, machine learning techniques are used to predict the yields of the 10 most consumed crops worldwide.

These crops are:

Cassava
Maize
Plantains and others
Potatoes
Rice, paddy
Sorghum
Soybeans
Sweet potatoes
Wheat
Yams

Now we will implement it in the code.

1. Importing Libraries

import numpy as np
import pandas as pd

2. Gathering and Cleaning Data

Gathering and cleaning data is an essential step in machine learning, as it can significantly impact the accuracy and performance of the model.

Crops Yield Data

The ten most popular crops in the world in terms of yield were taken from the FAO website. The information gathered consists of the nation, item, year (from 1961 to 2016), and yield value.
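The loading step is not shown in the article. A minimal sketch is given below, assuming the FAO export was saved as a CSV file. The helper name load_yield, the file name 'yield.csv', and the sample rows are hypothetical and only stand in for the real file.

```python
import io
import pandas as pd

def load_yield(source):
    """Read the yield CSV from a file path or a file-like object."""
    return pd.read_csv(source)

# Toy stand-in for the real file: illustrative values, only a few columns.
sample_csv = io.StringIO(
    "Area,Item,Year,Value\n"
    "Albania,Maize,1990,36613\n"
    "Albania,Potatoes,1990,66667\n"
)

# With the real data this would be something like load_yield('yield.csv').
dataframe_yield = load_yield(sample_csv)
print(dataframe_yield.shape)
dataframe_yield.head()
```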

Output:

The yield dataset has 56717 rows and 12 columns.

Output:

Looking more closely at the columns in the CSV file, we can rename Value to hg/ha_yield to make it clear that this column holds the crop yield in hectograms per hectare. Extraneous columns such as the area code, domain, and item code should also be removed.

# Renaming the column.
dataframe_yield.rename( columns={"Value": "hg/ha_yield"}, inplace=True)
dataframe_yield.head()

Output:

# dropping the unwanted columns.
dataframe_yield = dataframe_yield.drop(['Year Code','Element Code', 'Element','Area Code','Domain Code', 'Domain','Unit','Item Code'], axis=1)
dataframe_yield.head()

Output:

Climate Data: Rainfall

Precipitation and temperature are climatic factors. Together with other abiotic factors such as soil and pesticides, they make up the environmental variables that affect plant growth and development.

The impact of rainfall on agriculture is significant. Information on annual rainfall was acquired for this project from the World Data Bank.

dataframe_rain = pd.read_csv('rainfall.csv')
dataframe_rain.head()

Output:

Now, we will check for the datatypes in the dataset.

Output:

We need to change the datatype of average_rain_fall_mm_per_year from object to float. Keep in mind that the column also contains some missing values.

dataframe_rain['average_rain_fall_mm_per_year'] = pd.to_numeric(dataframe_rain['average_rain_fall_mm_per_year'],errors = 'coerce')
dataframe_rain.info()

Output:
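To see what errors='coerce' does in the conversion above, here is a small toy example. The three values are made up, and the ".." placeholder is only an assumption about how a non-numeric entry might look in the raw file.

```python
import pandas as pd

# Toy column: numbers stored as text, plus one non-numeric placeholder.
rain = pd.DataFrame({"average_rain_fall_mm_per_year": ["327", "1485", ".."]})

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
rain["average_rain_fall_mm_per_year"] = pd.to_numeric(
    rain["average_rain_fall_mm_per_year"], errors="coerce"
)

n_missing = int(rain["average_rain_fall_mm_per_year"].isna().sum())
print(rain["average_rain_fall_mm_per_year"].dtype, n_missing)
```

The column ends up as a float column with one missing value, which is why rows with missing values have to be dropped afterwards.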

After that, we will remove any blank rows from the dataset and combine the yield dataframe with the rain dataframe according to the year and area columns.

#Dropping empty rows
dataframe_rain = dataframe_rain.dropna()
dataframe_rain.describe()

Output:

The rainfall dataframe spans the years 1985 through 2016.

We will now combine the yield dataframe with the rain dataframe according to the year and area columns.

#Merging
dataframe_main = pd.merge(dataframe_yield, dataframe_rain, on=['Year','Area'])
dataframe_main.head()

Output:

Because the rainfall data starts in 1985, the merged dataframe now also starts in 1985, even though the yield dataframe on its own started in 1961.
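The reason is that pd.merge performs an inner join by default, so only (Year, Area) pairs present in both dataframes survive. A toy illustration with made-up values:

```python
import pandas as pd

# Toy yield data that starts in 1961.
yield_toy = pd.DataFrame({
    "Area": ["Albania"] * 3,
    "Year": [1961, 1985, 1986],
    "hg/ha_yield": [10000, 20000, 21000],
})

# Toy rainfall data that only starts in 1985.
rain_toy = pd.DataFrame({
    "Area": ["Albania"] * 2,
    "Year": [1985, 1986],
    "average_rain_fall_mm_per_year": [1485.0, 1485.0],
})

# Default how='inner': the 1961 row has no rainfall match and is dropped.
merged = pd.merge(yield_toy, rain_toy, on=["Year", "Area"])
first_year = int(merged["Year"].min())
print(len(merged), first_year)
```

If the earlier years needed to be kept, how='left' could be passed instead, at the cost of missing rainfall values in those rows.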

Output:

Pesticides Data

Data on pesticide use for each nation and item was also gathered from the FAO database.

dataframe_pesticide = pd.read_csv('pesticides.csv')
dataframe_pesticide.head()

Output:

Now, we will rename the Value column to pesticides_tonnes.

We will also drop the unwanted columns, which are not needed later on.

#Renaming the column
dataframe_pesticide = dataframe_pesticide.rename(index=str, columns={"Value": "pesticides_tonnes"})
#Dropping the unwanted column
dataframe_pesticide = dataframe_pesticide.drop(['Element','Domain','Unit','Item'], axis=1)
dataframe_pesticide.head()

Output:

Now we will merge the Pesticides dataframe with the main dataframe

#merging
dataframe_main = pd.merge(dataframe_main, dataframe_pesticide, on=['Year','Area'])
dataframe_main.shape

Output:

Average Temperature

Each country's average temperature was determined using World Bank data.

dataframe_temp= pd.read_csv('temp.csv')
dataframe_temp.head()

Output:

As we can see, the temperature records span the years 1743 to 2013 and include a few empty rows that we must remove.

Next, we will rename the columns.

dataframe_temp = dataframe_temp.rename(index=str, columns={"year": "Year", "country": 'Area'})
dataframe_temp.head()

Output:

#merging the temperature dataframe with the main dataframe
dataframe_main = pd.merge(dataframe_main,dataframe_temp, on=['Area','Year'])
dataframe_main.head()

Output:

The shape of the main dataframe changes as we merge the other dataframes into it.

Output:

Looking for the null values in the main dataframe.

Output:

Unfortunately, we have six null values in the average_rain_fall_mm_per_year column.

Output:

We need to drop the above rows.

dataframe_main=dataframe_main.dropna()
dataframe_main.isnull().sum()

Output:

Great, no empty values!

3. Data Exploration

After all the merging, dataframe_main is the final dataframe. Now we need to explore it.

#Grouping on the basis of Item
dataframe_main.groupby('Item').count()

Output:

The value ranges differ greatly from column to column, so we will scale them later.

Output:

The dataframe covers 101 nations. Ranking them by total yield output, the top 10 are:

Output:

In the dataset, India has the greatest yield output.
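The code for this ranking is not shown. One way to produce it, sketched here on a toy dataframe with made-up values, is to sum the yield per country and keep the largest totals. The helper name top_areas_by_yield is hypothetical.

```python
import pandas as pd

def top_areas_by_yield(df, n=10):
    """Sum hg/ha_yield per country and return the n largest totals."""
    return df.groupby("Area")["hg/ha_yield"].sum().nlargest(n)

# Toy data with made-up yields.
toy = pd.DataFrame({
    "Area": ["India", "India", "Brazil", "Mexico"],
    "Item": ["Potatoes", "Cassava", "Potatoes", "Maize"],
    "hg/ha_yield": [200000, 150000, 180000, 30000],
})

top_areas = top_areas_by_yield(toy, n=2)
print(top_areas)
```

On the real data, the call would be top_areas_by_yield(dataframe_main) with the default n=10.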

Grouping by both item and nation gives:

dataframe_main.groupby(['Item','Area'],sort=True)['hg/ha_yield'].sum().nlargest(10)

Output:

India has the greatest cassava and potato output. Potatoes appear to be the dominant crop in the dataset, ranking highest in four nations.

The final dataframe includes data for 101 nations spanning 23 years, from 1990 to 2013.

Next, we look at the relationships between the columns of the dataframe. Displaying the correlation matrix as a heatmap is a handy way to quickly check the correlations among columns.

import sklearn
import seaborn as sns
import matplotlib.pyplot as plt

correlation_data=dataframe_main.select_dtypes(include=[np.number]).corr()

mask = np.zeros_like(correlation_data, dtype=bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(11, 9))

# Use a diverging colour map.
cmap = "vlag"

# Create the heatmap with the appropriate aspect ratio and a mask.
sns.heatmap(correlation_data, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});

Output:

The above correlation map shows no strong correlation between any of the dataframe's columns.

4. Data Preprocessing

Data preprocessing is a method for transforming raw data into a clean dataset. Whenever data is gathered from various sources, it arrives in a raw form that is not suitable for analysis.

Output:

5. Encoding Categorical Variables

The dataframe has two categorical columns, which are variables that hold label values rather than numeric values. The set of possible values is usually limited to a fixed set, as it is here for the items and nations.

Several machine learning algorithms are unable to act directly on label data. They demand that all input and output variables be numbers.

Thus, categorical data must be transformed into numerical data. One-hot encoding transforms categorical information into a format that can be given to ML algorithms to help them perform better at prediction. These two columns will be converted to a one-hot numeric array for that purpose.

Each category value is represented numerically: the encoding creates one binary column per category, and the results are returned as a matrix.
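As a toy illustration of what one-hot encoding produces, the sketch below encodes a three-row dataframe with made-up values using pd.get_dummies, with the same column and prefix arguments as the article's code.

```python
import pandas as pd

# Toy data with made-up yields.
toy = pd.DataFrame({
    "Area": ["India", "Brazil", "India"],
    "Item": ["Potatoes", "Maize", "Maize"],
    "hg/ha_yield": [200000, 30000, 25000],
})

# One binary column is created per distinct country and per distinct item.
encoded = pd.get_dummies(toy, columns=["Area", "Item"], prefix=["Country", "Item"])
print(list(encoded.columns))
```

Each row has exactly one active Country_ column and one active Item_ column. With 101 nations and 10 crops, the real dataframe grows by well over a hundred columns.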

from sklearn.preprocessing import OneHotEncoder

dataframe_main_onehot = pd.get_dummies(dataframe_main, columns=['Area',"Item"], prefix = ['Country',"Item"])
features=dataframe_main_onehot.loc[:, dataframe_main_onehot.columns != 'hg/ha_yield']
label=dataframe_main['hg/ha_yield']
features.head()

Output:

#Dropping the year column
features = features.drop(['Year'], axis=1)

features.info()

Output:

6. Scaling Features

The dataset shown above has features with a wide range of magnitudes, units, and ranges. In distance computations, features with large magnitudes will weigh far more than features with small magnitudes.

To reduce this effect, we must bring all features to the same scale. Scaling achieves this.
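Min-max scaling maps each feature to the range 0 to 1 by computing (x - min) / (max - min) per column. The sketch below applies that formula by hand to a small made-up array and checks it against scikit-learn's MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy features with very different magnitudes (made-up values).
X = np.array([[100.0, 10.0],
              [300.0, 20.0],
              [500.0, 40.0]])

# Manual min-max scaling, column by column.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```

After scaling, both columns lie between 0 and 1, so neither dominates purely because of its units.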

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
features=scaler.fit_transform(features)

The resultant array will be like this after scaling all values in features and removing the year column:

Output:

7. Training and Test Data

The training dataset and test dataset will be created from the original dataset. The split is usually unequal because the model needs as many data points as possible during training. For train/test, the typical percentages are 70/30 or 80/20.

The training dataset is the data used to teach a machine learning algorithm to learn and make accurate predictions. Here, 80% of the dataset is used for training.

The test dataset, in turn, is used to evaluate how well the ML algorithm learned from the training dataset. Because the ML algorithm would already "know" the expected output, it would be pointless to test the method by simply reusing the training dataset. The test dataset makes up the remaining 20% of the dataset.

from sklearn.model_selection import train_test_split
train_data, test_data, train_labels, test_labels = train_test_split(features, label, test_size=0.2, random_state=42)
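A quick way to confirm what a given test_size does is to check the shapes of the resulting arrays. The sketch below uses a made-up array of 100 rows with the same test_size and random_state as the call above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With 100 rows, 80 go to training and 20 to testing. Fixing random_state makes the split reproducible between runs.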

8. Comparing and Selecting models

from sklearn.metrics import r2_score
def compare_models(model):
    model_name = model.__class__.__name__
    fit=model.fit(train_data,train_labels)
    y_pred=fit.predict(test_data)
    r2=r2_score(test_labels,y_pred)
    return([model_name,r2])

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import svm
from sklearn.tree import DecisionTreeRegressor

models = [
    GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0),
    RandomForestRegressor(n_estimators=200, max_depth=3, random_state=0),
    svm.SVR(),
    DecisionTreeRegressor()
]

model_train=list(map(compare_models,models))

print(*model_train, sep = "\n")

Output:

The models are evaluated with the R2 (coefficient of determination) regression score, which indicates the proportion of the variance in crop yield that the regression model explains. The R2 score shows how closely the data points fit a curve or line.

R2 is a statistical measure of how closely a regression line matches the data it is fitted to. It typically ranges from 0 to 1, although it can be negative on test data when a model predicts worse than simply using the mean. If it is 1, the model explains 100% of the variance in the data; if it is 0, the model explains none of it.
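The score can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this with NumPy on made-up numbers; the function name r2_manual is hypothetical.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r2_manual(y_true, y_true)             # predictions match exactly
mean_only = r2_manual(y_true, [6.0] * 4)        # always predicts the mean
bad = r2_manual(y_true, [9.0, 7.0, 5.0, 3.0])   # worse than predicting the mean
print(perfect, mean_only, bad)
```

A perfect fit scores 1, always predicting the mean scores 0, and predictions worse than the mean give a negative score, which is what a poorly suited model can produce on test data.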

According to the findings shown above, Gradient Boosting Regressor comes in second place with an R2 score of 96%, followed by Decision Tree Regressor.

We can also consider adjusted R2. Like R2, it shows how well terms fit a curve or line, but it also accounts for the number of terms in the model. Adjusted R2 decreases when uninformative variables are added to a model and increases when useful variables are added. Adjusted R2 is never greater than R2.
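The article does not show the calculation. A minimal sketch using the standard formula, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), is given below, where n is the number of samples and p the number of features. The function name and the input numbers are hypothetical and are not results from this dataset.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjust R2 for the number of features used by the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical numbers: R2 of 0.9 on 1000 samples with 100 features.
adj = adjusted_r2(0.9, n_samples=1000, n_features=100)
print(round(adj, 4))
```

With these inputs the adjusted score is about 0.8889, slightly below the plain R2 of 0.9. The gap widens as the number of features grows relative to the number of samples, which matters here because one-hot encoding adds many columns.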

dataframe_main_onehot = dataframe_main_onehot.drop(['Year'], axis=1)
dataframe_main_onehot.head()

Output:

# converting test data to columns from the dataframe and omitting the values for "hg/ha yield," which the machine learning model should be predicting
dataframe_test=pd.DataFrame(test_data,columns=dataframe_main_onehot.loc[:, dataframe_main_onehot.columns != 'hg/ha_yield'].columns)

# utilizing the stack function to pivot the columns of the current dataframe and return a reshaped dataframe

cntry=dataframe_test[[col for col in dataframe_test.columns if 'Country' in col]].stack()[dataframe_test[[col for col in dataframe_test.columns if 'Country' in col]].stack()>0]
cntrylist=list(pd.DataFrame(cntry).index.get_level_values(1))
countries=[i.split("_")[1] for i in cntrylist]
itm=dataframe_test[[col for col in dataframe_test.columns if 'Item' in col]].stack()[dataframe_test[[col for col in dataframe_test.columns if 'Item' in col]].stack()>0]
itmlist=list(pd.DataFrame(itm).index.get_level_values(1))
items=[i.split("_")[1] for i in itmlist]


dataframe_test.head()

Output:

dataframe_test.drop([col for col in dataframe_test.columns if 'Item' in col],axis=1,inplace=True)
dataframe_test.drop([col for col in dataframe_test.columns if 'Country' in col],axis=1,inplace=True)
dataframe_test.head()

Output:

dataframe_test['Country']=countries
dataframe_test['Item']=items
dataframe_test.head()

Output:

from sklearn.tree import DecisionTreeRegressor
clf=DecisionTreeRegressor()
model=clf.fit(train_data,train_labels)

dataframe_test["yield_predicted"]= model.predict(test_data)
dataframe_test["yield_actual"]=pd.DataFrame(test_labels)["hg/ha_yield"].tolist()
test_group=dataframe_test.groupby("Item")

# So let's compare the model's actual values to its predictions.

fig, ax = plt.subplots()

ax.scatter(dataframe_test["yield_actual"], dataframe_test["yield_predicted"],edgecolors=(0, 0, 0))

ax.set_xlabel('Actual')
ax.set_ylabel('Predicted')
ax.set_title("Actual vs Predicted")
plt.show()

Output:

Model Results & Conclusions

varimp= {'imp':model.feature_importances_,'names':dataframe_main_onehot.columns[dataframe_main_onehot.columns!="hg/ha_yield"]}

a4_dims = (8.27,16.7)
fig, ax = plt.subplots(figsize=a4_dims)
df=pd.DataFrame.from_dict(varimp)
df.sort_values(ascending=False,by=["imp"],inplace=True)
df=df.dropna()
sns.barplot(x="imp",y="names",palette="vlag",data=df,orient="h",ax=ax);

Output:

Keeping only the 7 most important features according to the model:

#7 most important factors that affect crops
a4_dims = (16.7, 8.27)

fig, ax = plt.subplots(figsize=a4_dims)
df=pd.DataFrame.from_dict(varimp)
df.sort_values(ascending=False,by=["imp"],inplace=True)
df=df.dropna()
df=df.nlargest(7, 'imp')
sns.barplot(x="imp",y="names",palette="vlag",data=df,orient="h",ax=ax);

Output:

#Boxplot that shows yield for each item
a4_dims = (16.7, 8.27)

fig, ax = plt.subplots(figsize=a4_dims)
sns.boxplot(x="Item",y="hg/ha_yield",palette="vlag",data=dataframe_main,ax=ax);

Output:

Whether the crop is potatoes carries the most weight in the model's decision-making, as potatoes are the highest-yielding crop in the dataset. Cassava is also among the top features, the influence of pesticides is the third most important feature, and sweet potatoes likewise rank among the most important features.

Whether the crop is grown in India is also important, which makes sense given that India has the greatest yield output in the dataset. Rainfall and temperature follow. All of these variables had a significant effect on the model's predicted crop yield, which supports the initial assumption that they matter.

In conclusion, crop yield prediction using machine learning has the potential to revolutionize the agriculture industry. By providing more accurate predictions, improving decision-making, increasing efficiency, and enhancing sustainability, this technology can help farmers to achieve better yields and more profitable businesses. While there are some challenges to using machine learning for crop yield prediction, the benefits are clear, and we can expect to see continued advancements in this field in the years to come.
