Predicting Rideshare Fares using Python

The taxi service market has grown rapidly in recent years, and substantial further expansion is predicted. Numerous businesses have emerged to cater to this increased demand for cab trips. Some businesses, however, charge more than others for the same trip, so customers end up overpaying even when the price should be lower. The main goal is to predict trip costs before a taxi is booked, in order to maintain transparency and prevent unfair practices.

Project initiatives:

  • Our project enables users to estimate the cost of a taxi journey by considering various dynamic factors, including the weather, the availability of cabs, cab size, and the distance between two locations.
  • An existing data set is used to build a model that captures key trends.
  • This model is then used to make predictions for future trips.
  • The system has been implemented using a variety of approaches, including machine learning, supervised learning, regression, random forests, and hyperparameter tuning (to improve model accuracy).

Chicago was the first major American city to release detailed ridesharing data from companies such as Lyft, Uber, and Via. The data first became public in April 2019 and covered trips made since November 2018. The trips, drivers, and vehicles datasets can shed light on the pricing strategies used by rideshare companies as well as on passenger behavior.

A few articles cover pricing (Reuters: Uber drivers raise fares) and passenger behavior (Rideshare Data). The Reuters investigation indicated that price hikes for shared rides mostly affect Chicago's low-income neighborhoods. Storybench's study, meanwhile, found that trips typically concentrate around early-evening commuting hours and "nightlife" hours. This is the context in which I am developing machine learning models that forecast ridesharing prices.

The Dataset

The trip data contains the details of each journey, such as the start time, end time, distance traveled, and pickup and dropoff locations. More detailed descriptions of the fields, along with the data source, are available online.

Chicago applies several modifications to the data before release, including suppressing some Census Tracts and rounding times to the nearest 15 minutes. Fares are rounded to the nearest $2.50 and tips to the nearest $1. The modeling data contains more than 7 million rows and consists of trips made in December 2019.
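To make the effect of this rounding concrete, here is a minimal sketch using only the Python standard library. The helper names round_to_step and round_to_quarter_hour and the sample values are hypothetical and chosen purely for illustration; the city's actual rounding procedure may differ in how it handles ties.

```python
from datetime import datetime, timedelta


def round_to_step(value: float, step: float) -> float:
    """Round value to the nearest multiple of step (e.g. 2.50 for fares)."""
    return round(value / step) * step


def round_to_quarter_hour(ts: datetime) -> datetime:
    """Round a timestamp to the nearest 15-minute mark."""
    step = timedelta(minutes=15)
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Number of 15-minute steps since midnight, rounded to the nearest integer
    steps = round((ts - midnight) / step)
    return midnight + steps * step


# Hypothetical raw values before publication
fare = round_to_step(13.40, 2.50)   # fare rounded to the nearest $2.50
tip = round_to_step(2.30, 1.00)     # tip rounded to the nearest $1
start = round_to_quarter_hour(datetime(2019, 12, 1, 8, 52, 10))
```

Because of this rounding, the fare is effectively a coarse, stepped target rather than a continuous one, which is worth keeping in mind when interpreting model errors.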

[Table: summary statistics (count, standard deviation, and percentiles) for Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, hours, Tip, Additional Charges, Trip Total, Trips Pooled, Pickup Centroid Latitude, Pickup Centroid Longitude, Dropoff Centroid Latitude, and Dropoff Centroid Longitude]

Weather Data

NOAA's National Centers for Environmental Information is the source of the Chicago weather data for December 2019, which includes precipitation, temperature, hourly visibility, hourly wind direction, and hourly wind speed. For simplicity, all weather data for Chicago is taken from a single station located at O'Hare International Airport.

Data Wrangling

Since the weather observations are irregularly spaced in time, they must be resampled to an evenly spaced 15-minute time series before being joined with the trip data. Here is some code that will space the data equally.
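As a rough illustration of this resampling step, the following pandas sketch bins a few irregular readings into 15-minute intervals and forward-fills the gaps. The column names and sample readings are assumptions made for the example, and forward-filling is only one possible way of filling missing intervals.

```python
import pandas as pd

# Hypothetical irregularly spaced weather readings
weather = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2019-12-01 00:07", "2019-12-01 00:51", "2019-12-01 01:12"]
        ),
        "temp": [39.0, 39.0, 38.0],
        "precipitation": [0.0, 0.0, 0.1],
    }
).set_index("date")

# Average readings within each 15-minute bin, then carry the last known
# value forward into bins that had no observation
weather_15min = weather.resample("15min").mean().ffill()
```

The result has one row per 15-minute interval, which matches the granularity of the rounded trip timestamps.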

The start and end times of each trip were read into RStudio as factors, with AM and PM times expressed in a 12-hour format. These must be converted into datetimes with a local timezone and a 24-hour format. For each trip, we additionally defined variables for the ride date, hour, and day of the week.
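A comparable conversion can be sketched in Python with pandas. The column name "Trip Start Timestamp", the timestamp format string, and the derived column names are assumptions for illustration and should be checked against the actual export.

```python
import pandas as pd

# Hypothetical raw timestamps in 12-hour format, as they might appear in the export
trips = pd.DataFrame(
    {"Trip Start Timestamp": ["12/01/2019 11:45:00 PM", "12/02/2019 07:15:00 AM"]}
)

# Parse the 12-hour strings (%I = 12-hour clock, %p = AM/PM)
start = pd.to_datetime(
    trips["Trip Start Timestamp"], format="%m/%d/%Y %I:%M:%S %p"
)

# Attach the local timezone and derive the calendar variables
trips["start_time"] = start.dt.tz_localize("America/Chicago")
trips["ride_date"] = trips["start_time"].dt.date
trips["ride_hour"] = trips["start_time"].dt.hour          # 0-23
trips["ride_weekday"] = trips["start_time"].dt.day_name()
```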

Source Code Snippet

Output: After filling in missing values, the weather data looks like this:

date                 temp  precipitation  HourlyVisibility  HourlyWindspeed
2019-12-01 00:15:00  39.0  0.0            4.97              8.0
2019-12-01 00:30:00  39.0  0.0            4.97              8.0
2019-12-01 00:45:00  39.0  0.0            4.97              8.0
2019-12-01 01:00:00  39.0  0.0            7.00              7.0
2019-12-01 01:15:00  39.0  0.0            7.00              8.0
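Once the weather data is on a 15-minute grid, it can be joined to the trips on the rounded start time. The sketch below assumes a trips table with hypothetical start_time and trip_miles columns and uses a left join so that every trip is kept.

```python
import pandas as pd

# Weather rows on a 15-minute grid, matching the sample output
weather = pd.DataFrame(
    {
        "date": pd.date_range("2019-12-01 00:15", periods=5, freq="15min"),
        "temp": [39.0] * 5,
        "precipitation": [0.0] * 5,
        "HourlyVisibility": [4.97, 4.97, 4.97, 7.00, 7.00],
        "HourlyWindspeed": [8.0, 8.0, 8.0, 7.0, 8.0],
    }
)

# Hypothetical trips whose start times are already rounded to 15 minutes
trips = pd.DataFrame(
    {
        "start_time": pd.to_datetime(
            ["2019-12-01 00:30", "2019-12-01 01:00", "2019-12-01 01:15"]
        ),
        "trip_miles": [2.1, 5.4, 0.8],
    }
)

# Left join: keep every trip and attach the weather for its time slot
trips_weather = trips.merge(
    weather, how="left", left_on="start_time", right_on="date"
)
```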

Visualize

To make sure there are no errors or gaps in the data, we prefer to start by visualizing the complete dataset. Three R packages, skimr, visdat, and inspectdf, are excellent for this. All three include a wide range of tools for displaying your data and the distributions of the underlying variables.
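For readers working in Python, a rough equivalent of this first look can be put together with pandas alone. This is only a sketch on a small made-up table with hypothetical column names, not a replacement for the packages named above.

```python
import numpy as np
import pandas as pd

# Small hypothetical trips table with some missing values
trips = pd.DataFrame(
    {
        "trip_miles": [2.1, 5.4, np.nan, 0.8],
        "fare": [7.5, 15.0, 10.0, 5.0],
        "pickup_area": [8.0, np.nan, np.nan, 32.0],
    }
)

# Per-column overview: data type, number and share of missing values
overview = pd.DataFrame(
    {
        "dtype": trips.dtypes.astype(str),
        "n_missing": trips.isna().sum(),
        "pct_missing": trips.isna().mean() * 100,
    }
)

# Distribution summary (count, mean, std, quartiles) for numeric columns
summary = trips.describe()
```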

Source Code Snippet

Output:


Source Code Snippet

Output:


Visualize the trips by hour of the day

We want to see trips across two levels (day of the week and time of day). The picture below displays the number of trips taken per hour across the days of the week.

Specifically, the rides.chicago_city data frame is piped (%>%) to the ggplot2 functions to create histograms, which are then faceted by day of the week to show the rides-per-hour breakdown for each day.
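The counts behind such a faceted histogram can be sketched in pandas as a weekday-by-hour table. The sample timestamps and the column names start_time, weekday, and hour are assumptions for illustration.

```python
import pandas as pd

# Hypothetical trip start times
start_times = pd.to_datetime(
    [
        "2019-12-06 18:00", "2019-12-06 18:15", "2019-12-06 23:30",
        "2019-12-07 23:00", "2019-12-08 08:45",
    ]
)
rides = pd.DataFrame({"start_time": start_times})
rides["weekday"] = rides["start_time"].dt.day_name()
rides["hour"] = rides["start_time"].dt.hour

# Rows: day of the week, columns: hour of the day, values: number of rides
rides_per_hour = pd.crosstab(rides["weekday"], rides["hour"])
```

Each row of rides_per_hour corresponds to one facet of the plot, so plotting one bar chart per row would give a similar picture.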

Source Code Snippet

Output:


The plot below shows tipping at different trip durations. We can sample our data using the dplyr::sample_frac() function to get a more manageable data set. We group the data by the two variables of interest (tipper and ride_category1), then compute the mean trip duration (mean_tour_mins1) for a more interpretable visualization across these groups.
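In pandas, the same sample-then-summarize pattern might look like the sketch below. The grouping and result names tipper, ride_category1, and mean_tour_mins1 come from the text; the trip_mins column, the category labels, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical rides with a tipping flag, a ride category, and a duration
rides = pd.DataFrame(
    {
        "tipper": ["tip", "no tip", "no tip", "tip", "no tip", "no tip"],
        "ride_category1": ["shared", "shared", "solo", "solo", "solo", "shared"],
        "trip_mins": [12.0, 20.0, 8.0, 15.0, 10.0, 30.0],
    }
)

# Take a reproducible 50% random sample (analogous to dplyr::sample_frac)
sampled = rides.sample(frac=0.5, random_state=42)

# Mean trip duration per (tipper, ride_category1) group
mean_mins = (
    sampled.groupby(["tipper", "ride_category1"])["trip_mins"]
    .mean()
    .rename("mean_tour_mins1")
    .reset_index()
)
```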

Source Code Snippet

Output:


Tips are another source of income for drivers, so encouraging passengers to tip benefits them. Because tipping is less common than not tipping, the factors that influence tipping behavior are a natural point of interest.

ML Models

We evaluate three well-known tree-based models: Random Forest, Gradient Boosting Machine (GBM), and XGBoost. Below are some code snippets for each model's setup, along with a brief overview of each one.

1. Random Forest

A random forest is an ensemble of decision trees. Each decision tree is trained on a random sample of the dataset. The forest as a whole then produces a forecast by averaging the predictions of the individual trees.
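A minimal scikit-learn sketch of this idea is shown below. It uses synthetic data in place of the Chicago trips, so the feature names, the fare formula, and the hyperparameter values are all assumptions for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trip data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 500
trip_miles = rng.uniform(0.5, 20.0, n)
trip_seconds = trip_miles * 180 + rng.uniform(0, 120, n)
temp = rng.uniform(20, 45, n)  # unrelated to the fare in this toy setup
fare = 2.0 + 1.2 * trip_miles + 0.005 * trip_seconds + rng.normal(0, 1.0, n)

X = np.column_stack([trip_miles, trip_seconds, temp])
X_train, X_test, y_train, y_test = train_test_split(
    X, fare, test_size=0.25, random_state=0
)

# Each tree is fit on a bootstrap sample of the training rows
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# The forest prediction is the average of the individual tree predictions
tree_average = np.mean([tree.predict(X_test) for tree in rf.estimators_], axis=0)
rf_r2 = r2_score(y_test, rf.predict(X_test))
```

After fitting, rf.feature_importances_ gives one importance score per input column, which is one way to compare the contribution of distance, duration, and weather variables.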

Source Code Snippet

2. Gradient Boosting Machines

GBM is another ensemble technique built on decision trees. It adds trees sequentially, with each new tree trying to improve the performance of the ensemble built so far.
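The sequential nature of boosting can be seen by tracking the test error as trees are added. The sketch below uses scikit-learn's GradientBoostingRegressor on synthetic data; the data, feature names, and hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the trip data (NOT the real dataset)
rng = np.random.default_rng(1)
n = 500
trip_miles = rng.uniform(0.5, 20.0, n)
trip_seconds = trip_miles * 180 + rng.uniform(0, 120, n)
fare = 2.0 + 1.2 * trip_miles + 0.005 * trip_seconds + rng.normal(0, 1.0, n)

X = np.column_stack([trip_miles, trip_seconds])
X_train, X_test, y_train, y_test = train_test_split(
    X, fare, test_size=0.25, random_state=0
)

gbm = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after each added tree,
# so this list traces how the test error evolves as the ensemble grows
stage_mse = [
    mean_squared_error(y_test, pred) for pred in gbm.staged_predict(X_test)
]
gbm_r2 = r2_score(y_test, gbm.predict(X_test))
```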

Source Code Snippet

3. XGBoost

XGBoost is another ensemble approach that uses a gradient boosting framework based on decision trees. Because XGBoost exposes so many parameters, it is crucial to tune the hyperparameters in order to select the best configuration.
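One common way to tune such hyperparameters is a cross-validated grid search. The sketch below deliberately uses scikit-learn's GradientBoostingRegressor as a stand-in so that it runs without the xgboost package; xgboost's XGBRegressor follows the scikit-learn estimator interface and could be substituted. The synthetic data and the grid values are assumptions, not the configuration used for the results reported here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the trip data (NOT the real dataset)
rng = np.random.default_rng(2)
n = 300
trip_miles = rng.uniform(0.5, 20.0, n)
trip_seconds = trip_miles * 180 + rng.uniform(0, 120, n)
X = np.column_stack([trip_miles, trip_seconds])
fare = 2.0 + 1.2 * trip_miles + 0.005 * trip_seconds + rng.normal(0, 1.0, n)

# Hypothetical search grid over a few common boosting parameters
param_grid = {
    "n_estimators": [50, 150],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.2],
}

# Stand-in estimator; an XGBRegressor could be passed here instead
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="r2",
    cv=3,
)
search.fit(X, fare)

best_params = search.best_params_   # best combination found on the grid
best_cv_r2 = search.best_score_     # mean cross-validated R-squared for it
```

On a dataset with millions of rows, a randomized search or tuning on a sample is usually more practical than an exhaustive grid.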

Source Code Snippet

Results

These tree-based models have strong predictive abilities, as shown by R-squared values above 90% on the test datasets. It should be no surprise that trip miles and trip seconds are the two most important features. The weather-related variables, by contrast, contributed little. Using temperature and precipitation data as-is, without transformations such as accounting for changes in precipitation over time, may have limited the predictive value of these variables.
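As one possible example of such a transformation, the sketch below derives two time-aware precipitation features with pandas: the total over the preceding hour and the change since the previous interval. The feature names and sample values are hypothetical, and these features were not part of the models reported here.

```python
import pandas as pd

# Hypothetical 15-minute precipitation series
weather = pd.DataFrame(
    {"precipitation": [0.0, 0.1, 0.2, 0.0, 0.0, 0.3]},
    index=pd.date_range("2019-12-01 00:00", periods=6, freq="15min"),
)

# Total precipitation over the trailing 60 minutes (current interval included)
weather["precip_last_hour"] = weather["precipitation"].rolling("60min").sum()

# Change in precipitation relative to the previous 15-minute interval
weather["precip_change"] = weather["precipitation"].diff().fillna(0.0)
```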

Model          R2
Random forest  93.7%
GBM            93.6%
XGB            91%
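For reference, R-squared compares the model's squared errors with those of a baseline that always predicts the mean fare: R2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on a few made-up fares; the function name r_squared and the values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical actual and predicted fares
actual = [10.0, 12.5, 15.0, 20.0]
predicted = [10.0, 12.5, 17.5, 20.0]
r2 = r_squared(actual, predicted)
```

A value of 1 means perfect predictions, while 0 means the model does no better than predicting the mean fare for every trip.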

When a tree from the Random Forest model is visualized, trip miles stands out as the most important feature.


Next steps

What do we notice?

Rideshare trips typically occur during "nightlife" hours and early morning commuting times. Unsurprisingly, Fridays and Saturdays see a particularly large increase during "nightlife" hours, whereas Sunday night sees a marked decrease.

Furthermore, behavioral differences affect how engaged passengers are with the product and with their drivers. Tipping is one of those behaviors. Overall, tipping is uncommon, and the time of day affects a passenger's inclination to tip more than the length of the trip does. Longer trips frequently occur early in the week, which raises the possibility that passengers are making a first trip of the week.

Thanks to these visualizations, we identified certain trends and connections between time, frequency, and behavior in the Chicago ridesharing data. The next step may be a static report, a PowerPoint presentation, or a PDF. In a perfect world, we could develop an intervention, plan an experiment, and create a dashboard displaying ongoing research findings and real-time data.

Conclusion

Tree-based machine learning models were tested and evaluated to determine how well they can forecast ridesharing prices. Even though these models have excellent forecasting abilities, further gains could be made by transforming the weather-related variables and using more precise location data.