Last semester I worked with a group of students from Cornell on a project to see if we could use Machine Learning to predict taxi demand. Despite having little prior ML experience, the students successfully used Apache PredictionIO and a variety of algorithms to predict taxi demand in New York fairly accurately.
PredictionIO is an open source Machine Learning framework from Salesforce that provides an out-of-the-box experience with everything you need to start doing ML. Last year we contributed the project to the Apache Software Foundation in order to build a larger ecosystem around it. Along with the many templates, there is also the Dreamhouse sample application, which uses PredictionIO with Salesforce data to recommend houses to prospective buyers.
The foundation for the project the Cornell students worked on is a data pipelines sample application called Koober. I haven’t written up much about Koober yet but here is the basic gist: Koober is a reference application (think Java Pet Store) for a modern data pipeline system that uses the ride sharing use case. Koober brings together a real-time web UI, Kafka for event streaming, Flink for data analytics, and Spark Streaming & Cassandra for data storage. Stay tuned for more information about this sample app.
Now wouldn’t it be cool to add Machine Learning to that pipeline? Of course! And I had some really bright students to help do exactly that. But right away there were some challenges. To train a model we needed a real, historical taxi usage dataset, and Koober, being a sample application, didn’t have any real data. Luckily we found a public dataset for exactly this!
With the historical taxi usage in hand, we now needed to determine which “features” mattered for predicting demand. Location, date / time, and weather seemed like the most important ones. The public dataset had location and date / time, but lacked weather. The students found a way to get historical weather data and decided how to break down location and date / time into actual ML features. Most ML algorithms can’t just figure out why a date / time matters; it needs to be broken down into quantifiable and correlatable data points. For instance, what about a date / time matters? Day of the week, time of day, etc. But for time of day, what granularity matters? By second, minute, hour? The students figured out how to break all of this down into meaningful features.
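To make the date / time breakdown concrete, here is a minimal Python sketch of turning one timestamp into numeric features. The function name, the feature names, and the half-hour granularity are illustrative assumptions, not the features the students actually chose.

```python
from datetime import datetime


def datetime_features(ts: datetime) -> dict:
    """Break a timestamp into numeric features a model can correlate with demand.

    The feature set here is hypothetical; it only illustrates the idea.
    """
    return {
        "day_of_week": ts.weekday(),                      # 0 = Monday ... 6 = Sunday
        "hour_of_day": ts.hour,                           # 0..23
        "half_hour_slot": ts.hour * 2 + ts.minute // 30,  # 0..47, assumed granularity
        "is_weekend": int(ts.weekday() >= 5),             # 1 on Saturday or Sunday
        "month": ts.month,                                # 1..12
    }


# A Friday at 5:45 PM falls in half-hour slot 35 (17 * 2 + 1).
example = datetime_features(datetime(2016, 6, 3, 17, 45))
```

Choosing the granularity is the real design decision: a coarser slot gives each bucket more examples to learn from, while a finer slot captures sharper peaks in demand.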
With PredictionIO, the students applied a variety of algorithms to the dataset and its features to analyze the data and make predictions from it. Here is a short demonstration of what they built:
Also check out the slides from their final presentation:
Now that you’ve seen the great stuff the students created, let’s meet some of the students and hear what they learned with this project.
Annie Cheng — Cornell Class of 2018 — CS Major — Interested in Software Engineering and Product Management
This past semester, I designed and developed the frontend of the Koober website and the dashboard interface, as well as integrated the application with a location API platform called Mapbox.
One interesting challenge we faced early on was how we could design our application to satisfy our target users’ two main information needs: data analysis and demand prediction. As a team, we decided to build two dashboards, one for each information need. The Analysis Dashboard would display past taxi data through informative demand heat maps produced by several trained models for easy evaluation against the actual demand. Meanwhile, the Prediction Dashboard would allow users to visualize future taxi demand by adjusting time, location, and weather. Based on these considerations and the open-source nature of our project, I created an About page highlighting our mission and GitHub repo, a Documentation page explaining our data and models, and a Dashboard page allowing users to analyze and predict taxi demand.
As our website functions primarily as a data visualization tool to provide users with knowledge, I also focused on incorporating user interactivity with the data through Mapbox’s heat map integration on the dashboards. I took the user-specified inputs and passed them as parameters to the backend machine learning models, which then returned a series of demand values that each corresponded to a particular location. To turn this output into interpretable results, I then displayed the data points on a live map in the form of circular dots, with the size indicating the covered region and the color signifying the magnitude of demand. This map feature, displayed on the Analysis and Prediction dashboards, resulted in a more streamlined interface and a better user experience.
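As a rough illustration of the step between model output and map, the sketch below converts per-location demand values into a GeoJSON FeatureCollection, the format map libraries such as Mapbox commonly accept as a data source. The function name, the input tuple layout, and the "demand" property name are assumptions for this example; the actual Koober frontend may structure its data differently.

```python
import json


def demand_to_geojson(predictions):
    """Convert (lat, lng, demand) tuples into a GeoJSON FeatureCollection.

    The tuple layout and the "demand" property are hypothetical.
    """
    features = [
        {
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": {"demand": demand},
        }
        for lat, lng, demand in predictions
    ]
    return {"type": "FeatureCollection", "features": features}


# Two made-up points in Manhattan with made-up demand values.
sample = [(40.7580, -73.9855, 42.0), (40.7128, -74.0060, 17.5)]
collection = demand_to_geojson(sample)
payload = json.dumps(collection)  # string form that could be sent to the browser
```

A map layer can then style each point from its "demand" property, for example by scaling the circle color with the value.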
Overall, I had an amazing experience collaborating with my teammates and our mentor to develop a working and useful web application from scratch. In particular, I enjoyed learning about the many aspects of contributing to open source projects such as code review conventions. Now, I’m even more excited to actively participate in the open source community!
Brett Clancy — Cornell Class of 2018 — CS Major, B.Sc / M.Eng — Interested in Software Engineering and Financial Technology
I primarily worked on developing the backend of Koober, the PredictionIO engine that analyzed and predicted taxi demand. While we began with an existing PredictionIO template that was similar in purpose to ours, there was a lot that needed to be changed. Our engine needed to receive, format, and aggregate data from the massive dataset of NYC yellow cab trips. Additionally, it needed to analyze and predict demand based on a larger number of input variables, such as time of day, day of the week or month, and weather, in addition to location.
Along with setting up the engine itself, I also added the Random Forest model to the multitude of machine learning algorithms that the engine utilizes. Implemented with the Apache Spark ML library, this algorithm builds a prediction model out of multiple decision trees, each of which is trained on a random subset of the data. These trees all form predictions independently, which are then aggregated to form a single prediction value.
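The idea of training many trees on random subsets and aggregating their predictions can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the Spark ML implementation used in Koober: each "tree" is a depth-1 regression stump on a single feature, and the per-split feature sampling that real random forests also perform is omitted. All names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows):
    """Fit a depth-1 regression tree on (x, y) rows by minimizing squared error."""
    best = None
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # The sample had one distinct x value, so no split is possible.
        overall = mean(y for _, y in rows)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_forest(rows, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def predict(trees, x):
    """Aggregate the independent tree predictions by averaging them."""
    return mean(tree(x) for tree in trees)


# Toy data: (hour of day, demand), quiet before 6 AM and busy afterwards.
rows = [(hour, 10 if hour < 6 else 50) for hour in range(24)]
forest = fit_forest(rows)
night, evening = predict(forest, 2), predict(forest, 18)
```

Averaging is the usual aggregation for regression problems like demand; for classification the trees would vote instead.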
Finally, I presented Koober as part of BOOM (Bits On Our Minds), “the premier annual showcase for Cornell student projects in cutting-edge digital technology”. While Koober was still a work-in-progress at the time, I immensely enjoyed presenting it to everyone there; they were all super excited to hear about our project and what our team was able to do.
Koober was an overall blast to work on. I learned a lot about machine learning and open source development, and was able to do so with an amazingly talented group of individuals. A big thanks to my teammates and James for everything; I wish all of you the best of luck.
Yiting Wang — Cornell Class of 2018 — CS Major — Interested in Software Engineering
For Koober, I worked on both the backend PredictionIO engine and connecting the backend with the frontend using Mapbox.
For the backend, I worked with the others to modify an existing PredictionIO engine framework to work with our training dataset. Specifically, because each timestamp in the training data corresponds to a demand of 1, I aggregated the demand over fixed time ranges (e.g. half an hour), so that the demand for each range is the number of timestamps in the data that fall within it. In addition, I used the existing Apache Spark ML library to include Gradient Boosted Trees as an ML algorithm that can be used for training, and modified the PredictionIO engine so that it can run multiple ML algorithms and return a prediction for each one.
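The time-range aggregation described above can be sketched as follows: round each timestamp down to the start of its bucket and count how many land in each one. This is a minimal stand-alone Python sketch with made-up timestamps; the function names are hypothetical, and the real engine performs this step inside PredictionIO and Spark rather than in plain Python.

```python
from collections import Counter
from datetime import datetime, timedelta

BUCKET = timedelta(minutes=30)  # assumed bucket size, matching the half-hour example


def bucket_start(ts: datetime, bucket: timedelta = BUCKET) -> datetime:
    """Round a timestamp down to the start of its bucket.

    Assumes the bucket length divides a day evenly.
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + ((ts - midnight) // bucket) * bucket


def aggregate_demand(timestamps):
    """Count timestamps per bucket; each timestamp contributes a demand of 1."""
    return Counter(bucket_start(ts) for ts in timestamps)


# Made-up timestamps for illustration only.
timestamps = [
    datetime(2016, 6, 3, 17, 5),
    datetime(2016, 6, 3, 17, 29, 59),
    datetime(2016, 6, 3, 17, 30),
    datetime(2016, 6, 3, 18, 10),
]
demand = aggregate_demand(timestamps)
```

Here the two timestamps before 17:30 share one bucket with a demand of 2, while the other two each fall in their own bucket.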
Since the frontend website and backend PredictionIO engine were developed independently during the semester, they inevitably needed to be brought together as a whole to make the complete Koober project. Therefore, I created the communication pathway between them in the prediction page by collecting the information that users entered in the website, making a call to the backend using the Mapbox API to receive the predicted demand, and finally displaying the demand to the users as heat maps. This communication step is crucial, as it is what connects the question that users want to ask to the answers that our machine-learning algorithms provide.
During the semester, I had an awesome time implementing Koober and seeing it built piece by piece into this final product. I was taking a Machine Learning course concurrently and it was really interesting to be learning the theory behind many Machine Learning algorithms while being able to use a lot of them in this project, allowing us to learn from practice as well. In addition, this is the first open source project I participated in, and it introduced the whole world of open source to me. It was so thrilling to make my very first Pull Request and see it through to being merged. I had great teammates and an awesome mentor, and without any of them, Koober would not be the way it is now.
Huge thanks to Cornell and the amazing students who did such a great job working through a number of challenges but ultimately built something to be proud of!