How Neural Networks Power Robots at Starship

Tanel Pärnamaa

Starship is building a fleet of robots to deliver packages locally on demand. To achieve this, the robots must be safe, polite and quick. But how do you get there with limited computational resources and without expensive sensors such as LIDAR? This is the engineering reality you need to tackle unless you live in a universe where customers happily pay $100 for a delivery.

The robots start by sensing the world with radars, a multitude of cameras and ultrasonic sensors.

However, the challenge is that most of this knowledge is low-level and non-semantic. For example, a robot may sense that an object is ten meters away, yet without knowing the object category, it’s difficult to make safe driving decisions.

Machine learning through neural networks is surprisingly useful for converting this unstructured, low-level data into higher-level information.

Starship robots mostly drive on sidewalks and cross streets when they need to. This poses a different set of challenges compared to self-driving cars. Traffic on roads is more structured and predictable: cars move along lanes and don't change direction too often, whereas humans frequently stop abruptly, meander, can be accompanied by a dog on a leash, and don't signal their intentions with turn signals.

To understand the surrounding environment in real time, a central component of the robot's software is the object detection module — a program that takes images as input and returns a list of object boxes.
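From the rest of the system's point of view, the module boils down to an interface roughly like the one below. This is only an illustrative sketch; the class and function names are made up for this post, not Starship's actual code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class DetectedObject:
    """One detection: a bounding box in image coordinates, a class and a confidence."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str          # e.g. "pedestrian", "car", "cyclist"
    confidence: float   # how sure the model is that the box contains the object

def detect_objects(image: np.ndarray) -> List[DetectedObject]:
    """Take an H x W x 3 array of pixel intensities, return the objects found in it."""
    raise NotImplementedError  # the interesting part: how do you write this body?
```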

That’s all very well, but how do you write such a program?

An image is a large three-dimensional array consisting of a myriad of numbers representing pixel intensities. These values change significantly when the image is taken at night instead of daytime; when the object’s color, scale or position changes, or when the object itself is truncated or occluded.

Left — what the human sees. Right — what the computer sees.
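To make that concrete, here is roughly what "what the computer sees" amounts to, using a hypothetical frame size rather than the robots' actual camera resolution:

```python
import numpy as np

# A 480 x 640 RGB frame: three dimensions, close to a million numbers.
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
print(image.shape)   # (480, 640, 3)
print(image[0, 0])   # intensities of the top-left pixel, e.g. [ 23 187  45]

# The same object at night, shifted, scaled or partially occluded produces a
# very different set of numbers, which is exactly what makes detection hard.
night_like = (image * 0.3).astype(np.uint8)
```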

For some complex problems, teaching is more natural than programming.

In the robot software, we have a set of trainable units, mostly neural networks, where the code is written by the model itself. The program is represented by a set of weights.

At first, these numbers are randomly initialized, and the program's output is random as well. The engineers present the model with examples of what they would like it to predict and ask the network to do better the next time it sees a similar input. By iteratively changing the weights, the optimization algorithm searches for programs that predict bounding boxes more and more accurately.

Evolution of programs found by the optimization procedure.
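A minimal, self-contained sketch of that loop, assuming PyTorch and a toy model standing in for the real detector:

```python
import torch
from torch import nn

# The "program" is the set of weights inside `model`; they start out random.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, 4))  # toy box regressor
loss_fn = nn.SmoothL1Loss()      # a common choice for bounding-box regression
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(images: torch.Tensor, target_boxes: torch.Tensor) -> float:
    """One iteration: predict boxes, measure the error, nudge the weights."""
    predicted_boxes = model(images)               # what the current program outputs
    loss = loss_fn(predicted_boxes, target_boxes)
    optimizer.zero_grad()
    loss.backward()    # in which direction should each weight move to reduce the error?
    optimizer.step()   # move the weights a small step in that direction
    return loss.item()

# Repeating this over many labeled examples searches for programs that predict
# bounding boxes more and more accurately.
```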

However, one needs to think deeply about the examples that are used to train the model.

  • Should the model be penalized or rewarded when it detects a car in a window reflection?

Situations like this have actually happened whilst building the object detection module in our robots.

Neural network detects objects in reflections and from posters. A bug or a feature?

When teaching a machine, big data alone is not enough. The data collected must be rich and varied. For example, if we only annotated uniformly sampled images, the dataset would contain plenty of pedestrians and cars, yet the model would lack the examples of motorcycles or skaters it needs to reliably detect those categories.
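One simple way to make a dataset richer in practice is to sample images containing rare classes more often than uniform sampling would; here is a sketch of that idea, with made-up class names:

```python
import random
from collections import Counter

def sampling_weights(annotations):
    """annotations[i] is the set of class names present in image i.
    Images containing rare classes get a higher chance of being picked."""
    class_counts = Counter(cls for classes in annotations for cls in classes)
    weights = []
    for classes in annotations:
        if not classes:
            weights.append(1.0)                      # background-only image
        else:
            rarest = min(class_counts[cls] for cls in classes)
            weights.append(1.0 / rarest)             # rarer content => higher weight
    return weights

annotations = [{"pedestrian", "car"}, {"car"}, {"motorcycle"}, {"skater", "pedestrian"}]
weights = sampling_weights(annotations)
batch_indices = random.choices(range(len(annotations)), weights=weights, k=8)
```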

The team needs to specifically mine for hard examples and rare cases, otherwise the model stops improving. Starship operates in several different countries, and the varying weather conditions enrich the set of examples. Many people were surprised that Starship delivery robots kept operating during the snowstorm 'Emma' in the UK, while airports and schools remained closed.

The robots deliver packages in various weather conditions.

At the same time, annotating data takes time and resources. Ideally, we would train and improve models with less data. This is where architecture engineering comes into play: we encode prior knowledge into the architecture and the optimization process to reduce the search space to programs that are more likely in the real world.

We incorporate prior knowledge into neural network architectures to get better models.

In some computer vision applications such as pixel-wise segmentation, it's useful for the model to know whether the robot is on a sidewalk or a road crossing. To provide a hint, we encode global image-level clues into the neural network architecture; the model then decides whether or not to use them, without having to learn this from scratch.
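One common way to provide such a hint is to broadcast an image-level feature vector to every spatial location and let the network decide whether to use it. Below is a minimal PyTorch sketch of that pattern, not Starship's actual architecture:

```python
import torch
from torch import nn

class GlobalContextSegHead(nn.Module):
    """Concatenate a global, image-level clue to every pixel's features."""

    def __init__(self, feat_channels: int, context_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Conv2d(feat_channels + context_dim, num_classes, kernel_size=1)

    def forward(self, features: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) feature map from the backbone
        # context:  (B, D) global clue, e.g. an embedding of "sidewalk" vs "road crossing"
        b, _, h, w = features.shape
        context_map = context[:, :, None, None].expand(b, context.shape[1], h, w)
        fused = torch.cat([features, context_map], dim=1)
        return self.classifier(fused)            # per-pixel class scores

# Example usage with random tensors standing in for real backbone features.
head = GlobalContextSegHead(feat_channels=64, context_dim=8, num_classes=5)
scores = head(torch.randn(2, 64, 32, 32), torch.randn(2, 8))  # -> (2, 5, 32, 32)
```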

After data and architecture engineering, the model might work well. However, deep learning models require a significant amount of computing power, and this is a big challenge for the team because we cannot take advantage of the most powerful graphics cards on battery-powered low-cost delivery robots.

Starship wants deliveries to be low cost, which means our hardware must be inexpensive. For the same reason, Starship doesn't use LIDAR (a detection system that works on the principle of radar but uses light from a laser), even though it would make understanding the world much easier: we don't want our customers paying more than they need to for delivery.

State-of-the-art object detection systems published in academic papers run around 5 frames per second [MaskRCNN], and real-time object detection papers don’t report rates significantly over 100 FPS [Light-Head R-CNN, tiny-YOLO, tiny-DSOD]. What’s more, these numbers are reported on a single image; however, we need a 360-degree understanding (the equivalent of processing roughly 5 single images).

To put this into perspective, Starship models run at over 2000 FPS when measured on a consumer-grade GPU, and process a full 360-degree panorama image in one forward pass. This is equivalent to 10,000 FPS when processing 5 single images with batch size 1.
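The bookkeeping behind that equivalence is plain multiplication, using the numbers quoted above:

```python
panorama_fps = 2000        # full 360-degree panoramas per second on a consumer-grade GPU
images_per_panorama = 5    # a panorama covers roughly what 5 single camera images would
single_image_equivalent_fps = panorama_fps * images_per_panorama
print(single_image_equivalent_fps)   # 10000
```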

Neural networks are better than humans at many visual problems, although they still may contain bugs. For example, a bounding box may be too wide, the confidence too low, or an object might be hallucinated in a place that is actually empty.

Potential problems in the object detection module. How to solve these?

Fixing these bugs is challenging.

Neural networks are considered to be black boxes that are hard to analyze and comprehend. However, to improve the model, engineers need to understand the failure cases and dive deep into the specifics of what the model has learned.

The model is represented by a set of weights, and one can visualize what each specific neuron is trying to detect. For example, the first layers of Starship's network activate on simple patterns like horizontal and vertical edges. The next block of layers detects more complex textures, while higher layers detect car parts and full objects.

How the neural network we use in the robots builds up its understanding of images.
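For the first convolutional layer, this can be as simple as plotting the learned filter weights directly (deeper layers need tools such as activation maximization). A small sketch, assuming a PyTorch model whose first layer is a Conv2d; the model here is a stand-in, not the actual detector:

```python
from torch import nn
import matplotlib.pyplot as plt

# A stand-in model; in practice you would load the trained detector instead.
model = nn.Sequential(nn.Conv2d(3, 16, kernel_size=7), nn.ReLU())
first_conv = model[0]

# Each filter is a 3 x 7 x 7 weight patch; after training, many of them look
# like oriented edge or color-contrast detectors.
filters = first_conv.weight.detach()
filters = (filters - filters.min()) / (filters.max() - filters.min())  # scale to [0, 1]

fig, axes = plt.subplots(4, 4, figsize=(6, 6))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0))   # reorder to (H, W, C) for imshow
    ax.axis("off")
plt.show()
```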

Technical debt takes on another meaning with machine learning models. The engineers continuously improve the architectures, optimization processes and datasets, and the model becomes more accurate as a result. Yet swapping in a better detection model doesn't necessarily guarantee better overall robot behaviour.

There are dozens of components that use the output of the object detection model, each of which requires a different precision and recall level, set based on the existing model. However, the new model may behave differently in various ways. For example, the output probability distribution could be biased towards larger values or be wider. Even though the average performance is better, it may be worse for a specific group like large cars. To avoid these hurdles, the team calibrates the probabilities and checks for regressions on multiple stratified datasets.

Average performance doesn’t tell you the whole story of the model.
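One common recipe for the calibration step is to fit a monotonic mapping (for example isotonic regression) from raw confidences to observed correctness on held-out data, and then to check precision and recall per slice rather than only on average. A sketch with scikit-learn and made-up numbers:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import precision_score, recall_score

# Held-out detections: raw model confidences and whether each one was correct.
raw_conf = np.array([0.95, 0.80, 0.60, 0.55, 0.30, 0.20])
is_correct = np.array([1, 1, 1, 0, 0, 0])
slice_name = np.array(["car", "large_car", "car", "large_car", "car", "car"])

# Map raw confidences to calibrated probabilities so downstream thresholds
# keep their meaning when the detector is swapped for a new one.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrated = calibrator.fit_transform(raw_conf, is_correct)

# Check every slice separately: better average performance can still hide
# a regression on, say, large cars.
threshold = 0.5
for name in np.unique(slice_name):
    mask = slice_name == name
    preds = calibrated[mask] >= threshold
    print(name,
          "precision", precision_score(is_correct[mask], preds, zero_division=0),
          "recall", recall_score(is_correct[mask], preds, zero_division=0))
```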

Monitoring trainable software components poses a different set of challenges compared to monitoring standard software. Inference time and memory usage are of little concern, because they are mostly constant.

However, dataset shift becomes the primary concern — the data distribution the model was trained on differs from the one the model is currently operating in.

For example, electric scooters may suddenly start appearing on the sidewalks. If the model wasn't trained on this class, it will have a hard time classifying them correctly. The information derived from the object detection module will then disagree with other sensory information, resulting in requests for assistance from human operators and thus slower deliveries.

A major concern in practical machine learning — training and test data come from different distributions.
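One lightweight way to notice this kind of shift in production is to compare the class distribution of live detections against the distribution seen during training; the threshold and class names below are purely illustrative:

```python
from collections import Counter

def class_distribution(labels):
    """Fraction of detections per class, e.g. {"pedestrian": 0.7, "car": 0.3}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

def distribution_drift(train_dist, live_dist):
    """Total variation distance between training-time and live class frequencies."""
    classes = set(train_dist) | set(live_dist)
    return 0.5 * sum(abs(train_dist.get(c, 0.0) - live_dist.get(c, 0.0)) for c in classes)

train_dist = class_distribution(["pedestrian"] * 70 + ["car"] * 30)
live_dist = class_distribution(["pedestrian"] * 50 + ["car"] * 20 + ["unknown"] * 30)

if distribution_drift(train_dist, live_dist) > 0.2:   # alert threshold is a guess
    print("Possible dataset shift: review recent detections, consider retraining.")
```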

Neural networks empower Starship robots to be safe on road crossings by avoiding obstacles like cars, and on sidewalks by understanding all the different directions in which humans and other obstacles may choose to move.

Starship robots achieve this using inexpensive hardware, which poses many engineering challenges but makes robot deliveries a reality today. Starship's robots are making real deliveries seven days a week in multiple cities around the world, and it's rewarding to see how our technology keeps bringing people more convenience in their lives.