SYRF has the largest collection of sailing-related data on earth. We’ve amassed metadata for over 100k races that span the globe and the last 20 years. In this post, we’ll give an overview of the data we have and show you how to get it in case you want to play with it yourself.

Our entire dataset is over 1TB in size, but thankfully we’re able to export metadata slices that take up much less space. What data do we have? We have the bounding boxes for nearly every tracked race for the last 20 years. Additionally, we have the lat/lon coordinates of the approximate start location, as well as the start time. With this dataset, we can use any existing geospatial data visualization tool to look for trends.

How might you use this data?

The most obvious way to use this data is for sailing brands and organizations to inform their digital strategies with geospatial awareness. For instance, a brand could target advertisements at areas where sailing is popular rather than wasting money on bids in areas where it is not.

Another interesting use case for this data is to look for regions that have seen the largest growth in the sport over the last few years, such as in South America or APAC. You may find this useful if you’re planning an event or broadcast. One may also observe the impact of Covid on the sport, or note the rise in popularity of digital tracking apps (the source of our data set).

Perhaps more interesting would be to use this data to run meteorological calculations and find regional averages, kind of like a modern Sailing Instructions.

SYRF is excited to release this data because it simply hasn’t been available before and we know the community will come up with innovative applications.

The Data

The first data set is a set of nearly 80,000 race starts. Each start has a location, a start time, day, month, year and country. The data view is set to animate a window over time, and you can clearly observe a dip in March of 2020. Hovering over a dot shows you additional details about the race while the color of the dot indicates the country.
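If you would rather look for that dip in numbers than in an animation, counting starts per calendar month is enough. The sketch below is a minimal illustration only: the record layout (ISO start time, lat, lon, country) and the six sample rows are made up for the example and are not taken from the real export.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows: (start_time, lat, lon, country).
# The field order and the values are assumptions for illustration only.
starts = [
    ("2020-01-11T12:00:00+00:00", 50.77, -1.30, "GB"),
    ("2020-01-25T10:30:00+00:00", 41.38, 2.19, "ES"),
    ("2020-02-01T13:00:00+00:00", -33.86, 151.21, "AU"),
    ("2020-02-08T09:00:00+00:00", 50.77, -1.30, "GB"),
    ("2020-02-22T11:00:00+00:00", 37.81, -122.42, "US"),
    ("2020-03-07T12:00:00+00:00", 41.38, 2.19, "ES"),
]


def starts_per_month(records):
    """Count race starts per (year, month)."""
    counts = Counter()
    for start_time, _lat, _lon, _country in records:
        dt = datetime.fromisoformat(start_time)
        counts[(dt.year, dt.month)] += 1
    return counts


monthly = starts_per_month(starts)
for (year, month), n in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {n}")

# The month with the fewest starts in the sample.
quietest = min(monthly, key=monthly.get)
```

On the full data set you would probably compare each month against the same month in earlier years, since sailing is seasonal and a raw minimum mostly finds winter.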

We also made this data available as a cluster visualization:

For every race, we found a bounding box containing the positions of all the boats. Then we removed every bounding box larger than 10 km² from the data set. Finally, we dissolved the remaining bounding boxes so that we were left only with polygons that indicate the spatial extent of sailing races. The resulting polygons show you where people are actually racing, i.e. where the boats are moving. Crucially, we were able to count the number of races contained in each polygon. Sorting the regions by race count lets you prioritize any kind of geospatial analysis according to the popularity of the region. This data would be useful for statistical analysis of weather and currents.
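To make the first two steps concrete, here is a rough sketch of computing a bounding box from boat positions and dropping boxes above the 10 km² threshold. It is not our production pipeline: the function names, the sample positions, and the equirectangular area approximation are all assumptions chosen to keep the example short, and it ignores boxes that cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_AREA_KM2 = 10.0  # threshold described in the text


def bounding_box(positions):
    """Return (min_lon, min_lat, max_lon, max_lat) for (lat, lon) pairs."""
    lats = [lat for lat, _lon in positions]
    lons = [lon for _lat, lon in positions]
    return (min(lons), min(lats), max(lons), max(lats))


def box_area_km2(box):
    """Approximate box area, scaling longitude by cos(mid-latitude)."""
    min_lon, min_lat, max_lon, max_lat = box
    mid_lat = math.radians((min_lat + max_lat) / 2)
    height_km = math.radians(max_lat - min_lat) * EARTH_RADIUS_KM
    width_km = math.radians(max_lon - min_lon) * EARTH_RADIUS_KM * math.cos(mid_lat)
    return width_km * height_km


# Hypothetical races: race id -> list of (lat, lon) boat positions.
races = {
    "harbour_race": [(50.770, -1.300), (50.780, -1.290), (50.775, -1.285)],
    "offshore_race": [(50.0, -1.0), (51.0, 0.0)],
}

boxes = {race_id: bounding_box(pos) for race_id, pos in races.items()}

# Keep only the boxes at or below the area threshold.
kept = {
    race_id: box
    for race_id, box in boxes.items()
    if box_area_km2(box) <= MAX_AREA_KM2
}
```

The dissolve step is not shown here because it needs a real geometry library to union the boxes into polygons.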

What if you don’t want irregularly shaped polygons? For instance, maybe you want nice rectangular coordinates so that you can use a command line utility to extract the weather from a GRIB file. By finding bounding boxes for the irregular polygons described above, and recursively dissolving and re-bounding the results, we were able to create a dataset of pure rectangles that represents where races are happening.
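The dissolve-and-re-bound loop can be sketched without any geometry library if you work on rectangles directly: merge any two overlapping rectangles into their joint bounding box, then repeat, because the enlarged rectangle may now overlap a neighbour it did not touch before. This is a simplified stand-in for what we did, not the exact procedure; the function names, the sample rectangles, and the race counts are invented for the example.

```python
def overlaps(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union_box(a, b):
    """Smallest rectangle containing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rectangles(regions):
    """Merge overlapping (box, race_count) pairs until none overlap."""
    regions = list(regions)
    merged = True
    while merged:
        merged = False
        result = []
        for box, count in regions:
            for i, (other, other_count) in enumerate(result):
                if overlaps(box, other):
                    # Re-bound the pair and add up their race counts.
                    result[i] = (union_box(box, other), count + other_count)
                    merged = True
                    break
            else:
                result.append((box, count))
        regions = result
    return regions


# Hypothetical input: (box, number of races in that box).
regions = [
    ((2.5, -1.0, 4.0, 0.5), 1),
    ((0.0, 0.0, 2.0, 2.0), 3),
    ((1.0, 1.0, 3.0, 3.0), 2),
    ((10.0, 10.0, 11.0, 11.0), 5),
]

rectangles = merge_rectangles(regions)

# Most popular regions first.
ranked = sorted(rectangles, key=lambda r: r[1], reverse=True)
```

In this sample the first rectangle overlaps neither the second nor the third on its own, but it does overlap their merged bounding box, which is why a single pass is not enough. The trade-off is that rectangles cover more empty water than the original polygons do.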

Lastly, and just for fun, we decided to plot our library of yacht clubs in 3D hexbins where the height represents the number of clubs contained in the hexbin.

This is just the beginning. We’ve got some amazing things planned for the next 6 months so be sure to subscribe for more updates like this.