
7 Steps to preparing your image collection for machine learning datasets

As a new industry emerges around artificial intelligence, content owners have an opportunity to supply their material for AI training. Preparing datasets for machine learning has its quirks, so this article provides a step-by-step guide to preparing your content as datasets to be sold on this site for commercial use in AI training. Your content will be used for training only: the license does not allow the purchaser to use it for any other purpose.

Step #1 Audit your content

This step can be likened to getting your house in order before putting it on the market.

If you are an individual content creator, you need to understand which of your content assets are suitable for AI training datasets.

The first consideration is quantity: do you have enough images or video to make it worthwhile to offer them as part of a dataset? Our minimum requirement is 1000 images.

Second, consider whether you have a distinct specialty that will make your datasets interesting for training. For example, you might have a large collection of praying mantis images. If you do not have a distinct specialty, you can still contribute images to datasets made up of images from other content creators.
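
As a quick way to check the quantity requirement above, the following is a minimal sketch that counts image files in a folder and compares the total with the 1000-image minimum. The function names are hypothetical, and treating .jpeg as equivalent to .jpg is an assumption; the accepted formats themselves are covered in Step #4.

```python
from pathlib import Path

# Assumption: .jpeg is treated as equivalent to .jpg; JPG and PNG are the
# image formats listed in Step #4 of this guide.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MINIMUM_IMAGES = 1000  # minimum requirement stated in this step


def count_images(root: Path) -> int:
    """Count image files under root, including subfolders."""
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def meets_minimum(root: Path, minimum: int = MINIMUM_IMAGES) -> bool:
    """Return True if the folder holds at least `minimum` images."""
    return count_images(root) >= minimum
```

For example, meets_minimum(Path("my_library")) would report whether a folder named my_library (a placeholder) is large enough on its own.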

Step #2 Review your legal compliance

This step is part of your content audit, as it determines which parts of your library can, or cannot, be used for machine learning.

You need to make sure your images and video can legally be licensed. The two most important things to check are privacy and copyright permissions.

If your images or videos show people who can be personally identified, you need a biometric model release signed by each individual. Read more about GDPR and BIPA compliance.

If your files show any landmarks or trademarks, refer to this article to learn more about using photographs of copyrighted works or trademarks. You might also find this comprehensive stock industry wiki of landmarks, brands and trademarks useful.

These requirements are standard in the stock industry, so if you already sell the material as stock, you are likely to understand them and to be in compliance. Note, however, that biometric releases for people featured in the content are key to compliance in AI training.

In terms of copyright, as a content creator you are likely to have full rights to license your content, unless the work was done for a commercial shoot and the client owns the rights.

If you are a stock agency that represents photographers and videographers, we recommend getting opt-in permission from your contributors to sell their content for AI training.

You will need to request this for all existing content by reaching out to your contributors and informing them about this opportunity. For future uploads, we suggest adding an “allow for training purposes” checkbox to the onboarding process.

Step #3 Selection and sorting

Most content will be of interest for machine learning, as models are hungry beasts that need a lot of training data.

Having a large quantity of one subject is an advantage.

Alternatively, you can upload a selection of files to a dataset shared with other contributors. You will then receive a royalty share of payments according to the quantity of your media included in the dataset.
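
To illustrate what a share "according to the quantity of media" could look like, here is a minimal sketch that assumes a straight pro-rata split by file count. The actual royalty terms are not specified in this guide, so the formula, the function name and the figures are all assumptions made for illustration only.

```python
from fractions import Fraction


def royalty_shares(
    file_counts: dict[str, int], royalty_pool: Fraction
) -> dict[str, Fraction]:
    """Split a royalty pool between contributors in proportion to file count.

    Assumption: a plain pro-rata split. The real terms may differ.
    """
    total = sum(file_counts.values())
    if total == 0:
        raise ValueError("the dataset contains no files")
    return {
        name: royalty_pool * Fraction(count, total)
        for name, count in file_counts.items()
    }
```

Under this assumption, a contributor who supplied 300 of the 1000 files in a dataset would receive 30% of the contributor royalties from each sale of that dataset.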

Step #4 File format and naming

There are three key considerations to file preparation.

1. Format

  • We accept image files in the following formats: JPG, PNG.
  • We accept video files in the following containers: MOV, MP4. H.264 is the codec of choice.

2. Aspect ratio

  • This is a little unusual, but machine learning typically requires all images to have the same aspect ratio (here, a square). However, you can supply your files as is, and we will run automated scripts to crop the files and prepare them for training.

3. File naming

  • Each file must have the same name as the txt file that contains its meta tags (see Step #5, Metatags and descriptions, below).
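
The format and naming rules above can be checked before delivery. The following is a minimal sketch, not the site's own tooling: it flags files whose extension is not in the accepted list and media files without a same-named .txt or .json metadata file. The function name and the shape of the result are hypothetical. Because metadata may also be embedded in the IPTC header (see Step #5), a missing metadata file is only a prompt to double-check, not necessarily an error.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".png"}      # accepted image formats
VIDEO_EXTENSIONS = {".mov", ".mp4"}      # accepted video containers
SIDECAR_EXTENSIONS = {".txt", ".json"}   # separate metadata files (Step #5)


def check_folder(folder: Path) -> dict[str, list[str]]:
    """Report unsupported files and media files lacking a metadata file."""
    unsupported: list[str] = []
    missing_sidecar: list[str] = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in SIDECAR_EXTENSIONS:
            continue  # metadata files are not media, skip them
        if suffix not in IMAGE_EXTENSIONS | VIDEO_EXTENSIONS:
            unsupported.append(path.name)
            continue
        # A metadata file must share the media file's name, e.g. a.jpg -> a.txt
        if not any(path.with_suffix(ext).exists() for ext in SIDECAR_EXTENSIONS):
            missing_sidecar.append(path.name)
    return {"unsupported": unsupported, "missing_sidecar": missing_sidecar}
```

This sketch does not inspect video codecs, which would need a dedicated media tool.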

We can provide some automated solutions to assist you with preparing your files for datasets.
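
To give a sense of what cropping to a square involves, here is a minimal sketch that computes the largest centered square for a given image size. Center cropping is an assumption made for illustration; the scripts actually used to prepare files may select the square differently.

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square.

    Assumption: the square is centered. Other crop strategies are possible.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side
```

For a 1920 x 1080 image this gives (420, 0, 1500, 1080), a 1080 x 1080 square. The tuple uses the (left, upper, right, lower) order that image libraries such as Pillow expect for a crop box.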

Step #5 Metatags and descriptions

Images need to be delivered with appropriate metadata which describes the content of the image.

The most important metadata are the image tags. We require between 10 and 50 keywords for each image.

A description and title for the image can be provided, but they are optional.

Metadata can be provided either in the IPTC header of the image file itself or as separate files associated with the image. In the latter case, you may use a .txt or .json file with the same file name as the image file. 

Keywords should be separated by commas or, in the case of IPTC embedded metadata, by commas or semicolons.

Example:

  • germanlandscape55.jpg
  • germanlandscape55.txt / germanlandscape55.json
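
The keyword rules above can be checked with a short script. This is a minimal sketch, assuming the .txt file contains nothing but the keyword list; the function names are hypothetical. It accepts both commas and semicolons as separators and drops repeated keywords, which is a lenient choice made here and not a stated requirement.

```python
import re

MIN_KEYWORDS = 10  # lower bound stated in this step
MAX_KEYWORDS = 50  # upper bound stated in this step


def parse_keywords(text: str) -> list[str]:
    """Split a keyword string on commas or semicolons and tidy the result."""
    seen: set[str] = set()
    keywords: list[str] = []
    for part in re.split(r"[,;]", text):
        keyword = part.strip()
        # Assumption: empty entries and case-insensitive repeats are dropped.
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords


def keyword_count_ok(keywords: list[str]) -> bool:
    """Return True if the number of keywords is within the required range."""
    return MIN_KEYWORDS <= len(keywords) <= MAX_KEYWORDS
```

For the example above, the text of germanlandscape55.txt would be passed to parse_keywords, and keyword_count_ok would then confirm that the image has between 10 and 50 keywords.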

A CSV file may also be provided for the metadata. Please contact us if you are interested in this solution.

We can provide support during this process. Please reach out via our Live Chat feature, or file a Help Ticket.

Step #6 Delivering the files

The files can be delivered in a few different ways, depending on your setup and also the nature of the files you want to deliver.

The most convenient way to deliver your files is to share a cloud drive with us (for example, Google Drive or pCloud). We can transfer the files directly from these directories to our servers.

For smaller batches, you might consider using a file transfer service such as WeTransfer, or FTP delivery. We suggest these methods for collections of up to 100 GB.

Larger deliveries can be done via physical hard drive – we will provide you with the address.
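
To see which side of the 100 GB guideline a collection falls on, the following minimal sketch totals the size of a folder. It assumes 100 GB means 100 x 10^9 bytes; the function names and the returned labels are hypothetical. A shared cloud drive remains an option at any size, so this only distinguishes between file transfer and a physical drive.

```python
from pathlib import Path

# Assumption: "100 GB" is read as decimal gigabytes (10**9 bytes each).
FILE_TRANSFER_LIMIT_BYTES = 100 * 10**9


def collection_size_bytes(root: Path) -> int:
    """Total size in bytes of all files under root, including subfolders."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def suggest_delivery(size_bytes: int) -> str:
    """Map a collection size to the delivery method suggested in this step."""
    if size_bytes <= FILE_TRANSFER_LIMIT_BYTES:
        return "file transfer or FTP"
    return "physical hard drive"
```

For example, suggest_delivery(collection_size_bytes(Path("my_library"))) would return the suggested method for a folder named my_library (a placeholder).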

Where API access exists, we will work with your team to allow the transfer to happen digitally. This is how we work with large agencies to transfer their libraries to our servers.

When you sign up as a contributor, we will provide you with instructions for delivering your content.

There is a review process once you upload your files. We will provide you with the final selection and an explanation of why any files were not included.

Step #7 Sales, reporting and revenue sharing

Once you have prepared your files and delivered them to datasetshop.com, you will be able to start promoting your data for sale.

You can either directly promote the featured dataset page, or link people to your contributor page to see all the datasets where your work is available.

When sales occur, you will be notified at the email address you provided as your primary contact. You will receive automatic payments at the end of each month to the account provided.