Creating a dataset

Quick guide on how to create a dataset

Dataset & Documents

A collection of documents is called a dataset.

A document is the unit of data in Relevance AI, similar to a row in a SQL table or a document in a MongoDB collection. In Python, a document can be represented as a dictionary.

Documents are stored in Relevance AI as JSON. This format lets you store the complex, nested structures that are common in unstructured data. Example:

  {
    "_id" : 1,
    "title" : "Apple IPhone 13 Pro",
    "product" : {
      "image_url" : "",
      "price" : 1699
    },
    "title_ner" : [
      {"substring":"Apple", "label":"Organization", "probability":0.98},
      {"substring":"IPhone", "label":"Product", "probability":0.97}
    ],
    "title_labels" : [
      {"label":"Phone", "probability":0.93},
      {"label":"Electronic", "probability":0.80}
    ],
    "words" : ["Apple", "IPhone", "13", "Pro"],
    "title_word_vector_" : [0.12, 0.34, ...],
    "product_image_vector_" : [0.12, 0.34, ...]
  }

  • A document can be nested (dictionary of dictionaries) or have array (list) values.
  • A field refers to the key of a dictionary.
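
To make the two points above concrete, here is a minimal Python sketch of a document as a dictionary. It uses only the standard library and does not call any Relevance AI API. The helper list_fields and the dotted naming of nested fields (for example product.price) are illustrative assumptions, not something this page confirms.

```python
import json

# Hypothetical document modelled on the example above; values are illustrative.
document = {
    "_id": "1",
    "title": "Apple IPhone 13 Pro",
    "product": {"image_url": "", "price": 1699},
    "title_ner": [
        {"substring": "Apple", "label": "Organization", "probability": 0.98},
    ],
    "words": ["Apple", "IPhone", "13", "Pro"],
    "title_word_vector_": [0.12, 0.34],
}


def list_fields(doc, prefix=""):
    """Return every key of the document, descending into nested dictionaries.

    Nested keys are joined with a dot; this naming is an assumption made
    for the sketch only.
    """
    fields = []
    for key, value in doc.items():
        name = f"{prefix}{key}"
        fields.append(name)
        if isinstance(value, dict):
            fields.extend(list_fields(value, prefix=f"{name}."))
    return fields


# A dictionary like this converts to JSON and back without losing structure.
as_json = json.dumps(document)
restored = json.loads(as_json)

fields = list_fields(document)
```

Here fields contains the top-level keys as well as the nested product.image_url and product.price, while list values such as words are treated as a single field.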

Schema rules

  • Each document should include an _id string field that uniquely identifies it. If two documents have the same _id, the last inserted document overwrites the earlier one. Example: {"_id": "j2hfio23j"}
  • If you are inserting vectors, the field name must end with the suffix _vector_. All vectors under the same field must be the same length. You can store as many different vectors as you want in a single document. Example: {"word_vector_":[0.12, 0.34, ...]}
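
The schema rules can be checked locally before inserting data. The following is a rough sketch in plain Python, not part of any Relevance AI client: validate_documents and insert_all are hypothetical helpers, the check only looks at top-level fields, and the in-memory dictionary merely imitates the overwrite behaviour described for duplicate _id values.

```python
def validate_documents(documents):
    """Return a list of problems found; an empty list means the rules pass.

    Only top-level fields are inspected in this sketch.
    """
    problems = []
    vector_lengths = {}  # field name -> length of the first vector seen
    for index, doc in enumerate(documents):
        # Rule 1: every document needs a string _id.
        if not isinstance(doc.get("_id"), str):
            problems.append(f"document {index}: missing string _id")
        # Rule 2: all vectors under the same *_vector_ field share one length.
        for field, value in doc.items():
            if not field.endswith("_vector_"):
                continue
            expected = vector_lengths.setdefault(field, len(value))
            if len(value) != expected:
                problems.append(
                    f"document {index}: {field} has length {len(value)}, "
                    f"expected {expected}"
                )
    return problems


def insert_all(store, documents):
    """Imitate the overwrite rule with a dictionary keyed by _id."""
    for doc in documents:
        store[doc["_id"]] = doc  # a later document replaces an earlier one
    return store


docs = [
    {"_id": "a", "word_vector_": [0.1, 0.2]},
    {"_id": "a", "word_vector_": [0.3, 0.4], "title": "new"},
]
problems = validate_documents(docs)
store = insert_all({}, docs)
```

With these two documents, problems is empty and store ends up holding a single entry for "a", the second document.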


Dataset Limitations

  • You cannot rename a dataset, and you cannot rename or edit existing field names. However, you can clone a dataset and edit field names in the clone.
  • A dataset name cannot contain spaces or capital letters.
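
Since a dataset cannot be renamed later, it may be worth checking the name before creating it. The sketch below is a plain Python check for the naming rule stated above. is_valid_dataset_name is a hypothetical helper, and it covers only spaces and capital letters; any further restrictions the platform may apply are not covered.

```python
def is_valid_dataset_name(name):
    """Return True when the name is non-empty and has no whitespace or capitals."""
    return bool(name) and not any(ch.isspace() or ch.isupper() for ch in name)


# Hypothetical candidate names, checked before a dataset is created.
candidates = ["ecommerce_products", "Ecommerce", "my dataset"]
results = {name: is_valid_dataset_name(name) for name in candidates}
```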