8

I'm trying to find multi-label classification datasets that are available for free online.

By "multi-label" I mean that each instance can be labeled with anywhere from a single to $k$ labels, where $k$ is the total number of different labels in the dataset. Typically all information about the labels would be represented in a binary matrix $\mathbf{M}$, where $\mathbf{M}_{ij}=1$ if instance $i$ has label $j$, and $0$ otherwise.
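To make the matrix definition concrete, here is a minimal sketch that builds $\mathbf{M}$ from per-instance label sets. The toy data and the names `instance_labels`, `labels`, and `label_index` are made up for illustration and do not come from any of the datasets mentioned here.

```python
# Hypothetical toy data: each instance carries a set of label names.
instance_labels = [
    {"cat", "outdoor"},
    {"dog"},
    {"cat", "dog", "indoor"},
]

# Fix a column order for the k distinct labels.
labels = sorted(set().union(*instance_labels))
label_index = {name: j for j, name in enumerate(labels)}

# M[i][j] = 1 if instance i has label j, and 0 otherwise.
M = [[0] * len(labels) for _ in instance_labels]
for i, names in enumerate(instance_labels):
    for name in names:
        M[i][label_index[name]] = 1

print(labels)
for row in M:
    print(row)
```

Each row has between one and $k$ ones, which is the property I'm looking for in a dataset.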

I've found the following two datasets so far:

  1. iMaterialist Challenge (Fashion) at FGVC5 from Kaggle.com
  2. DeliciousMIL: A Data Set for Multi-Label Multi-Instance Learning with Instance Labels Data Set

I've also looked at the Mulan multi-label datasets page, but the datasets there are described rather opaquely (and sometimes erroneously).

Where can I find more multi-label datasets (preferably with 20-200 different labels in total)?

Bobson Dugnutt
  • Have you tried the Kaggle datasets page (https://www.kaggle.com/datasets)? You can also check http://academictorrents.com/ – Kaustubh Jul 02 '18 at 09:12
  • @Kaustubh Thanks for the links. Yes, I've tried Kaggle's database; unfortunately Kaggle doesn't have a multilabel-classification tag (only a multiclass-classification tag), so it is really difficult to find the multilabel stuff. Do you know how to search for multilabel specifically on academictorrents.com? – Bobson Dugnutt Jul 02 '18 at 09:15
  • I can't think of how to do that, but one idea is to search for object detection datasets. In object detection problems the image is often tagged with multiple objects (labels), so any object detection dataset may serve your purpose: you can ignore the bounding boxes and just keep the labels. – Kaustubh Jul 02 '18 at 09:31
  • @Kaustubh True, I briefly looked at the object-detection tag as well. Unfortunately many of those are multiclass too, but the idea is good! – Bobson Dugnutt Jul 02 '18 at 09:39
  • There are many multi-label image data-sets... – DuttaA Jul 02 '18 at 12:47
  • @DuttaA Hehe, that's great! Where though? – Bobson Dugnutt Jul 02 '18 at 12:52
  • https://deeplearning4j.org/opendata – DuttaA Jul 02 '18 at 13:06
  • @DuttaA Hmm, to me most of these standard image datasets look multiclass, not multilabel. Do you know of any that are specifically multilabel? – Bobson Dugnutt Jul 02 '18 at 13:13
  • Multilabel is different from multiclass? – DuttaA Jul 02 '18 at 13:15
  • Right, I see... No, I have no idea. – DuttaA Jul 02 '18 at 13:16
  • You can also make a synthetic multilabel dataset with `sklearn.datasets.make_multilabel_classification` – 00schneider May 27 '22 at 16:49
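
As a rough sketch of the synthetic route suggested in the last comment, the snippet below calls scikit-learn's `make_multilabel_classification`. The sizes (500 instances, 50 labels, about 3 labels per instance on average) are arbitrary choices picked to fall in the 20-200 label range from the question; `allow_unlabeled=False` asks the generator to give every instance at least one label.

```python
from sklearn.datasets import make_multilabel_classification

# Assumed, arbitrary sizes: 500 instances, 20 features, 50 distinct labels.
X, Y = make_multilabel_classification(
    n_samples=500,
    n_features=20,
    n_classes=50,           # total number of different labels (k)
    n_labels=3,             # average number of labels per instance
    allow_unlabeled=False,  # every instance gets at least one label
    random_state=0,
)

# Y is the binary indicator matrix: Y[i, j] == 1 if instance i has label j.
print(X.shape, Y.shape)
print(Y.sum(axis=1).min())  # smallest number of labels on any instance
```

Synthetic data like this is only a stand-in for checking that a pipeline runs; it does not replace a real multi-label dataset.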

3 Answers

3

You can find a complete repository of around 80 multi-label datasets here:

Adept
Eva Gibaja
2

Try the Kaggle Toxic Comment Classification Challenge. You have to assign each comment to multiple classes at the same time, so it is a multi-label classification problem.

tenshi
  • Thank you for the suggestion. While it is true that the dataset is technically multi-label, many of the instances don't have any labels at all, so it unfortunately doesn't meet the criterion of being labeled with anywhere from a single to $k$ labels. – Bobson Dugnutt Jul 02 '18 at 10:25
  • Just remove the observations that don't have any labels. That should make you happy, right? – tenshi Jul 02 '18 at 10:28
  • Not really, there are still only 6 different labels, and they seem to be very heavily correlated. Still, thank you for the suggestion, it is definitely better than nothing :) – Bobson Dugnutt Jul 02 '18 at 10:32
  • Cool, glad you liked it. You could end this answer, or leave it open for more suggestions from the community. – tenshi Jul 02 '18 at 10:33
  • Yeah, I think I'll leave it open, as I would like more suggestions. – Bobson Dugnutt Jul 02 '18 at 10:34
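
The filtering step discussed in these comments (dropping instances that have no labels at all) could look roughly like the sketch below. The toy `texts` and `M` are hypothetical placeholders, not rows from the Toxic Comments data.

```python
# Hypothetical toy data: 4 instances, 3 labels; rows 1 and 3 are unlabeled.
texts = ["a", "b", "c", "d"]
M = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]

# Keep only the instances whose indicator row contains at least one 1.
kept = [(text, row) for text, row in zip(texts, M) if any(row)]
kept_texts = [text for text, _ in kept]
kept_rows = [row for _, row in kept]

print(kept_texts)
print(kept_rows)
```

This only addresses the unlabeled rows; it does nothing about the small number of labels or their correlation.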
2

19 free datasets:

  1. United States Census Data: The U.S. Census Bureau publishes reams of demographic data at the state, city, and even zip code level. The data set is fantastic for creating geographic data visualizations and can be accessed on the Census Bureau website. Alternatively, the data can be accessed via an API. One convenient way to use that API is through the choroplethr R package. In general, this data is very clean and very comprehensive.

  2. FBI Crime Data: The FBI crime data set is fascinating. If you’re interested in analyzing time series data, you can use it to chart changes in crime rates at the national level over a 20-year period. Alternatively, you can look at the data geographically.

And many more here: https://www.springboard.com/blog/free-public-data-sets-data-science-project/

Stephen Rauch