Task Details
The task consists of identifying which instances, described by properties (i.e., attributes), represent the same real-world entity.
Participants are asked to solve the task on several datasets of different types (e.g., products, people, etc.) that will be released progressively. Each dataset consists of a list of instances (rows) and a list of properties describing them (columns); we will refer to each of these datasets as Di.
For each dataset Di, participants will be provided with the following resources:
- Xi : a subset of the instances in Di
- Yi : matching/non-matching labels for pairs in Xi x Xi
- Di metadata (e.g., how many instances it contains, what are the main characteristics)
Note that the Yi label sets are transitively closed (i.e., if A matches B and B matches C, then A matches C).
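Since the labels are transitively closed, it is usually worth closing your own predicted matches transitively as well before emitting them. Below is a minimal sketch of one way to do this with a union-find structure; it is not part of the contest materials, and the function name and pair format are purely illustrative:

```python
from itertools import combinations

def transitive_closure(pairs):
    """Return every pair implied by transitivity over the given matches."""
    parent = {}

    def find(x):
        # Iterative find with path halving.
        while parent.setdefault(x, x) != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Group instance ids by their root, then emit all pairs per group.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return [p for members in groups.values()
            for p in combinations(sorted(members), 2)]

# (A,B) and (B,C) together imply (A,C).
print(transitive_closure([("A", "B"), ("B", "C")]))
```

Any equivalent grouping (e.g., connected components from a graph library) works just as well.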
Solutions will be evaluated over Zi = Di \ Xi. Note that the instances in Zi will not be provided to participants. More details are available in the Evaluation Process section.
Both Xi and Yi are in CSV format.
Example of dataset Xi
instance_id | attr_name_1 | attr_name_2 | ... | attr_name_k |
---|---|---|---|---|
00001 | value_1 | null | ... | value_k |
00002 | null | value_2 | ... | value_k |
... | ... | ... | ... | ... |
Example of dataset Yi
left_instance_id | right_instance_id | label |
---|---|---|
00001 | 00002 | 1 |
00001 | 00003 | 0 |
... | ... | ... |
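As a minimal sketch of how these files could be loaded (assuming you use pandas, which the contest does not require; the file names follow the convention above):

```python
import pandas as pd

# Read the instances; keep ids as strings to preserve leading zeros.
X = pd.read_csv("X2.csv", dtype={"instance_id": str})

# Read the labelled pairs: label == 1 means matching, 0 non-matching.
Y = pd.read_csv(
    "Y2.csv",
    dtype={"left_instance_id": str, "right_instance_id": str},
)

matches = Y[Y["label"] == 1]
print(f"{len(X)} instances, {len(matches)} labelled matching pairs")
```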
More details about the datasets can be found in the dedicated Datasets section.
Your goal is to find, for each Xi dataset, all pairs of instances that match (i.e., refer to the same real-world entity). The output must be stored in a CSV file containing only the matching instance pairs found by your system. The file must be named "output.csv", must use the comma as separator, and must have exactly two columns: "left_instance_id" and "right_instance_id".
Example of output.csv
left_instance_id | right_instance_id |
---|---|
00001 | 00002 |
00001 | 00004 |
... | ... |
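A minimal sketch of writing the output in the required format (assuming pandas, with a hypothetical list of id pairs standing in for your matcher's result):

```python
import pandas as pd

# Hypothetical result of your matcher: a list of (left_id, right_id) pairs.
matching_pairs = [("00001", "00002"), ("00001", "00004")]

out = pd.DataFrame(
    matching_pairs,
    columns=["left_instance_id", "right_instance_id"],
)
# Comma-separated, no index column, exactly the two required columns.
out.to_csv("output.csv", index=False)
```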
More details about the submission process can be found in the dedicated Submitting section.
# | Name | Description | Metadata | Download |
---|---|---|---|---|
1 | NotebookToy | Sample notebook specifications (will not be used for final leaderboard) | 128 instances, 16 attributes, 40 entities | Dataset X1, Dataset Y1 |
2 | Notebook | Notebook specifications | 538 instances, 14 attributes, 100 entities | Dataset X2, Dataset Y2 |
3 | NotebookLarge | Notebook specifications | 605 instances, 14 attributes, 158 entities | Dataset X3, Dataset Y3 |
4 | Altosight | Product specifications (kindly provided by Altosight) | 1356 instances, 5 attributes, 193 entities | Dataset X4, Dataset Y4 |
You can also download these datasets together with Snowman.
Snowman helps you compare and evaluate your data matching solutions. You can upload experiment results from your data matching solution and then easily compare them with a gold standard, compare two experiment runs with each other, or calculate binary metrics like precision and recall. Snowman is developed as part of a bachelor’s project at the Hasso Plattner Institute, Potsdam, in collaboration with SAP SE.
You can download the latest release, which already includes the datasets provided for the contest.
Participants are asked to use ReproZip to pack the solution they want to submit.
ReproZip is a tool for packing input files, libraries, and environment variables into a single bundle (in .rpz format) that can be reproduced on any machine.
A brief guide on how to use ReproZip to package your solution follows.
First of all, you have to install ReproZip on your machine. ReproZip can be installed via pip (pip install reprozip). More details about the installation can be found in the dedicated Documentation page.
Let’s suppose that your code is made up of a Python module called "greedy_matcher.py" and that you launch your program with the following command: python greedy_matcher.py.
ReproZip first needs to trace the code execution. To do so, it is sufficient to run the following command: reprozip trace python greedy_matcher.py.
The code will be executed and a hidden folder (called ".reprozip-trace") will be created at the end of the process. This folder contains a "config.yml" file, a configuration file with information about the input/output files, libraries, environment variables, etc. traced during the execution of your code. If you want to omit anything you think is not useful to pack, you can edit this file manually. Please be sure not to remove any libraries or files needed to reproduce the code; otherwise the bundle may become non-reproducible.
Finally, to create the bundle, you have to run the following command: reprozip pack submission.rpz.
Please note that if your code is made up of more than one file, even files written in different programming languages, you can trace the whole execution by using the "--continue" option (e.g., reprozip trace python prepare.py followed by reprozip trace --continue python greedy_matcher.py, where the first file name is just illustrative). You can find more details in the dedicated Documentation page.
At this point, a file called "submission.rpz" will be created and you can submit it using our dashboard.
Inside your code, it is important to refer to each input dataset Xi using its original name "Xi.csv".
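For example, a minimal top-level script (a sketch only; the matcher itself is left as a stub, and pandas is an assumption) could read each Xi.csv by its original name and write the single cumulative output.csv required by the evaluation process:

```python
import pandas as pd

def match(df):
    """Your matching logic: return a list of (left_id, right_id) pairs.
    Left as a stub here."""
    return []

all_pairs = []
# X1 is the toy dataset and is not used for the final leaderboard.
for name in ["X2.csv", "X3.csv", "X4.csv"]:
    df = pd.read_csv(name, dtype={"instance_id": str})
    all_pairs.extend(match(df))

# One cumulative output file covering all datasets, as required.
pd.DataFrame(
    all_pairs, columns=["left_instance_id", "right_instance_id"]
).to_csv("output.csv", index=False)
```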
Submitted solutions will be unpacked and reproduced using ReproUnzip on an evaluation server with the following characteristics:
Processor | 32 CPU x 2.1 GHz |
Main Memory | 64 GB |
Storage | 2 TB |
Operating System | Linux |
In particular, before running ReproUnzip, the Xi dataset (i.e., the original input you worked on) will be replaced with the Zi dataset, which contains the hidden instances.
Here is the detailed sequence of operations used for the evaluation process:
- reprounzip docker setup <bundle> <solution>, to unpack the uploaded bundle
- reprounzip docker upload <solution> Zj.csv:Xi.csv to replace the input datasets (X2.csv, X3.csv, ...) with the hidden ones (Z2.csv, Z3.csv, ...), potentially in shuffled order (e.g., X2 can be replaced by Z3)
- reprounzip docker run <solution>
- reprounzip docker download <solution> output.csv (i.e., you must produce just one file "output.csv" cumulative for all the datasets)
- evaluation of "output.csv"
Note that, in order to be evaluated, your submission must be reproduced correctly (i.e., the process must end with the creation of the "output.csv" file without errors) and must process all the datasets within a given total timeout (the whole cycle above must complete within the defined timeout). The timeout value is available below and will be updated every time a new dataset is released.
TIMEOUT: 25 min (last updated: 6 April 2021)
For each dataset Di we will compute the resulting F-measure with respect to Zi x Zi. Submitted solutions will be ranked on their average F-measure over all datasets.
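For reference, here is a minimal sketch of how a pairwise F-measure can be computed; the actual evaluation script is not public, so this only illustrates the metric:

```python
def f_measure(predicted, gold):
    """Pairwise F-measure; pairs are normalised so their order is irrelevant."""
    pred = {tuple(sorted(p)) for p in predicted}
    true = {tuple(sorted(p)) for p in gold}
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# One of two predicted pairs is correct, one of two gold pairs is found:
# precision = 0.5, recall = 0.5, F-measure = 0.5.
print(f_measure([("A", "B"), ("A", "C")], [("A", "B"), ("B", "D")]))
```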
Unfortunately, ReproUnzip sometimes prints useful information about errors on stdout, which is excluded from the submission log. If you are stuck on a technical error that prevents the successful reproduction of your submission, no useful information appears in the log, and not even the provided ReproUnzip commands help you find the cause, you can send us an email so that we can check the content of stdout for any useful information about the error.
Rules
- ACM SIGMOD 2021 Programming Contest is open to undergraduate and graduate students from degree-granting institutions all over the world. However, students associated with the organizers' institutions are not eligible to participate.
- Teams must consist of individuals currently registered as graduate or undergraduate students at an accredited academic institution. A team may be formed by one or more students, who need not be enrolled at the same institution. Several teams from the same institution can compete independently, but one person can be a member of only one team. There is no limit on team size. Teams can register on the contest site after 25 February 2021.
- All submissions must consist only of code written by the team or open source licensed software (i.e., using an OSI-approved license). For source code from books or public articles, clear reference and attribution must be made. Final submissions must be made by 30 April 2021 (anywhere on Earth).
- All teams must agree to license their code under an OSI-approved open source license. By participating in this contest, each team agrees to publish its source code. The finalists' implementations will be made public on the contest website.