NTCIR-11 IMine Task Guidelines

Yiqun Liu1, Ruihua Song2, Min Zhang1, Zhicheng Dou2, Takehiro Yamamoto3, Makoto Kato3, Hiroaki Ohshima3, Ke Zhou4

1Tsinghua University, 2Microsoft Research Asia, 3Kyoto University, 4University of Edinburgh

 

Welcome to the NTCIR-11 IMine Task. Our goal is to explore and evaluate technologies for satisfying the different user intents behind a Web search query by mining subtopics and generating diversified search rankings. IMine is a core task of NTCIR-11 and a successor to INTENT@NTCIR-9 and INTENT2@NTCIR-10. If you are new to the IMine/INTENT tasks, you may want to start by reading the overview papers of INTENT and INTENT2.

 

IMine also includes a TaskMine subtask, which aims to find subtasks of a given task described by a query. If you are interested in this subtask, please visit the TaskMine subtask homepage.

Tentative Timetable

· Corpus available: Aug 31, 2013
· Call for participants: Aug 31, 2013
· NTCIR-11 task participant registration due: Jan 20, 2014 (extended by NII for all NTCIR-11 tasks)
· Topics and non-diversified baseline DR runs released: Jan 21, 2014
· All submissions due: May 23, 2014 (see Submission Information)
· Evaluation results available: Aug 15, 2014 (see Evaluation Results)
· Early draft overview paper available: Aug 22, 2014 (see Draft Overview Paper)
· Draft participant paper submission due: Sept 15, 2014 (Paper Submission Site)
· Final overview paper available: Oct 1, 2014
· Camera-ready participant paper submission due: Nov 1, 2014
· NTCIR-11 conference: Dec 9-14, 2014

 

To participate in NTCIR-11 IMine: Please visit the NTCIR-11 participant registration page.

Overview

According to a number of existing studies on search user behavior, many Web search queries are short and vague, which means that users may have different intents behind the same query. With an ambiguous query, users may be looking for different interpretations of the query; with a query on a broad topic (a broad query), users may be interested in different subtopics under that topic. To help search engines understand these diversified search intents and provide satisfying rankings, the IMine task provides common data sets and an evaluation methodology for diversified search technologies. The task name “IMine” is short for search Intent Mining, and it is also pronounced like “曖昧”, which means “ambiguous” in Chinese and Japanese.

 

Through IMine@NTCIR-11, we expect participants to advance the state-of-the-art techniques explored in INTENT/INTENT2, with more emphasis on analyzing the complex (usually multi-level) intents behind queries. The task is composed of three subtasks: Subtopic Mining, Document Ranking and TaskMine. In the Subtopic Mining subtask, participants should judge whether a query is ambiguous, broad or clear, and construct a two-level hierarchy of search intents for each non-clear query. The generated subtopic hierarchy is then used in the Document Ranking subtask to produce a diversified ranking of search results for each query that satisfies as many users’ information needs as possible. You may participate in all three subtasks, or in just one or two of them.

 

Compared with the previous INTENT/INTENT2 tasks, more user behavior data are provided to help both participants and assessors with subtopic generation and importance estimation. We are also interested in comparing diversified search annotations produced by a small number of professional assessors with those produced by a relatively large number of untrained users; each participant’s run will therefore be assessed and ranked according to both professional assessors’ and ordinary search users’ judgments. As in previous tasks, Chinese, English and Japanese datasets are available, and a number of query topics will be shared among the languages.

 

The major differences between IMine and the previous INTENT2 task are summarized below.

 

Number of Topics
· INTENT2: Chinese: 100; Japanese: 100; English: 50
· IMine: Chinese: 50; Japanese: 50; English: 50

DR task setting
· INTENT2: Chinese: SogouT (Ver.2008); Japanese: ClueWeb JA
· IMine: Chinese: SogouT (Ver.2008); English: ClueWeb12-B13

Crowdsourcing
· INTENT2: No
· IMine: Crowdsourcing evaluation for Chinese DR

Subtopic organization
· INTENT2: One level
· IMine: Two levels (no more than 5 first-level subtopics, each with at most 10 second-level subtopics)

Subtopic candidates
· INTENT2: Query suggestions from Bing, Google, Sogou and Baidu
· IMine: Query suggestions from Bing, Google, Sogou, Yahoo! and Baidu; query facets generated by MSR from search engine results; query facets generated by THU from Sogou log data

User behavior data
· INTENT2: SogouQ (data collected in 2008), approximately 2GB
· IMine: SogouQ (data collected in 2008 and 2011), approximately 4GB

DR baseline
· INTENT2: Chinese DR baseline; Japanese DR baseline
· IMine: ClueWeb12-B13 retrieval service provided by CMU; SogouT retrieval service provided by Tsinghua

 

Query Topics

The same query topics will be adopted in both the Subtopic Mining and Document Ranking subtasks for all languages. These topics are sampled from medium-frequency queries collected from both Sogou and Bing search logs. Approximately equal numbers of ambiguous, broad and clear queries will be included in the query topic set. About 10-20 topics will be shared among the different languages for possible future cross-language research purposes.

 

The query topics, provided as a text file compressed with 7zip, can be downloaded here.

Document Collection & User Behavior Data Collection

SogouT will be adopted as the document collection for Chinese topics in the Document Ranking subtask. The collection contains about 130M Chinese pages together with the corresponding link graph; its uncompressed size is roughly 5TB. The data was crawled and released in November 2008. Further information about this collection can be found at http://www.sogou.com/labs/dl/t-e.html. You can contact chenjing@sogou-inc.com directly to obtain the data set. For both the Subtopic Mining and Document Ranking subtasks, the SogouQ search user behavior data collection is available to participants as an additional resource. The collection contains queries and click-through data collected and sampled in November 2008 (consistent with SogouT). A newer version of SogouQ, sampled from data collected in 2012, is also available. Further information about the data can be found at http://www.sogou.com/labs/dl/q.html.

 

For the English Document Ranking subtask, the ClueWeb12-B13 data set will be adopted. It includes about 52M English Web pages crawled during 2012, and a search interface is provided by the Lemur project. We thank Prof. Jamie Callan and his team for providing the collection, which dramatically reduces the effort required from participants. Further information about the collection can be found at http://lemurproject.org/clueweb12/.

 

A large amount of user behavior data collected from Bing and Sogou will be used to assist assessors with subtopic pre-clustering and importance pre-estimation during annotation. Although these data will not be open to participants, we believe their use will improve the reusability, reliability and efficiency of the evaluation results, so all participants and other researchers may still benefit from these non-public user behavior data sets.

Subtopic Mining Subtask

In the Subtopic Mining subtask, a subtopic can be an interpretation of an ambiguous query or an aspect of a broad query. Participants are expected to generate a two-level hierarchy of underlying subtopics by analyzing the provided document collection, the user behavior data set, or other external data sources. A list of query suggestions/completions collected from popular commercial search engines is provided as possible subtopic candidates, but participants should not assume that these candidates cover all possible subtopics. The candidates come from the following resources:

· Query suggestions collected from Bing, Google, Sogou, Yahoo! and Baidu (Download)
· Query dimensions generated by [2] from search engine result pages (Download)
· Related Web search queries generated by [3] from Sogou log data (Download)

 

As for the two-level hierarchy of subtopics, take the ambiguous query “windows” as an example. The first-level subtopics may be Microsoft Windows (the software) and house windows. Within the Microsoft Windows category, users may be interested in different aspects (second-level subtopics) such as “Windows 8”, “Windows update”, etc. At most FIVE first-level subtopics, each with no more than TEN second-level subtopics, should be returned for each query topic. There is no need to return subtopics for clear queries, but participants will not be penalized in the evaluation for doing so. The first-level subtopics of broad queries will not be taken into consideration in the evaluation because there may be various valid ways of organizing the high-level aspects of such queries.
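For illustration only (this is not an official data format), such a hierarchy can be represented as a simple nested structure. The sketch below, in Python, uses the “windows” example from this section with made-up scores.

# A minimal sketch of a two-level subtopic hierarchy for the query "windows".
# The structure follows the example in the text; the scores are purely illustrative.
hierarchy = {
    "Microsoft Windows": {                # first-level subtopic (interpretation)
        "score": 0.99,
        "second_level": {                 # second-level subtopics (aspects)
            "Windows 8": 0.98,
            "Windows update": 0.69,
        },
    },
    "house windows": {
        "score": 0.82,
        "second_level": {
            "window repairing": 0.97,
            "window glass sale": 0.87,
        },
    },
}

# Task constraints: at most 5 first-level subtopics, each with at most 10 second-level subtopics.
assert len(hierarchy) <= 5
assert all(len(v["second_level"]) <= 10 for v in hierarchy.values())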

 

Besides the hierarchy of subtopics, a ranking list of all first-level subtopics and a separate ranking list of all second-level subtopics should also be returned for each ambiguous/broad query. With these ranking lists, the importance estimation results can be evaluated and compared across participant runs. Participants should also submit their judgment of whether each query topic is ambiguous, broad or clear.

 

In the Subtopic Mining subtask, the submitted first-level and second-level subtopics will be grouped into intents separately. A number of assessors will then verify the relevance and quality of these subtopics and vote on their importance, and we will estimate subtopic importance from the assessors’ votes. Finally, we will evaluate the submitted hierarchies based on both the quality of the hierarchy (whether each second-level subtopic actually belongs to its first-level subtopic, with metrics such as Precision/Recall/F-measure) and the quality of the subtopics (whether the first-level/second-level subtopics cover the important aspects of the query, with the D#-measures proposed in [1]).

 

A preliminary version of the Subtopic Mining evaluation metric is described below to help participants design their algorithms. In this metric, the quality of the submitted hierarchy is evaluated by combining three component scores, H-score, F-score and S-score, each of which describes one aspect of the submitted hierarchy. They are defined as follows.

1. H-score measures the quality of the hierarchical structure: whether each second-level subtopic is correctly assigned to the appropriate first-level subtopic.

 

Here N(1) is the number of first-level subtopics for a certain query topic in the submission (no more than 5), and Accuracy(i) is the percentage of correctly-assigned second-level subtopics for first-level subtopic i. If first-level subtopic i is not relevant to the query topic, then Accuracy(i) should be 0. Irrelevant second-level subtopics should not be regarded as “correctly-assigned” ones. (A rough computational sketch is given after these definitions.)

 

2. F-score measures the quality of the first-level subtopics: whether the submitted first-level subtopics are correctly ranked and whether all important first-level subtopics are found.

Here the score is computed over the first-level subtopic list for a given query topic, ranked by the scores contained in the submission file.

 

3. S-score measures the quality of the second-level subtopics.

Here the score is computed over the second-level subtopic list for a given query topic, ranked by the second-level subtopic scores (which should be global across different first-level subtopics; see the result format).
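The exact formulas belong to the official metric description and are not reproduced here. As a rough, unofficial sketch, assuming that H-score is simply the average of Accuracy(i) over the N(1) submitted first-level subtopics (our assumption), it could be computed as follows; F-score and S-score, which evaluate the ranked subtopic lists, are not sketched.

def h_score(accuracy_per_first_level):
    """Unofficial sketch of H-score, assuming it averages Accuracy(i) over
    the N(1) submitted first-level subtopics (at most 5).

    accuracy_per_first_level: list of Accuracy(i) values in [0, 1]; an
    irrelevant first-level subtopic contributes Accuracy(i) = 0.
    """
    n1 = len(accuracy_per_first_level)
    return sum(accuracy_per_first_level) / n1 if n1 else 0.0

# Example: two first-level subtopics with 80% and 50% of their
# second-level subtopics correctly assigned.
print(h_score([0.8, 0.5]))  # 0.65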

Document Ranking Subtask

In the Document Ranking subtask, participants are asked to return a diversified ranked list of no more than 100 results for each query. We encourage participants to apply diversification algorithms selectively, because diversification is not necessary for every query. Based on the subtopic mining results, participants are expected to select important first-level/second-level subtopics and combine them to form a diversified ranking list. The goals of diversification are (a) to retrieve documents that cover as many intents as possible; and (b) to rank documents that are highly relevant to more popular intents above those that are marginally relevant to less popular intents.
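Diversification strategies are left entirely to participants. As one illustrative (not prescribed) approach, a greedy xQuAD-style re-ranker can balance plain relevance against covering the mined subtopics; the relevance scores, subtopic probabilities and the parameter lam below are hypothetical inputs, not task-provided data.

def diversify(docs, subtopic_prob, relevance, k=100, lam=0.5):
    """Greedy diversified re-ranking sketch (xQuAD-style, for illustration only).

    docs:           candidate document ids
    subtopic_prob:  {subtopic: estimated importance P(s|q)}
    relevance:      {(doc, subtopic): relevance score}, plus (doc, None) for
                    query-level relevance
    lam:            trade-off between plain relevance and diversity
    """
    selected = []
    uncovered = dict(subtopic_prob)   # how much of each subtopic is still uncovered
    candidates = set(docs)
    while candidates and len(selected) < k:
        def gain(d):
            rel = relevance.get((d, None), 0.0)
            div = sum(p * relevance.get((d, s), 0.0) for s, p in uncovered.items())
            return (1 - lam) * rel + lam * div
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
        # Discount subtopics that the chosen document already covers.
        for s in uncovered:
            uncovered[s] *= 1.0 - min(1.0, relevance.get((best, s), 0.0))
    return selected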

 

The Document Ranking subtask is performed on SogouT for Chinese topics and on ClueWeb12-B13 for English topics. To help participants who are not able to build their own retrieval platforms, we provide a non-diversified baseline Chinese DR run based on our own retrieval system. Retrieval results can also be obtained through the ClueWeb12 search interface.

 

For the evaluation of the Document Ranking subtask, we will follow a standard Cranfield-like approach in which a number of top-ranked results from participants’ submitted runs are included in a result pool. Each document in the pool will be assessed by professional assessors, who will judge its relevance to each first-level and second-level subtopic (obtained in the Subtopic Mining subtask). Submissions will be evaluated mainly with the D#-measures proposed in [1]. In addition, a crowdsourcing evaluation will be performed for at least part of the query topics by a relatively large number of untrained users.
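For reference, the sketch below follows our reading of D#-nDCG as described in [1]: a document's global gain mixes per-intent gains weighted by intent probabilities, and the final score combines intent recall with D-nDCG. The cutoff, gamma = 0.5 and the toy gain values are assumptions for illustration, not official evaluation settings.

import math

def d_sharp_ndcg(ranking, intent_prob, gain, cutoff=10, gamma=0.5):
    """Sketch of D#-nDCG following [1] (illustrative only).

    ranking:     ranked list of document ids
    intent_prob: {intent: P(i|q)}
    gain:        {(doc, intent): graded gain value}
    """
    def global_gain(doc):
        return sum(p * gain.get((doc, i), 0.0) for i, p in intent_prob.items())

    def dcg(gains):
        return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

    ranked_gains = [global_gain(d) for d in ranking[:cutoff]]
    # Ideal list: all judged documents sorted by global gain.
    all_docs = {d for d, _ in gain}
    ideal_gains = sorted((global_gain(d) for d in all_docs), reverse=True)[:cutoff]
    ideal_dcg = dcg(ideal_gains)
    d_ndcg = dcg(ranked_gains) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Intent recall: fraction of intents covered by the top-ranked documents.
    covered = {i for d in ranking[:cutoff] for i in intent_prob
               if gain.get((d, i), 0.0) > 0}
    i_rec = len(covered) / len(intent_prob) if intent_prob else 0.0

    return gamma * i_rec + (1 - gamma) * d_ndcg

# Toy example with two intents and three judged documents.
score = d_sharp_ndcg(
    ranking=["d1", "d2", "d3"],
    intent_prob={"i1": 0.6, "i2": 0.4},
    gain={("d1", "i1"): 3, ("d2", "i2"): 2, ("d3", "i1"): 1},
)
print(round(score, 3))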

TaskMine Subtask

Please visit the task guideline to learn more about this pilot task.

Submission Information

All Subtopic Mining and Document Ranking submissions must be compressed (zip). Each participating team can submit up to 5 runs for each subtask-language pair. In a Subtopic Mining run, please submit at most FIVE first-level subtopics, each with no more than TEN second-level subtopics, per topic. In a Document Ranking run, please submit up to 100 documents per topic. We may use a cutoff, e.g. 10, in the evaluation. As the run files may contain Chinese or Japanese characters, they must be encoded in UTF-8.

 

All runs should be generated completely automatically. No manual run is allowed. Please check the format of your run files with the error checking script before submission.

 

Run Names

 

Run files should be named as “<teamID>-[SD]-[CEJ]-[priority][AB].txt”

 

<teamID> is exactly the team ID you used when registering for NTCIR-11. You can contact the organizers if you have forgotten your team ID.

[SD]: S for Subtopic Mining runs, D for Document Ranking runs

[CEJ]: C for Chinese runs, E for English runs, J for Japanese runs

[priority]: indicates the order in which your runs should be considered for result pool construction; due to limited resources, we may not be able to include all submitted runs in the result pool.

[AB]: all Subtopic Mining runs are A runs; Document Ranking runs which employ the non-diversified baseline provided by the organizers are B runs; all other Document Ranking runs are A runs.
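As an unofficial convenience (this is not the official error checking script), run names can be self-checked against the pattern above with a simple regular expression; the pattern below assumes alphanumeric team IDs and numeric priorities.

import re

# Pattern: <teamID>-[SD]-[CEJ]-[priority][AB].txt  (unofficial self-check only)
RUN_NAME = re.compile(r"^(?P<team>[A-Za-z0-9]+)-(?P<task>[SD])-(?P<lang>[CEJ])-"
                      r"(?P<priority>\d+)(?P<runtype>[AB])\.txt$")

for name in ["MSRA-S-E-1A.txt", "MSRA-D-C-2B.txt", "MSRA-X-E-1A.txt"]:
    print(name, "OK" if RUN_NAME.match(name) else "invalid run name")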

 

Subtopic Mining Run Submission Format

 

For all runs in the Subtopic Mining subtask, line 1 of the submission file must be of the form:

<SYSDESC>[insert a short description in English here]</SYSDESC>

 

The rest of the file should contain lines of the form:

[TopicID];0;[1st level Subtopic];[Rank1];[Score1];0;[2nd level Subtopic];[Rank2];[Score2];[RunName]\n

 

At most FIVE first-level subtopics with no more than TEN second-level subtopics each should be returned for each query topic.

 

NOTE that [Rank2] and [Score2] should be GLOBAL ranks and scores across different first-level subtopics.

 

For example, an English Subtopic Mining run should look like this:

 

0051;0;Microsoft Windows;1;0.99;0;Windows 8;1;0.98;MSRA-S-E-1A

0051;0;Microsoft Windows;1;0.99;0;Windows phone 8;4;0.76;MSRA-S-E-1A

0051;0;Microsoft Windows;1;0.99;0;Windows update;5;0.69;MSRA-S-E-1A

0051;0;House Windows;2;0.82;0;Windows repairing;2;0.97;MSRA-S-E-1A

0051;0;House Windows;2;0.82;0;Window glass sale;3;0.87;MSRA-S-E-1A
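The snippet below is an unofficial sanity check (not the official error checking script) that parses lines in the format above and verifies the per-topic limits; the function name and assertions are our own.

from collections import defaultdict

def check_subtopic_run(path):
    """Unofficial sanity check for a Subtopic Mining run file (UTF-8)."""
    first_level = defaultdict(set)                     # topic -> first-level subtopics
    second_level = defaultdict(lambda: defaultdict(set))
    with open(path, encoding="utf-8") as f:
        header = f.readline()
        assert header.startswith("<SYSDESC>"), "missing <SYSDESC> line"
        for line in f:
            fields = line.rstrip("\n").split(";")
            assert len(fields) == 10, f"expected 10 ';'-separated fields: {line!r}"
            topic, _, sub1, rank1, score1, _, sub2, rank2, score2, run = fields
            int(rank1), int(rank2), float(score1), float(score2)   # must be numeric
            first_level[topic].add(sub1)
            second_level[topic][sub1].add(sub2)
    for topic, subs in first_level.items():
        assert len(subs) <= 5, f"topic {topic}: more than 5 first-level subtopics"
        for s in subs:
            assert len(second_level[topic][s]) <= 10, \
                f"topic {topic}: more than 10 second-level subtopics under {s!r}"

# check_subtopic_run("MSRA-S-E-1A.txt")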

 

Document Ranking Run Submission Format

 

For all runs in the Document Ranking subtask, line 1 of the submission file must be of the form:

<SYSDESC>[insert a short description in English here]</SYSDESC>

 

The rest of the file should contain lines of the form:

[TopicID] 0 [DocumentID] [Rank] [Score] [RunName]\n

 

At most 100 documents should be returned for each query topic.

 

For example, an English Document Ranking run should look like this:

0101 0 clueweb12-0006-97-23810 1 27.73 MSRA-D-E-1A

0101 0 clueweb12-0009-08-98321 2 25.15 MSRA-D-E-1A

0101 0 clueweb12-0003-71-19833 3 21.89 MSRA-D-E-1A

0101 0 clueweb12-0002-66-03897 4 13.57 MSRA-D-E-1A
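Similarly, an unofficial sanity check (again, not the official error checking script) for Document Ranking runs can verify the six whitespace-separated fields and the 100-document limit per topic.

from collections import Counter

def check_dr_run(path):
    """Unofficial sanity check for a Document Ranking run file (UTF-8)."""
    per_topic = Counter()
    with open(path, encoding="utf-8") as f:
        header = f.readline()
        assert header.startswith("<SYSDESC>"), "missing <SYSDESC> line"
        for line in f:
            fields = line.split()
            assert len(fields) == 6, f"expected 6 fields: {line!r}"
            topic, _, doc_id, rank, score, run = fields
            int(rank), float(score)                  # must be numeric
            per_topic[topic] += 1
    for topic, count in per_topic.items():
        assert count <= 100, f"topic {topic}: more than 100 documents"

# check_dr_run("MSRA-D-E-1A.txt")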

 

Checking Your Runs

Please check the format of your run files with the error checking script before submission. Badly formatted run files will not be evaluated.

 

Submission Site

Please submit your results via the FTP site. After submitting, please send an email to ntcir.imine@gmail.com with the title "Result submission for NTCIR IMine - teamID". The email should also contain a brief description of the methodology adopted for each run (fewer than 100 words).

 

Released Data

Please download the topic files for INTENT and INTENT2 for training purposes here (strongly recommended!). The annotation results for submitted subtopics and documents are also available here via an online application process through NII.

 

The query topics are available for download, together with related subtopic resources: query dimensions, query suggestions from a number of search engines, and related queries for Chinese topics. These resources are provided to participants as possible subtopic candidates; participants are also welcome to generate subtopic candidates with their own strategies.

 

We provide a non-diversified baseline Chinese DR run to help participants who do not want to construct their own retrieval system. The run is based on the SogouT (Ver. 2008) corpus and was produced with the TMiner retrieval system. The raw Web page files are also provided, together with at most 200 results for each query topic. Note that no results are returned for query topic No. 0033 (安卓2.3游戏下载, “Android 2.3 game download”) because Android 2.3 had not been released in 2008.

 

Result Evaluation

Preliminary Result Assessments:

English Subtopic Mining assessments: Download

Chinese Subtopic Mining assessments: Download

Japanese Subtopic Mining assessments: Download

English Document Ranking assessments: Download

Chinese Document Ranking assessments: Download

 

Preliminary Result Evaluation:

Subtopic Mining result evaluation: (Per Topic Results) (Detailed Results)

Document Ranking result evaluation: (Per Topic Results) (Detailed Results)

 

Draft Overview Paper:

Overview of the NTCIR-11 IMine Task: Download

 

References

[1] Sakai, T. and Song, R.: Evaluating Diversified Search Results Using Per-Intent Graded Relevance, ACM SIGIR 2011, pp.1043-1052, July 2011.

[2] Zhicheng Dou, Sha Hu, Yulong Luo, Ruihua Song and Ji-Rong Wen: Finding Dimensions for Queries, ACM CIKM 2011, October 2011.

[3] Yiqun Liu, Junwei Miao, Min Zhang, Shaoping Ma and Liyun Ru: How Do Users Describe Their Information Need: Query Recommendation based on Snippet Click Model. Expert Systems With Applications, 38(11): 13847-13856, 2011.

[4] IMine task proposal slides presented at the NTCIR-10 workshop. SLIDES

Contact Us

If you have any questions, please send an email to ntcir.imine@gmail.com.

 

 

 

Last updated on July 21, 2014.