Youngjoon Jang

yjoonjang34@gmail.com

Hi, I’m Youngjoon. I’m a Master’s student in NLP&AI Lab at Korea University, advised by Prof. Heuiseok Lim. Before this, I studied Mechanical Engineering & Computer Science at Hongik University.

I’m drawn to a deceptively simple question: how can I help people find the right information? That curiosity drives my work in Information Retrieval (dense, sparse, and late-interaction retrieval), Multilingual Information Retrieval, and Retrieval-Augmented Generation (RAG). My research has been published at SIGIR, ICLR, ACL, and EMNLP, while the Korean retrieval models and benchmarks I led have grown to 200+ GitHub stars and 1.3M+ downloads on Hugging Face.

I love building in the open, and I actively contribute to projects including Sentence-Transformers, MTEB, and InstructKR.

News

Apr 02, 2026	Our paper “Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval” has been accepted to SIGIR 2026 🇦🇺
Mar 02, 2026	Our paper “Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment” has been accepted to ICLR 2026 🇧🇷

Education

Korea University Mar. 2025 – Present

M.S. in Computer Science and Engineering (Advisor: Prof. Heuiseok Lim, NLP&AI Lab)

Hongik University Mar. 2020 – Feb. 2025

B.S. in Computer Engineering & Mechanical and System Design Engineering (Double Major)

Projects

KURE: Korea University Retrieval Embedding Model (GitHub · HF)
Led the flagship Korean retrieval project and trained a SOTA dense retriever (1st on MTEB-ko-retrieval); the project has 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads. Awarded Best Oral Presentation at HCLT 2025.

Korean ColBERT & Sparse Retrievers (colbert-ko-v1 · splade-ko-v1 · inference-free-splade-ko-v1)
Trained and open-sourced Korean ColBERT and SPLADE variants achieving SOTA among corresponding architectures on the Korean Retrieval Benchmark.

PreRanker (GitHub · HF)
Trained a lightweight reranker for tool retrieval in LLM agents, narrowing candidate tools before downstream execution. Achieved a 3.1% relative improvement in tool-retrieval Recall@10, enabling accurate candidate filtering for agent tool selection.

URACLE–Korea University Collaborative Research
Trained a Korean–English cross-lingual retrieval embedding model and analyzed language-pair trade-offs; used model merging to recover monolingual retrieval performance while retaining cross-lingual retrieval gains. Published at the ACL 2026 MeLLM Workshop.

WBL: World Best LLM Project (HF)
Led the data team: built a query-clarity tagging & evaluation framework, and a reward-model-ensemble response-filtering pipeline for large-scale training data.

KT–Korea University Collaborative Research (Korean Legal LLM) (News)
Developed an end-to-end training recipe for a Korean legal-domain LLM, published as LEGALMIDM (ICLR 2026 Data-FM Workshop); contributed to KT's $10.42M contract for the South Korean Supreme Court AI platform.

Open Source Contributions

Sentence-Transformers

Extended the cross-encoder training stack with classic learning-to-rank losses (RankNetLoss, ListMLELoss, Position-Aware ListMLELoss).
Implemented EmbedDistillLoss for direct embedding-level knowledge distillation.
Introduced hardness-weighted contrastive learning for hard negatives.
Implemented CachedSpladeLoss for memory-efficient SPLADE training.

MTEB (Massive Text Embedding Benchmark)

Added a Korean retrieval benchmark task (AutoRAGRetrieval).
Improved OpenAI embedding wrapper stability.
Fixed NaN embeddings for Jasper models.

InstructKR

Led the Korean Reranker evaluation and leaderboard project.

Publications [Conference]

SIGIR 2026

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author

In Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2026

Investigating the importance of score distribution in knowledge distillation for dense retrieval, going beyond hard negatives.
@inproceedings{jang2026sigir,
  title = {Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval},
  author = {Jang, Youngjoon and Hong, Seongtae and Moon, Hyeonseok and Lim, Heuiseok},
  year = {2026},
  booktitle = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)},
}

ICLR 2026

Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment

Seongtae Hong, Youngjoon Jang, Jungseob Lee, and 2 more authors

In Proceedings of the 14th International Conference on Learning Representations (ICLR), 2026

@inproceedings{hong2026iclr,
  title = {Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment},
  author = {Hong, Seongtae and Jang, Youngjoon and Lee, Jungseob and Moon, Hyeonseok and Lim, Heuiseok},
  year = {2026},
  booktitle = {Proceedings of the 14th International Conference on Learning Representations (ICLR)},
}

ICLR DATA-FM 2026

LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model

Youngjoon Jang, Chanhee Park, Hyeonseok Moon, and 5 more authors

In International Conference on Learning Representations Addressing Data Problems for Foundation Models Workshop (ICLR Data-FM Workshop), 2026

@inproceedings{jang2026legalmidm,
  title = {LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model},
  author = {Jang, Youngjoon and Park, Chanhee and Moon, Hyeonseok and Ham, Young-kyoung and Moon, Jiwon and Kim, Jinhyeon and Jung, JuKyung and Lim, Heuiseok},
  year = {2026},
  booktitle = {International Conference on Learning Representations Addressing Data Problems for Foundation Models Workshop (ICLR Data-FM Workshop)},
}

ACL 2026

CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training

Seungyoon Lee, Minhyuk Kim, Seongtae Hong, and 3 more authors

In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL), 2026

@inproceedings{lee2026clear,
  title = {CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training},
  author = {Lee, Seungyoon and Kim, Minhyuk and Hong, Seongtae and Jang, Youngjoon and Oh, Dongsuk and Lim, Heuiseok},
  year = {2026},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL)},
}

ACL MeLLM 2026

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

Youngjoon Jang, Junyoung Son, Taemin Lee, and 5 more authors

In Annual Meeting of the Association for Computational Linguistics Workshop on Multilinguality in the Era of Large Language Models (ACL MeLLM Workshop), 2026

@inproceedings{jang2025crosslingual,
  title = {Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging},
  author = {Jang, Youngjoon and Son, Junyoung and Lee, Taemin and Hong, Seongtae and Moon, Hyeonseok and Lee, Seungyoon and Matteson, Andrew and Lim, Heuiseok},
  year = {2026},
  booktitle = {Annual Meeting of the Association for Computational Linguistics Workshop on Multilinguality in the Era of Large Language Models (ACL MeLLM Workshop)},
}

ACL SRW 2025

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

Youngjoon Jang, Seongtae Hong, Junyoung Son, and 3 more authors

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (ACL SRW), 2025

@inproceedings{jang2025coreference,
  title = {From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems},
  author = {Jang, Youngjoon and Hong, Seongtae and Son, Junyoung and Park, Sungjin and Park, Chanjun and Lim, Heuiseok},
  year = {2025},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (ACL SRW)},
}

EMNLP 2024

Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts

Seonmin Koo, Jinsung Kim, Youngjoon Jang, and 2 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

@inproceedings{koo2024whereami,
  title = {Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts},
  author = {Koo, Seonmin and Kim, Jinsung and Jang, Youngjoon and Park, Chanjun and Lim, Heuiseok},
  year = {2024},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
}

Publications [Domestic Conference]

HCLT 2025

KURE: Embedding Model for Korean-Specific Retrieval

Youngjoon Jang, Junyoung Son, Taemin Lee, and 3 more authors

In The 37th Annual Conference on Human & Cognitive Language Technology (HCLT), 2025

Best Paper

Best Oral Presentation Award

@inproceedings{jang2025kure,
  title = {KURE: Embedding Model for Korean-Specific Retrieval},
  author = {Jang, Youngjoon and Son, Junyoung and Lee, Taemin and Hong, Seongtae and Park, Jungbae and Lim, Heuiseok},
  year = {2025},
  booktitle = {The 37th Annual Conference on Human \& Cognitive Language Technology (HCLT)},
}

HCLT 2024

KoE5: A New Dataset and Model for Improving Korean Embedding Performance

Youngjoon Jang, Junyoung Son, Taemin Lee, and 3 more authors

In The 36th Annual Conference on Human & Cognitive Language Technology (HCLT), 2024

@inproceedings{jang2024koe5,
  title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
  author = {Jang, Youngjoon and Son, Junyoung and Lee, Taemin and Hong, Seongtae and Park, Jungbae and Lim, Heuiseok},
  year = {2024},
  booktitle = {The 36th Annual Conference on Human \& Cognitive Language Technology (HCLT)},
}

Preprint

Preprint

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author

Under Review, 2026

@article{jang2026mlaire,
  title = {MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol},
  author = {Jang, Youngjoon and Hong, Seongtae and Moon, Hyeonseok and Lim, Heuiseok},
  year = {2026},
  journal = {Under Review},
}

Preprint

MIMO: Multilingual Information Retrieval from Monolingual Oracles

Youngjoon Jang, Seongtae Hong, and Heuiseok Lim

Under Review, 2026

@article{jang2026mimo,
  title = {MIMO: Multilingual Information Retrieval from Monolingual Oracles},
  author = {Jang, Youngjoon and Hong, Seongtae and Lim, Heuiseok},
  year = {2026},
  journal = {Under Review},
}

Preprint

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author

Under Review, 2026

@article{jang2026shift,
  title = {SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval},
  author = {Jang, Youngjoon and Hong, Seongtae and Moon, Hyeonseok and Lim, Heuiseok},
  year = {2026},
  journal = {Under Review},
}

ArXiv 2025

VAETKI Technical Report

NC-AI Consortium

ArXiv Preprint, 2025