Youngjoon Jang

M.S. Student at NLP&AI Lab, Korea University


yjoonjang34@gmail.com

Hi, I’m Youngjoon. I’m a Master’s student in the NLP&AI Lab at Korea University, advised by Prof. Heuiseok Lim. Before this, I studied Mechanical Engineering & Computer Science at Hongik University.

I’m drawn to a deceptively simple question: how can I help people find the right information? That curiosity drives my work in Information Retrieval (dense, sparse, and late-interaction retrieval), Multilingual Information Retrieval, and Retrieval-Augmented Generation (RAG). My research has been published at SIGIR, ICLR, ACL, and EMNLP, while the Korean retrieval models and benchmarks I led have grown to 200+ GitHub stars and 1.3M+ downloads on Hugging Face.

I love building in the open, and I actively contribute to projects including Sentence-Transformers, MTEB, and InstructKR.


News

Apr 02, 2026 Our paper “Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval” has been accepted to SIGIR 2026 🎉
Mar 02, 2026 Our paper “Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment” has been accepted to ICLR 2026 🎉

Education

Korea University Mar. 2025 – Present
M.S. in Computer Science and Engineering (Advisor: Prof. Heuiseok Lim, NLP&AI Lab)
Hongik University Mar. 2020 – Feb. 2025
B.S. in Computer Engineering & Mechanical and System Design Engineering (Double Major)

Projects

KURE: Korea University Retrieval Embedding Model (GitHub · HF)
Flagship Korean retrieval project — SOTA dense retriever (1st on MTEB-ko-retrieval), 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads. HCLT 2025 Best Oral Presentation.

Korean ColBERT & Sparse Retrievers (colbert-ko-v1 · splade-ko-v1 · inference-free-splade-ko-v1)
Trained and open-sourced Korean ColBERT and SPLADE variants that achieve SOTA results among models of the same architecture on the Korean Retrieval Benchmark.

WBL: World Best LLM Project (HF)
Led the data team: built a query-clarity tagging & evaluation framework, and a reward-model-ensemble response-filtering pipeline for large-scale alignment data.

KT–Korea University Collaborative Research (Korean Legal LLM) (News)
End-to-end training recipe for a Korean legal-domain LLM, published as LEGALMIDM (ICLR 2026 Data-FM Workshop); directly contributed to KT's $10.42M contract for the South Korean Supreme Court AI platform.

PreRanker (GitHub · HF)
Trained a lightweight reranker that narrows candidate tools, reducing tool-call scope for LLM agents.

URACLE–Korea University Collaborative Research
Trained a Korean–English cross-lingual retrieval model; used model merging to recover monolingual retrieval performance while retaining the cross-lingual retrieval gains. Published at the ACL 2026 MeLLM Workshop.


Open Source Contributions

Sentence-Transformers

MTEB (Massive Text Embedding Benchmark)

InstructKR


Publications [Conference]

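  1. Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval
    Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author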
    In Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026), 2026
  2. Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment
    Seongtae Hong, Youngjoon Jang, Jungseob Lee, and 2 more authors
    In Proceedings of the International Conference on Learning Representations (ICLR 2026), 2026
  3. Youngjoon Jang, Chanhee Park, Hyeonseok Moon, and 5 more authors
    In International Conference on Learning Representations Addressing Data Problems for Foundation Models Workshop (ICLR Data-FM Workshop), 2026
  4. Seungyoon Lee, Minhyuk Kim, Seongtae Hong, and 3 more authors
    In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), 2026
  5. Youngjoon Jang, Junyoung Son, Taemin Lee, and 5 more authors
    In Annual Meeting of the Association for Computational Linguistics Workshop on Multilinguality in the Era of Large Language Models (ACL MeLLM Workshop), 2026
  6. Youngjoon Jang, Seongtae Hong, Junyoung Son, and 3 more authors
    In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (ACL 2025 SRW), 2025
  7. Seonmin Koo, Jinsung Kim, Youngjoon Jang, and 2 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), 2024

Publications [Domestic Conference]

  1. Youngjoon Jang, Junyoung Son, Taemin Lee, and 3 more authors
    In Annual Conference on Human & Cognitive Language Technology (HCLT 2025), 2025
  2. Youngjoon Jang, Junyoung Son, Taemin Lee, and 3 more authors
    In Annual Conference on Human & Cognitive Language Technology (HCLT 2024), 2024

Preprint

  1. Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author
    Under Review, 2026
  2. MIMO: Multilingual Information Retrieval from Monolingual Oracles
    Youngjoon Jang, Seongtae Hong, and Heuiseok Lim
    Under Review, 2026
  3. SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval
    Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, and 1 more author
    Under Review, 2026
  4. NC-AI Consortium
    ArXiv Preprint, 2025