CV | Youngjoon Jang

Contact Information

Name	Youngjoon Jang
Professional Title	Retrieval Researcher & Engineer
Email	yjoonjang34@gmail.com

Professional Summary

Retrieval researcher and engineer specializing in dense, sparse, and late-interaction retrieval, multilingual information retrieval, and retrieval-augmented generation. Published at SIGIR, ICLR, ACL, and EMNLP; led the development of Korean retrieval models and benchmarks that have drawn 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads.

Education

M.S., Computer Science and Engineering · Korea University
2025 - Present · Seoul, South Korea
- NLP&AI Lab, advised by Prof. Heuiseok Lim
- Research focus: Information Retrieval, Multilingual IR, RAG

B.S., Computer Engineering & Mechanical and System Design Engineering (Double Major) · Hongik University
2020 - 2025 · Seoul, South Korea

Publications [Conference]

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · SIGIR 2026

Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment: Seongtae Hong, Youngjoon Jang, Jungseob Lee, Hyeonseok Moon, Heuiseok Lim · ICLR 2026

LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model: Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim · ICLR 2026 Data-FM Workshop

CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim · ACL 2026

From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems: Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim · ACL 2025 SRW

Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts: Seonmin Koo, Jinsung Kim, Youngjoon Jang, Chanjun Park, Heuiseok Lim · EMNLP 2024

Publications [Domestic Conference]

KURE: Embedding Model for Korean-Specific Retrieval: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2025 · Best Oral Presentation Award

KoE5: A New Dataset and Model for Improving Korean Embedding Performance: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2024

Preprints

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026

MIMO: Multilingual Information Retrieval from Monolingual Oracles: Youngjoon Jang, Seongtae Hong, Heuiseok Lim · Under Review · 2026

SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Hyeonseok Moon, Seungyoon Lee, Andrew Matteson, Heuiseok Lim · arXiv · 2025

VAETKI Technical Report: NC-AI Consortium · arXiv · 2025

Open Source Projects

Sentence-Transformers
- Extended the cross-encoder training stack with classic learning-to-rank losses (RankNetLoss, ListMLELoss, Position-Aware ListMLELoss) and runnable examples.
- Implemented EmbedDistillLoss for direct embedding-level knowledge distillation (PR #3665).
- Introduced hardness-weighted contrastive learning to up-weight informative hard negatives (PR #3667).
- Implemented CachedSpladeLoss for gradient-cache-compatible, memory-efficient SPLADE training (PR #3670).
MTEB (Massive Text Embeddings Benchmark)
- Added a Korean retrieval benchmark task (AutoRAGRetrieval) and polished task metadata and registration (PR #1388).
- Improved stability of the OpenAI embedding wrapper with sentence trimming and tighter dependency handling (PR #1526).
- Fixed NaN embeddings for Jasper models by switching from float16 to bfloat16 (PR #2481).
InstructKR
- Led the Korean Reranker evaluation and leaderboard project, establishing a standardized Korean benchmark suite for text reranking models.

Projects

WBL: World Best LLM Project
- Led the data team; built a query clarity evaluation framework with engineered GPT-5.2 prompts and a Qwen3-4B tagger.
- Built a response-filtering pipeline that ensembles three reward models via score fusion.
- Curated large-scale alignment samples using LLM-as-a-judge and code-execution metrics.
KURE: Korea University Retrieval Embedding Model
- Led the lab’s flagship Korean retrieval project; trained a SOTA dense retriever (1st on MTEB-ko-retrieval, Aug. 2025).
- Designed and maintained MTEB-ko-retrieval, a standardized public Korean IR leaderboard.
- Open-sourced the framework: 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads (Best Oral, HCLT 2025).
Korean ColBERT & Sparse Retrievers
- Trained and open-sourced Korean ColBERT and SPLADE variants achieving SOTA among corresponding architectures (Feb. 2026).
- Provided optimized, reproducible pipelines that support dense-sparse hybrid retrieval experiments.
KT–Korea University Collaborative Research (Korean Legal LLM)
- Developed an end-to-end training recipe for a Korean legal-domain LLM with expert-written and synthetic alignment data.
- Directly contributed to KT winning a $10.42M contract to build an AI platform for the South Korean Supreme Court.
- Published the methodology as LEGALMIDM at the ICLR 2026 Data-FM Workshop.
PreRanker
- Built a lightweight reranker to narrow candidate tools, reducing tool-call scope for LLM agents.
URACLE–Korea University Collaborative Research
- Trained a Korean–English cross-lingual retrieval model; used model merging to recover monolingual retrieval performance while retaining CLIR gains.

Skills

Languages: Python, Bash, SQL

Machine Learning: PyTorch, Hugging Face (Transformers, Sentence-Transformers, Accelerate), vLLM

Infrastructure & Tools: Docker, Linux, Git, Weights & Biases

Awards

2025

Best Oral Presentation Award

Annual Conference on Human & Cognitive Language Technology

Awarded for the paper: KURE: Embedding Model for Korean-Specific Retrieval

Languages

Korean: Native speaker

English: Fluent
