CV
Curriculum Vitae of Youngjoon Jang — research, publications, open-source contributions, and projects in information retrieval and RAG.
Contact Information
| Name | Youngjoon Jang |
| Professional Title | Retrieval Researcher & Engineer |
| Email | yjoonjang34@gmail.com |
Professional Summary
Retrieval researcher and engineer specializing in dense, sparse, and late-interaction retrieval, multilingual information retrieval, and retrieval-augmented generation. Published at SIGIR, ICLR, ACL, and EMNLP; led the development of Korean retrieval models and benchmarks with 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads.
Education
-
2025 - Present Seoul, South Korea
M.S.
Korea University
Computer Science and Engineering
- NLP&AI Lab, advised by Prof. Heuiseok Lim
- Research focus: Information Retrieval, Multilingual IR, RAG
-
2020 - 2025 Seoul, South Korea
B.S.
Hongik University
Computer Engineering & Mechanical and System Design Engineering (Double Major)
Publications [Conference]
Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · SIGIR 2026
Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment: Seongtae Hong, Youngjoon Jang, Jungseob Lee, Hyeonseok Moon, Heuiseok Lim · ICLR 2026
LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model: Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim · ICLR 2026 Data-FM Workshop
CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim · ACL 2026
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems: Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim · ACL 2025 SRW
Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts: Seonmin Koo, Jinsung Kim, Youngjoon Jang, Chanjun Park, Heuiseok Lim · EMNLP 2024
Publications [Domestic Conference]
KURE: Embedding Model for Korean-Specific Retrieval: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2025 · Best Oral Presentation Award
KoE5: A New Dataset and Model for Improving Korean Embedding Performance: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2024
Preprint
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026
MIMO: Multilingual Information Retrieval from Monolingual Oracles: Youngjoon Jang, Seongtae Hong, Heuiseok Lim · Under Review · 2026
SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Hyeonseok Moon, Seungyoon Lee, Andrew Matteson, Heuiseok Lim · arXiv · 2025
VAETKI Technical Report: NC-AI Consortium · arXiv · 2025
Open Source Projects
-
Sentence-Transformers
- Extended the cross-encoder training stack with classic learning-to-rank losses (RankNetLoss, ListMLELoss, Position-Aware ListMLELoss) and runnable examples.
- Implemented EmbedDistillLoss for direct embedding-level knowledge distillation (PR #3665).
- Introduced hardness-weighted contrastive learning to up-weight informative hard negatives (PR #3667).
- Implemented CachedSpladeLoss for gradient-cache-compatible, memory-efficient SPLADE training (PR #3670).
-
MTEB (Massive Text Embeddings Benchmark)
- Added a Korean retrieval benchmark task (AutoRAGRetrieval) and polished task metadata and registration (PR #1388).
- Improved stability of the OpenAI embedding wrapper with sentence trimming and tighter dependency handling (PR #1526).
- Fixed NaN embeddings for Jasper models by switching from float16 to bfloat16 (PR #2481).
-
InstructKR
- Led the Korean Reranker evaluation and leaderboard project, establishing a standardized Korean benchmark suite for text reranking models.
Projects
-
WBL: World Best LLM Project
- Led the data team; built a query clarity evaluation framework with engineered GPT-5.2 prompts and a Qwen3-4B tagger.
- Established a response filtering pipeline ensembling three reward models with score fusion.
- Curated large-scale alignment samples using LLM-as-a-judge and code-execution metrics.
-
KURE: Korea University Retrieval Embedding Model
- Led the lab’s flagship Korean retrieval project; trained a SOTA dense retriever (1st on MTEB-ko-retrieval, Aug. 2025).
- Designed and maintained MTEB-ko-retrieval, a standardized public Korean IR leaderboard.
- Open-sourced the framework, which has 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads; the accompanying paper received the Best Oral Presentation Award at HCLT 2025.
-
Korean ColBERT & Sparse Retrievers
- Trained and open-sourced Korean ColBERT and SPLADE variants achieving SOTA among corresponding architectures (Feb. 2026).
- Provided optimized, reproducible pipelines for dense-sparse hybrid retrieval experiments.
-
KT–Korea University Collaborative Research (Korean Legal LLM)
- Developed an end-to-end training recipe for a Korean legal-domain LLM with expert-written and synthetic alignment data.
- Directly contributed to KT winning a $10.42M contract to build an AI platform for the South Korean Supreme Court.
- Published the methodology as LEGALMIDM at the ICLR 2026 Data-FM Workshop.
-
PreRanker
- Built a lightweight reranker to narrow candidate tools, reducing tool-call scope for LLM agents.
-
URACLE–Korea University Collaborative Research
- Trained a Korean–English cross-lingual retrieval model; used model merging to recover monolingual retrieval performance while retaining cross-lingual retrieval (CLIR) gains.
Skills
Languages: Python, Bash, SQL
Machine Learning: PyTorch, Hugging Face (Transformers, Sentence-Transformers, Accelerate), vLLM
Infrastructure & Tools: Docker, Linux, Git, Weights & Biases
Awards
-
2025 Best Oral Presentation Award
Annual Conference on Human & Cognitive Language Technology
Awarded for the paper: KURE: Embedding Model for Korean-Specific Retrieval
Languages
Korean: Native speaker
English: Fluent