CV

Curriculum Vitae of Youngjoon Jang — research, publications, open-source contributions, and projects in information retrieval and RAG.

Contact Information

Name: Youngjoon Jang
Professional Title: Retrieval Researcher & Engineer
Email: yjoonjang34@gmail.com

Professional Summary

Retrieval researcher and engineer specializing in dense, sparse, and late-interaction retrieval, multilingual information retrieval, and retrieval-augmented generation (RAG). Published at SIGIR, ICLR, ACL, and EMNLP; led the development of Korean retrieval models and benchmarks, with 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads.

Education

  • 2025 - Present

    Seoul, South Korea

    M.S.
    Korea University
    Computer Science and Engineering
    • NLP&AI Lab, advised by Prof. Heuiseok Lim
    • Research focus: Information Retrieval, Multilingual IR, RAG
  • 2020 - 2025

    Seoul, South Korea

    B.S.
    Hongik University
    Computer Engineering & Mechanical and System Design Engineering (Double Major)

Publications [Conference]

Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · SIGIR 2026
Improving Semantic Proximity in Information Retrieval through Cross-Lingual Alignment: Seongtae Hong, Youngjoon Jang, Jungseob Lee, Hyeonseok Moon, Heuiseok Lim · ICLR 2026
LEGALMIDM: Use-Case-Driven Legal Domain Specialization for Korean Large Language Model: Youngjoon Jang, Chanhee Park, Hyeonseok Moon, Young-kyoung Ham, Jiwon Moon, Jinhyeon Kim, JuKyung Jung, Heuiseok Lim · ICLR 2026 Data-FM Workshop
CLEAR: Cross-Lingual Enhancement in Retrieval via Reverse-training: Seungyoon Lee, Minhyuk Kim, Seongtae Hong, Youngjoon Jang, Dongsuk Oh, Heuiseok Lim · ACL 2026
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems: Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim · ACL 2025 SRW
Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts: Seonmin Koo, Jinsung Kim, Youngjoon Jang, Chanjun Park, Heuiseok Lim · EMNLP 2024

Publications [Domestic Conference]

KURE: Embedding Model for Korean-Specific Retrieval: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2025 · Best Oral Presentation Award
KoE5: A New Dataset and Model for Improving Korean Embedding Performance: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Jungbae Park, Heuiseok Lim · HCLT 2024

Preprint

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026
MIMO: Multilingual Information Retrieval from Monolingual Oracles: Youngjoon Jang, Seongtae Hong, Heuiseok Lim · Under Review · 2026
SHIFT: Semantic Harmonization via Index-side Feature Transformation for Multilingual Information Retrieval: Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim · Under Review · 2026
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging: Youngjoon Jang, Junyoung Son, Taemin Lee, Seongtae Hong, Hyeonseok Moon, Seungyoon Lee, Andrew Matteson, Heuiseok Lim · arXiv · 2025
VAETKI Technical Report: NC-AI Consortium · arXiv · 2025

Open Source Projects

  • Sentence-Transformers
    • Extended the cross-encoder training stack with classic learning-to-rank losses (RankNetLoss, ListMLELoss, Position-Aware ListMLELoss) and runnable examples.
    • Implemented EmbedDistillLoss for direct embedding-level knowledge distillation (PR #3665).
    • Introduced hardness-weighted contrastive learning to up-weight informative hard negatives (PR #3667).
    • Implemented CachedSpladeLoss for gradient-cache-compatible, memory-efficient SPLADE training (PR #3670).
  • MTEB (Massive Text Embedding Benchmark)
    • Added a Korean retrieval benchmark task (AutoRAGRetrieval) and polished task metadata and registration (PR #1388).
    • Improved stability of the OpenAI embedding wrapper with sentence trimming and tighter dependency handling (PR #1526).
    • Fixed NaN embeddings for Jasper models by switching from float16 to bfloat16 (PR #2481).
  • InstructKR
    • Led the Korean Reranker evaluation and leaderboard project, establishing a standardized Korean benchmark suite for text reranking models.

Projects

  • WBL: World Best LLM Project
    • Led the data team; built a query clarity evaluation framework with engineered GPT-5.2 prompts and a Qwen3-4B tagger.
    • Established a response filtering pipeline ensembling three reward models with score fusion.
    • Curated large-scale alignment samples using LLM-as-a-judge and code-execution metrics.
  • KURE: Korea University Retrieval Embedding Model
    • Led the lab’s flagship Korean retrieval project; trained a SOTA dense retriever (1st on MTEB-ko-retrieval, Aug. 2025).
    • Designed and maintained MTEB-ko-retrieval, a standardized public Korean IR leaderboard.
    • Open-sourced the framework: 200+ GitHub stars and 1.3M+ cumulative Hugging Face downloads (Best Oral, HCLT 2025).
  • Korean ColBERT & Sparse Retrievers
    • Trained and open-sourced Korean ColBERT and SPLADE variants that achieved SOTA results among models of the same architecture (Feb. 2026).
    • Provided optimized, reproducible pipelines for dense-sparse hybrid retrieval experiments.
  • KT–Korea University Collaborative Research (Korean Legal LLM)
    • Developed an end-to-end training recipe for a Korean legal-domain LLM with expert-written and synthetic alignment data.
    • Directly contributed to KT winning a $10.42M contract to build an AI platform for the South Korean Supreme Court.
    • Published the methodology as LEGALMIDM at the ICLR 2026 Data-FM Workshop.
  • PreRanker
    • Built a lightweight reranker to narrow candidate tools, reducing tool-call scope for LLM agents.
  • URACLE–Korea University Collaborative Research
    • Trained a Korean–English cross-lingual retrieval model; used model merging to recover monolingual retrieval performance while retaining the cross-lingual retrieval (CLIR) gains.

Skills

Languages: Python, Bash, SQL
Machine Learning: PyTorch, Hugging Face (Transformers, Sentence-Transformers, Accelerate), vLLM
Infrastructure & Tools: Docker, Linux, Git, Weights & Biases

Awards

  • 2025
    Best Oral Presentation Award
    Annual Conference on Human & Cognitive Language Technology

    Awarded for the paper: KURE: Embedding Model for Korean-Specific Retrieval

Languages

Korean: Native speaker
English: Fluent