Lang Cao

A passionate and self-motivated thinker, practitioner, researcher, and entrepreneur in AI.

Hi! 👋 I'm Lang Cao, a recent graduate with a Master of Science (M.S.) in Computer Science from the University of Illinois Urbana-Champaign 🎓. I am currently working at Microsoft China. As a passionate and self-motivated thinker 🤔, practitioner, researcher, and entrepreneur in the field of Artificial Intelligence 🤖✨, I am deeply fascinated by novel and promising advances that contribute to building advanced AI systems for the benefit of humanity ⚡.

During my master’s studies, I had the privilege of collaborating with Prof. Jimeng Sun at SunLab ☀️, UIUC. My research focused on leveraging natural language processing and machine learning techniques to improve healthcare processes 🏥, such as automating information extraction from healthcare literature 📚 and streamlining clinical trial workflows 💉📊.

As a researcher, I am dedicated to enhancing AI’s ability 🤖 to better understand and utilize human language, empowering these systems to interact with and positively impact the real world 🌎.

I am currently open to research collaborations and technical consulting opportunities in AI. Feel free to reach out via email, and don't hesitate to contact me with any questions about my research!

Research Interests

  • Machine Learning (Natural Language Processing, Multi-modal Learning, etc.)
  • Large Language Models (Alignment, Efficiency, Reasoning, Agents, Applications, etc.)
  • Machine Learning Applications (AI for Healthcare, AI for Law, etc.)

Currently, I am focusing on enhancing the reasoning capabilities of LLMs for tabular tasks and achieving effective alignment in multimodal LLMs.

News

  • 11/11/2024: I am now applying to CS PhD programs for Fall 2025.
  • 08/20/2024: I joined Microsoft AI Asia to conduct research on spreadsheet intelligence and table LLMs.
  • 02/01/2023: I joined SunLab as a research assistant.

Selected Publications

Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism [Paper] [Code]

  • Lang Cao (Independent Research).
  • EMNLP 2024 Main Conference, 2024 Conference on Empirical Methods in Natural Language Processing.
  • Abstract: Large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, enabling them to answer a wide range of questions across various domains. However, these models are not flawless and often produce responses that contain errors or misinformation. These inaccuracies, commonly referred to as hallucinations, render LLMs unreliable and even unusable in many scenarios. In this paper, our focus is on mitigating the issue of hallucination in LLMs, particularly in the context of question answering. Instead of attempting to answer all questions, we explore a refusal mechanism that instructs LLMs to refuse to answer challenging questions in order to avoid errors. We then propose a simple yet effective solution called Learn to Refuse (L2R), which incorporates the refusal mechanism to enable LLMs to recognize and refuse to answer questions that they find difficult to address. To achieve this, we utilize a structured knowledge base to represent all the LLM's understanding of the world, enabling it to provide traceable gold knowledge. This knowledge base is separate from the LLM and initially empty. It can be filled with validated knowledge and progressively expanded. When an LLM encounters questions outside its domain, the system recognizes its knowledge scope and determines whether it can answer the question independently. Additionally, we introduce a method for automatically and efficiently expanding the knowledge base of LLMs. Through qualitative and quantitative analysis, we demonstrate that our approach enhances the controllability and reliability of LLMs.
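If you are curious what the refusal mechanism looks like in practice, below is a minimal, self-contained sketch of the loop. The knowledge-base format, the token-overlap scope check, and the threshold are my illustrative assumptions for this page, not the implementation from the paper.

```python
# Toy sketch of the Learn to Refuse (L2R) loop: answer only from a
# validated knowledge base; refuse anything outside its scope.
# All details here (storage, scope check, threshold) are illustrative.
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Structured store of validated knowledge, initially empty."""
    facts: dict[str, str] = field(default_factory=dict)

    def add(self, topic: str, fact: str) -> None:
        # Progressively expand the base with validated knowledge.
        self.facts[topic] = fact

    def lookup(self, question: str) -> str | None:
        # Toy scope check: token overlap between question and stored topics.
        q_tokens = set(question.lower().replace("?", "").split())
        best_topic, best_overlap = None, 0
        for topic in self.facts:
            overlap = len(q_tokens & set(topic.lower().split()))
            if overlap > best_overlap:
                best_topic, best_overlap = topic, overlap
        return self.facts[best_topic] if best_topic else None


def answer_or_refuse(kb: KnowledgeBase, question: str) -> str:
    """Answer from traceable gold knowledge, or refuse when out of scope."""
    fact = kb.lookup(question)
    if fact is None:
        return "I am not confident I can answer that, so I will refuse."
    return f"Based on my knowledge base: {fact}"


kb = KnowledgeBase()
kb.add("capital of France", "The capital of France is Paris.")
print(answer_or_refuse(kb, "What is the capital of France?"))   # answered
print(answer_or_refuse(kb, "Who won the 2031 World Cup?"))      # refused
```

The point of the design is that every answer is traceable to validated knowledge; anything the base cannot cover falls into the refusal branch instead of becoming a hallucination.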

PILOT: Legal Case Outcome Prediction with Case Law

  • Lang Cao, Zifeng Wang, Cao Xiao, Jimeng Sun.
  • NAACL 2024 Main Conference, 2024 Conference of the North American Chapter of the Association for Computational Linguistics.
  • Abstract: Machine learning shows promise in predicting the outcome of legal cases, but most research has concentrated on civil law cases rather than case law systems. We identified two unique challenges in making legal case outcome predictions with case law. First, it is crucial to identify relevant precedent cases that serve as fundamental evidence for judges during decision-making. Second, it is necessary to consider the evolution of legal principles over time, as early cases may adhere to different legal contexts.
    In this paper, we propose a new model named PILOT (PredictIng Legal case OuTcome) for case outcome prediction. It comprises two modules for relevant case retrieval and temporal pattern handling, respectively. To benchmark the performance of existing legal case outcome prediction models, we curated a dataset from a large-scale case law database. We demonstrate the importance of accurately identifying precedent cases and mitigating the temporal shift when making predictions for case law, as our method shows a significant improvement over prior methods that focus on civil law case outcome predictions.
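As a rough illustration of the two modules, here is a toy sketch of precedent retrieval plus temporal weighting. The Jaccard similarity, the exponential decay, and the data format are assumptions I made for this sketch; the actual PILOT modules are learned.

```python
# Toy sketch of PILOT's two ingredients: retrieve relevant precedents,
# then down-weight older ones to mitigate temporal shift.
import math
from dataclasses import dataclass


@dataclass
class PrecedentCase:
    text: str
    year: int
    outcome: int  # e.g. 1 = claim upheld, 0 = claim rejected


def similarity(a: str, b: str) -> float:
    """Toy lexical (Jaccard) similarity; a real system would use learned embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def predict_outcome(query: str, query_year: int,
                    precedents: list[PrecedentCase],
                    k: int = 3, decay: float = 0.1) -> int:
    """Weighted vote over the top-k precedents, discounted by age."""
    top_k = sorted(precedents, key=lambda c: similarity(query, c.text),
                   reverse=True)[:k]
    vote = 0.0
    for case in top_k:
        # Exponential decay models the evolution of legal principles over time.
        age = max(query_year - case.year, 0)
        weight = similarity(query, case.text) * math.exp(-decay * age)
        vote += weight if case.outcome == 1 else -weight
    return 1 if vote > 0 else 0
```

Even in this toy form you can see why ignoring the time axis is risky: two equally similar precedents can pull the prediction in opposite directions, and recency decides which one should dominate.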

AutoRD: An Automatic and End-to-end System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models [Paper] [Code]

  • Lang Cao, Adam Cross, Jimeng Sun.
  • JMIR Medical Informatics.
  • Abstract: Objectives: We aim to build an automatic, end-to-end system that can extract information about rare diseases from text and build a knowledge graph. In the system, large language models provide strong language analysis ability, while medical ontologies compensate for the medical knowledge that large language models lack. We investigate the performance of our system in multiple aspects and present the strengths and limitations of this system. Materials and Methods: The experimental data come from the public dataset RareDis. We develop a system called AutoRD, which comprises medical ontologies and large language models. The system is a pipeline structure: data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We quantitatively evaluate our system on entity extraction and relation extraction. We also show some results of knowledge graph construction. Results: AutoRD achieves an overall F1 score of 47.3%, an improvement of 0.8% over the fine-tuned model and of 14.4% over the base LLM. Our qualitative experiment also demonstrates that the performance in constructing the knowledge graph is commendable. Several designs, including the incorporation of ontologies-enhanced LLMs, contribute to the improvement of AutoRD. Discussion: AutoRD demonstrates superior performance compared to other methods, showing the potential of LLM applications in the healthcare field. Conclusion: We built AutoRD, an automatic, end-to-end system for extracting rare disease information from text to build knowledge graphs. It uses ontologies-enhanced LLMs for a robust medical knowledge base. The superior performance of AutoRD is validated by experimental evaluations, demonstrating the potential of large language models in healthcare.
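Here is a compact sketch of the pipeline stages named in the abstract. The `call_llm` stub, the prompts, and the tiny ontology are placeholders I invented for illustration; the real system wires actual LLM calls to full medical ontologies.

```python
# Toy sketch of the AutoRD pipeline: entity extraction -> relation
# extraction -> entity calibration (against an ontology) -> graph triples.
# The stubbed LLM and the two-entry ontology are illustrative only.

# Toy ontology: surface form -> canonical term.
ONTOLOGY = {"mg": "Myasthenia gravis", "myasthenia gravis": "Myasthenia gravis"}


def call_llm(prompt: str) -> list[str]:
    """Placeholder for an LLM call; returns canned output for the demo."""
    if prompt.startswith("List entities"):
        return ["MG", "muscle weakness"]
    return ["MG|causes|muscle weakness"]


def extract_entities(text: str) -> list[str]:
    return call_llm(f"List entities (diseases, symptoms) in: {text}")


def extract_relations(text: str, entities: list[str]) -> list[tuple[str, str, str]]:
    triples = call_llm(f"List relations among {entities} in: {text}")
    return [tuple(t.split("|")) for t in triples]


def calibrate(entity: str) -> str:
    """Entity calibration: map a surface form to its canonical ontology term."""
    return ONTOLOGY.get(entity.lower(), entity)


def build_graph(text: str) -> list[tuple[str, str, str]]:
    entities = extract_entities(text)
    return [(calibrate(h), rel, calibrate(t))
            for h, rel, t in extract_relations(text, entities)]


print(build_graph("MG often presents with muscle weakness."))
# [('Myasthenia gravis', 'causes', 'muscle weakness')]
```

The calibration step is where the ontology earns its keep: the LLM can surface noisy mentions like "MG", and the ontology maps them back to canonical medical terms before they enter the graph.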

KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge [Paper] [Code]

  • Pengcheng Jiang, Lang Cao, Cao Xiao, Parminder Bhatia, Jimeng Sun, Jiawei Han.
  • NeurIPS 2024, The Thirty-Eighth Annual Conference on Neural Information Processing Systems.
  • Abstract: Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.
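To make the intuition concrete, here is a toy version of the joint objective: a standard structure-based score (TransE in this sketch) plus a term that pulls each entity toward the centroid of its LLM-derived cluster. The exact loss in the paper is different; this is only meant to convey the idea of combining local graph structure with global LLM semantics.

```python
# Toy sketch of the KG-FIT intuition: combine a structure-based KGE score
# with a hierarchy-cohesion term from LLM-guided entity clusters.
# TransE and the quadratic penalty are illustrative choices, not the paper's loss.
import numpy as np


def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """Local KG structure: lower ||h + r - t|| means a more plausible triple."""
    return float(np.linalg.norm(h + r - t))


def cluster_cohesion(entity: np.ndarray, centroid: np.ndarray) -> float:
    """Global LLM semantics: keep an entity near its cluster centroid."""
    return float(np.linalg.norm(entity - centroid) ** 2)


def kgfit_loss(h, r, t, h_centroid, t_centroid, lam: float = 0.1) -> float:
    """Joint objective balancing graph structure and hierarchical knowledge."""
    structure = transe_score(h, r, t)
    hierarchy = cluster_cohesion(h, h_centroid) + cluster_cohesion(t, t_centroid)
    return structure + lam * hierarchy


rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=16) for _ in range(3))
centroid = rng.normal(size=16)
print(kgfit_loss(h, r, t, centroid, centroid))
```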

Miscellaneous

Swimming, Soccer, Music, Cooking, Rubik’s cube…