Lang Cao

A passionate and self-motivated thinker, practitioner, researcher, and entrepreneur in AI.

Publications

2023

AutoRD: An Automatic and End-to-end System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models [Paper] [Code]

  • Lang Cao, Adam Cross, Jimeng Sun.
  • Under Review.
  • AbstractObjectives: We aim to build an automatic and end-to-end system which can extract information about rare diseases from text and building knowledge graph. In the system, large language models give the system strong language analysis ability, while medical ontologies make up the medical knowledge shortage of large language models. We investigate the performance of our system in multiple aspects and present the strengths and limitations of this system. Materials and Methods: The experimental data is from the public dataset RareDis. We develop a system called AutoRD, which comprises medical ontologies and large language models. The system is a pipeline structure: data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We quantitatively evaluate our system in entity extraction and relation extraction. We also show some results of knowledge graph construction. Results: AutoRD achieves an overall F1 score of 47.3% with an improvement of 0.8% compared to the fine-tuning model and a 14.4% improvement compared to the base LLM. Our qualitative experiment also demonstrates that the performance in constructing the knowledge graph is commendable. Several designs, including the incorporation of ontologies-enhanced LLMs, contribute to the improvement of AutoRD. Discussion: AutoRD demonstrates superior performance compared to other methods, demonstrating the potential of LLM applications in the healthcare field. Conclusion: We built AutoRD, an automatic, end-to-end system for extracting rare disease information from text to build knowledge graphs. It uses ontologies-enhanced LLMs for a robust medical knowledge base. The superior performance of AutoRD is validated by experimental evaluations, demonstrating potential of large language models in healthcare.

Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism [Paper] [Code]

  • Lang Cao (Independent Research).
  • Under Review.
  • AbstractLarge language models (LLMs) have demonstrated impressive language understanding and generation capabilities, enabling them to answer a wide range of questions across various domains. However, these models are not flawless and often produce responses that contain errors or misinformation. These inaccuracies, commonly referred to as hallucinations, render LLMs unreliable and even unusable in many scenarios. In this paper, our focus is on mitigating the issue of hallucination in LLMs, particularly in the context of question-answering. Instead of attempting to answer all questions, we explore a refusal mechanism that instructs LLMs to refuse to answer challenging questions in order to avoid errors. We then propose a simple yet effective solution called Learn to Refuse (L2R), which incorporates the refusal mechanism to enable LLMs to recognize and refuse to answer questions that they find difficult to address. To achieve this, we utilize a structured knowledge base to represent all the LLM's understanding of the world, enabling it to provide traceable gold knowledge. This knowledge base is separate from the LLM and initially empty, and it is progressively expanded with validated knowledge. When an LLM encounters questions outside its domain, the system recognizes its knowledge scope and determines whether it can answer the question independently. Additionally, we introduce a method for automatically and efficiently expanding the knowledge base of LLMs. Through qualitative and quantitative analysis, we demonstrate that our approach enhances the controllability and reliability of LLMs.
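As a rough illustration of the refusal idea, a system that answers only from a validated, initially empty knowledge base and refuses everything outside that scope might look like the sketch below. All names are hypothetical, and this is deliberately simplified: the paper's knowledge base backs an LLM rather than replacing it, and its scope check is more than an exact-match lookup.

```python
# Minimal sketch of a "learn to refuse" pattern: answer only from a
# validated knowledge base, refuse everything else (names hypothetical).
class RefusingQA:
    def __init__(self):
        self.knowledge = {}  # validated knowledge base, initially empty

    def learn(self, question, answer):
        # progressively expand the knowledge base with validated facts
        self.knowledge[question.lower()] = answer

    def answer(self, question):
        # questions outside the knowledge scope are refused, not guessed
        return self.knowledge.get(question.lower(), "I don't know.")

qa = RefusingQA()
qa.learn("What is the capital of France?", "Paris")
```

The point of the pattern is that every answer is traceable to a validated entry, so hallucination is traded for an explicit refusal.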

DiagGPT: An LLM-based Chatbot with Automatic Topic Management for Task-Oriented Dialogue [Paper] [Code]

  • Lang Cao (Independent Research).
  • Under Review.
  • AbstractLarge Language Models (LLMs), such as ChatGPT, are becoming increasingly sophisticated, demonstrating capabilities that closely resemble those of humans. These AI models are playing an essential role in assisting humans with a wide array of tasks in daily life. A significant application of AI is its use as a chat agent, responding to human inquiries across various domains. Current LLMs have shown proficiency in answering general questions. However, basic question-answering dialogue often falls short in complex diagnostic scenarios, such as legal or medical consultations. These scenarios typically necessitate Task-Oriented Dialogue (TOD), wherein an AI chat agent needs to proactively pose questions and guide users towards specific task completion. Previous fine-tuning models have underperformed in TOD, and current LLMs do not inherently possess this capability. In this paper, we introduce DiagGPT (Dialogue in Diagnosis GPT), an innovative method that extends LLMs to TOD scenarios. Our experiments reveal that DiagGPT exhibits outstanding performance in conducting TOD with users, demonstrating its potential for practical applications.

Enhancing Reasoning Capabilities of Large Language Models: A Graph-Based Verification Approach [Paper] [Code]

  • Lang Cao (Independent Research).
  • Under Review.
  • AbstractLarge Language Models (LLMs) have showcased impressive reasoning capabilities, particularly when guided by specifically designed prompts in complex reasoning tasks such as math word problems. These models typically solve tasks using a chain-of-thought approach, which not only bolsters their reasoning abilities but also provides valuable insights into their problem-solving process. However, there is still significant room for enhancing the reasoning abilities of LLMs. Some studies suggest that the integration of an LLM output verifier can boost reasoning accuracy without necessitating additional model training. In this paper, we follow these studies and introduce a novel graph-based method to further augment the reasoning capabilities of LLMs. We posit that multiple solutions to a reasoning task, generated by an LLM, can be represented as a reasoning graph due to the logical connections between intermediate steps from different reasoning paths. Therefore, we propose the Reasoning Graph Verifier (RGV) to analyze and verify the solutions generated by LLMs. By evaluating these graphs, models can yield more accurate and reliable results. Our experimental results show that our graph-based verification method not only significantly enhances the reasoning abilities of LLMs but also outperforms existing verifier methods in terms of improving these models' reasoning performance.
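The idea of merging several sampled solutions into one graph can be sketched with a toy scorer. The real RGV verifies the graph with a learned model, so the edge-counting heuristic below is only an illustration of the structure, and every name in it is hypothetical:

```python
# Toy sketch: treat each sampled solution as a sequence of intermediate
# steps, merge shared steps into one graph, and prefer the path whose
# edges are supported by the most solutions (illustrative heuristic only).
from collections import Counter

def build_reasoning_graph(solutions):
    # count how many solutions traverse each (step -> next step) edge
    edge_counts = Counter()
    for steps in solutions:
        for a, b in zip(steps, steps[1:]):
            edge_counts[(a, b)] += 1
    return edge_counts

def score_solution(steps, edge_counts):
    # a path whose edges recur across solutions scores higher
    return sum(edge_counts[(a, b)] for a, b in zip(steps, steps[1:]))

solutions = [
    ["x=2", "2*3=6", "answer=6"],
    ["x=2", "2*3=6", "answer=6"],
    ["x=3", "3*3=9", "answer=9"],
]
graph = build_reasoning_graph(solutions)
best = max(solutions, key=lambda s: score_solution(s, graph))
```

Here the two paths agreeing on intermediate steps reinforce each other through shared edges, which is what distinguishes a graph view from simple answer-level majority voting.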

AutoAM: An End-To-End Neural Model for Automatic and Universal Argument Mining [Paper] [Code]

  • Lang Cao (Independent Research).
  • In the 19th International Conference on Advanced Data Mining and Applications, ADMA 2023.
  • AbstractArgument mining analyzes argument structure and extracts important argument information from unstructured text. An argument mining system can help people automatically uncover the causal and logical information behind text. As argumentative corpora grow, with more and more people arguing and debating on social media, mining arguments from them is becoming increasingly important. However, argument mining remains a significant challenge in natural language processing, and related techniques are not yet mature. For example, non-tree argument mining is under-explored: most works focus only on extracting tree-structured argument information. Moreover, current methods cannot accurately describe and capture argument relations, and do not predict their types. In this paper, we propose a novel neural model called AutoAM to solve these problems. We first introduce an argument component attention mechanism, which captures the relevant information between argument components so that our model can better perform argument mining. Our model is a universal end-to-end framework that can analyze argument structure without constraints such as tree structure and complete the three subtasks of argument mining in one model. The experimental results show that our model outperforms existing works on several metrics on two public datasets.

PILOT: Legal Case Outcome Prediction with Case Law

  • Lang Cao, Zifeng Wang, Cao Xiao, Jimeng Sun.
  • Under Review.
  • AbstractMachine learning shows promise in predicting the outcome of legal cases, but most research has concentrated on civil law cases rather than case law systems. We identified two unique challenges in making legal case outcome predictions with case law. First, it is crucial to identify relevant precedent cases that serve as fundamental evidence for judges during decision-making. Second, it is necessary to consider the evolution of legal principles over time, as early cases may adhere to different legal contexts.
    In this paper, we propose a new model named PILOT (PredictIng Legal case OuTcome) for case outcome prediction. It comprises two modules for relevant case retrieval and temporal pattern handling, respectively. To benchmark the performance of existing legal case outcome prediction models, we curated a dataset from a large-scale case law database. We demonstrate the importance of accurately identifying precedent cases and mitigating the temporal shift when making predictions for case law, as our method shows a significant improvement over prior methods that focus on civil law case outcome predictions.

2021

CBCP: A Method of Causality Extraction from Unstructured Financial Text [Paper] [Code]

  • Lang Cao, Shihua Zhang, and Juxing Chen.
  • In 2021 5th International Conference on Natural Language Processing and Information Retrieval, NLPIR 2021.
  • AbstractExtracting causality information from unstructured natural language text is a challenging problem in natural language processing, and there are no mature dedicated causality extraction systems. Most people use basic sequence labeling methods, such as the BERT-CRF model, to extract causal elements from unstructured text, and the results are usually unsatisfactory. At the same time, there are a large number of causal event relations in the field of finance. If we can extract financial causality at scale, this information will help us better understand the relationships between financial events and build related event evolutionary graphs in the future. In this paper, we propose a causality extraction method for this problem, named CBCP (Center word-based BERT-CRF with Pattern extraction), which can directly extract cause elements and effect elements from unstructured text. Compared to the BERT-CRF model, our model incorporates the information of center words as prior conditions and performs better at entity extraction. Moreover, combining our method with pattern-based extraction further improves causality extraction. We then evaluate our method against the basic sequence labeling method and show that it outperforms other basic extraction methods on causality extraction tasks in the finance field. Finally, we summarize our work and outline directions for future work.
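The pattern component of such a method can be illustrated with a single toy regex rule. This is purely an example: the actual CBCP combines a center-word-conditioned BERT-CRF with its patterns, and the cue phrases and names below are hypothetical.

```python
import re

# Toy version of pattern-based cause/effect extraction (illustrative only;
# CBCP pairs patterns like this with a neural BERT-CRF extractor).
CAUSAL_PATTERN = re.compile(r"(.+?)\s+(?:causes|leads to|results in)\s+(.+)")

def extract_causality(sentence):
    # return a (cause, effect) pair if a causal cue phrase is found
    match = CAUSAL_PATTERN.match(sentence)
    if match:
        cause = match.group(1).strip()
        effect = match.group(2).strip().rstrip(".")
        return cause, effect
    return None
```

For example, `extract_causality("The rate hike leads to falling bond prices.")` splits the sentence at the cue phrase into a cause element and an effect element; sentences with no cue phrase yield no extraction, which is why pure pattern methods need a neural component for coverage.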

Intelligent Cross-sensing Sensor Based on Deep Learning [Paper]

  • Lingfei Xu, Jiaming Zhang, Lang Cao, and Xinyu Hu.
  • In 2021 6th IEEE International Conference on Signal and Image Processing, ICSIP 2021.
  • AbstractQualitative and quantitative detection of gases is of great importance in industrial automation, environmental protection, chemical control, and other fields. Low-cost and high-performance gas sensors have been developed, but single gas sensors have physical defects such as cross-sensitivity. In this paper, based on the principle of the electronic olfactory system, we combine a gas sensor array with neural networks, build a training device and training system for intelligent cross-sensing sensors, use the trained intelligent cross-sensing sensor system for gas identification and detection, and develop a supporting client for gas data visualization. The system achieves qualitative identification and quantitative analysis of multiple gases, and fuses BP and RBF neural networks into a more optimized algorithm model, which improves the accuracy of qualitative identification and the precision of quantitative analysis.

Clustering of Functionally Related Genes Using Machine Learning Techniques [Paper]

  • Yujing Xue and Lang Cao.
  • In 2021 5th International Conference on Compute and Data Analysis, ICCDA 2021.
  • AbstractThe clustering of functionally related genes has been an important task for biologists. With the recent progress of machine learning technology, researchers now have more powerful tools to identify the structures within large amounts of DNA sequencing data, allowing research on genes to be conducted in an efficient and scalable way. This paper studies the clustering of functionally related genes and their impact on the development and prognosis of lung cancer using machine learning technologies. Data derived from 218 patients are analyzed. We focus on two extreme cases: one includes patients who survived less than 1 year, and the other includes patients who survived longer than 5 years. We investigate how different clustering methods can assist in the visualization of the DNA sequence data of such patients, and how such methods can help us identify the underlying patterns of the DNA sequence data.