# Document RAG System.

In [None]:
# Install the required libraries
!pip -q install langchain langchain-google-genai langchain-community google-genai faiss-cpu tiktoken python-dotenv pypdf langchain-huggingface sentence-transformers

In [None]:
"""Google Colab environment setup"""
# import os
# from google.colab import userdata

# # Set environment variables for Google and Hugging Face

# os.environ['GOOGLE_API_KEY'] = userdata.get("GOOGLE_API_KEY")
# os.environ['HUGGINGFACEHUB_ACCESS_TOKEN'] = userdata.get("HUGGINGFACEHUB_ACCESS_TOKEN")

## Load keys.

In [None]:
import os
from dotenv import load_dotenv

load_dotenv()

openai_api_key = os.getenv("OPENAI_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

print("OpenAI key loaded:", bool(openai_api_key))
# print("OpenAI key:", openai_api_key)

print("\nGemini key loaded:", bool(gemini_api_key))
# print("Gemini key:", gemini_api_key)


OpenAI key loaded: True

Gemini key loaded: True
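
Printing `True`/`False` confirms the keys were found, but a missing key only surfaces later as an authentication error. A minimal sketch of a fail-fast check is shown here; `require_env` is a hypothetical helper, not part of `python-dotenv` or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value


# Possible usage after load_dotenv() (assumes the key name used in this notebook):
# gemini_api_key = require_env("GEMINI_API_KEY")
```

Raising early keeps the failure next to its cause instead of inside the first model call.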


In [2]:
# import necessary libraries

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI, GoogleGenerativeAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_community.document_loaders import PyPDFLoader
from langchain_huggingface import HuggingFaceEmbeddings


## Test key with prompt.

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI

LLM = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=gemini_api_key
)

response = LLM.invoke("Hello")
print(response)



content='Hello! How can I help you today?' additional_kwargs={} response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.5-flash', 'safety_ratings': []} id='run--0c2a5214-4f37-43b5-b432-836a3abe7058-0' usage_metadata={'input_tokens': 2, 'output_tokens': 46, 'total_tokens': 48, 'input_token_details': {'cache_read': 0}, 'output_token_details': {'reasoning': 37}}


In [53]:
print(response.content)

Hello! How can I help you today?


## Load document.

In [None]:
# load and read PDF file

load_document = PyPDFLoader("dataset/ChenZhang_cropmapping_ReviewPaper.pdf")
document = load_document.load()

In [7]:
len(document)

29

In [8]:
# First page of PDF
print(document[0].page_content) 

Review
Remote sensing for crop mapping: A perspective on current and future 
crop-specific land cover data products
Chen Zhang
a , *
, Hannah Kerner
b
, Sherrie Wang
c
, Pengyu Hao
d
, Zhe Li
e
, Kevin A. Hunt
e
,  
Jonathon Abernethy
e
, Haoteng Zhao
f
, Feng Gao
f
, Liping Di
a , *
, Claire Guo
a , g
, Ziao Liu
a
,  
Zhengwei Yang
e
, Rick Mueller
e
, Claire Boryan
e
, Qi Chen
h
, Peter C. Beeson
i
, Hankui K. Zhang
j
,  
Yu Shen
j , k
a
Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA 22030, USA
b
School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ 85281, USA
c
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
d
Food and Agriculture Organization of the United Nations, Viale delle Terme di Caracalla, 00153 Rome, Italy
e
U.S. Department of Agriculture, National Agricultural Statistics Service, Washington, DC 20250, USA
f
U.S. Department of Agriculture, Agricultur

In [9]:
# 10th page of PDF
print(document[9].page_content)

the WoS database, including the publication title, abstract, or keywords. 
In our survey, we found that many papers introduced, discussed, or cited 
CDL, but did not directly use the data in their experiments. Therefore, 
IC1 could ensure that CDL has been applied in the selected publications, 
rather than simply mentioning it in passing.
To narrow down the publications to those specifically related to 
remote sensing, IC2 states that the publication ’ s “ Category ” field in the 
WoS database must be labeled as “ remote sensing ” . However, many 
publications related to remote sensing were published in computer sci -
ence, agricultural, or multidisciplinary journals, which were not cate -
gorized as “ remote sensing ” . To include these publications in this 
review, we added a rule that requires the presence of certain terms, such 
as “ Remote Sensing ” , “ Earth observation ” , “ Landsat ” , “ Sentinel ” , or 
“ MODIS ” in any of the title, keywords, or abstract of the publication.
T
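
The page text above shows typical PDF extraction artifacts: words split across lines with a spaced hyphen ("sci -" / "ence") and padded curly quotes. The sketch below is one heuristic way to tidy such text before splitting. `clean_pdf_text` is a hypothetical helper, and the patterns are assumptions drawn only from the output shown in this notebook, so other PDFs may need different rules.

```python
import re


def clean_pdf_text(text: str) -> str:
    """Undo line-wrap artifacts seen in this PDF's extracted text (heuristic only)."""
    # Join words split by a spaced hyphen at a line end: "sci -\nence" -> "science".
    # A hyphen with no space before it ("high- \nquality") is left alone, since in
    # this document that form appears to mark a genuine compound word.
    text = re.sub(r"(\w) -[ \t]*\n(\w)", r"\1\2", text)
    # Remove the padding inside curly quotes: "“ remote sensing ”" -> "“remote sensing”".
    text = re.sub(r"“\s+", "“", text)
    text = re.sub(r"\s+”", "”", text)
    return text


print(clean_pdf_text("published in computer sci -\nence journals"))
```

Cleaning before chunking matters because a split word such as "sci" / "ence" will not match the embedding of "science" well.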

In [10]:
# 14th page of PDF
print(document[13].page_content)

et al., 2013 ). CDL data also have been used to delineate and stratify 
regions, such as U.S. soybean growing areas ( Song et al., 2017 ), which 
helps in understanding field size patterns for more effective agricultural 
resource management.
Training samples: Beyond a crop type map, CDL is widely utilized 
as an authoritative geospatial benchmark to support field-level crop 
spectral signature training. The ML models trained with high-confidence 
pixels in CDL and associated products (e.g., CSB, Confidence Layer) can 
be applied to extend land cover classification while adjusting for factors 
such as hemisphere seasonality and evolving farming trends, which is 
invaluable for global crop monitoring. As discussed in RQ2, ML and DL 
are the main technologies in remote sensing studies, which rely on high- 
quality training data. Due to the extensive crop-specific land cover in -
formation, CDL has been extensively used to label training samples in EO 
data. This enables the further super

In [11]:
# 16th page of PDF
print(document[15].page_content)

and Kerner, 2023 ). Ground-truthing involves physically visiting agri -
cultural fields and recording the type of crop growing in the field. This 
process is prohibitively expensive and logistically challenging for many 
organizations and regions.
Currently available public reference samples are largely regional in 
scope ( Dufourg et al., 2023 ; Kondmann et al., 2021 ). Recent work has 
proposed novel methods of collecting ground-truth crop labels that 
reduce the cost of data collection. Paliyam et al. (2021) proposed a 
method called Street2Sat that uses computer vision (CV) techniques to 
transform roadside images of fields collected with car- and motorcycle 
helmet-mounted cameras into geo-referenced crop type labels of those 
fields. d’Andrimont et al. (2022) used CV techniques to extract crop type 
and phenology information from street-level images of fields taken with 
car-mounted cameras in the Netherlands. Yan and Ryu (2021) and Soler 
et al. (2024) used DL models to automati

## Split texts.

In [12]:
# split into chunks

doc_split = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = doc_split.split_documents(document)

In [13]:
len(chunks)

268

In [14]:
# display 4th chunk
print(chunks[3].page_content)

crop mapping from the perspective of crop-specific land cover data by evaluating over 60 open-access opera -
tional products, archival crop type map datasets, single-crop extent map datasets, cropping pattern datasets, and 
crop mapping platforms and systems. Using the Cropland Data Layer (CDL) – one of the most widely used 
products with over 25 years of continuous monitoring of U.S. croplands – as a case study, we also conduct a 
systematic literature review on the application of crop type maps in remote sensing science. Our analysis syn -
thesizes 129 research articles through three core research questions: (1) What EO data are used with CDL; (2) 
What scientific problems and technologies are explored using CDL; and (3) What role does CDL play in remote 
sensing applications. Furthermore, we delve into the implications of our vision for new data products and 
propose emerging research topics, ranging from extending the spatiotemporal coverage of current data products


In [15]:
# display 5th chunk
print(chunks[4].page_content)

propose emerging research topics, ranging from extending the spatiotemporal coverage of current data products 
to improving global mapping reliability and developing operational in-season crop mapping systems. This review 
paper not only serves as a reference for stakeholders seeking to utilize crop-specific land cover data in their work, 
but also outlines the directions for future geospatial data product development.
* Corresponding authors.
E-mail addresses: czhang11@gmu.edu (C. Zhang), hkerner@asu.edu (H. Kerner), sherwang@mit.edu (S. Wang), pengyu.hao@fao.org (P. Hao), zhe.li@usda.gov
(Z. Li), kevin.a.hunt@usda.gov (K.A. Hunt), jake.abernethy@usda.gov (J. Abernethy), haoteng.zhao@usda.gov (H. Zhao), feng.gao@usda.gov (F. Gao), ldi@gmu. 
edu (L. Di), zliu23@gmu.edu (Z. Liu), zhengwei.yang@usda.gov (Z. Yang), rick.mueller@usda.gov (R. Mueller), claire.boryan@usda.gov (C. Boryan), qichen@
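
The 4th chunk ends with the same sentence that opens the 5th chunk; that repetition is the effect of `chunk_overlap`. The sketch below illustrates the idea with a plain fixed-size sliding window. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and lines, which this hypothetical `sliding_window_chunks` function does not attempt.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows where each window repeats the tail of the last."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.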


## Vector Store Creation.

Generate document embeddings and build a FAISS vector store for efficient similarity-based retrieval.
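
To make "similarity-based retrieval" concrete, the sketch below does a brute-force nearest-neighbour search over a few toy vectors. The vectors and the function names are hypothetical. Euclidean distance is used on the assumption that the default FAISS index created by LangChain is L2-based; that is worth confirming against the installed version. A real FAISS index performs the same kind of lookup far more efficiently.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors closest to the query (smallest distance first)."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    return ranked[:k]


# Toy 2-D "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
print(nearest_neighbors([1.0, 0.0], toy_vectors, k=2))
# → [1, 2]
```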

In [16]:
embeds = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_store = FAISS.from_documents(chunks, embeds)



## Retrieval.

In [17]:
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

In [18]:
retriever

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x76fcba598e00>, search_kwargs={'k': 5})

In [19]:
# test retriever 
retriever.invoke("what is the main topic of the document?")

[Document(id='50bfaa32-168f-4e5e-94b6-27d2668d4ef5', metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'Elsevier', 'creationdate': '2025-09-02T19:45:23+00:00', 'crossmarkdomains[1]': 'elsevier.com', 'creationdate--text': '2nd September 2025', 'robots': 'noindex', 'elsevierwebpdfspecifications': '7.0.1', 'moddate': '2025-09-02T20:26:19+00:00', 'doi': '10.1016/j.rse.2025.114995', 'title': 'Remote sensing for crop mapping: A perspective on current and future crop-specific land cover data products', 'keywords': 'Crop mapping,Land use land cover,Geospatial data product,Systematic literature review,Cropland data layer', 'subject': 'Remote Sensing of Environment, 330 (2025) 114995. doi:10.1016/j.rse.2025.114995', 'crossmarkdomains[2]': 'sciencedirect.com', 'author': 'Chen Zhang', 'source': 'ChenZhang_cropmapping_ReviewPaper.pdf', 'total_pages': 29, 'page': 9, 'page_label': '10'}, page_content='fields (title, abstract, keywords), WC represents WoS categories, and DT \nrepre

## Augmentation.

In [None]:
LLM_gen = GoogleGenerativeAI(model="models/gemini-1.5-flash", google_api_key=gemini_api_key)


In [None]:
prompt = PromptTemplate(
    template = """
    You are a helpful assistant.
    Answer ONLY from the provided transcript context.
    If the context IS INSUFFICIENT, just say you don't know and probably need more information.

    {context}

    Question: {question}
    """,
    input_variables=["context","question"]
)
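
As a rough mental model, filling the template amounts to substituting the two named placeholders into a string. The sketch below reproduces that with plain `str.format`. The shortened `TEMPLATE` and the `fill_prompt` helper are hypothetical stand-ins, not the `PromptTemplate` implementation.

```python
TEMPLATE = """Answer ONLY from the provided context.

{context}

Question: {question}"""


def fill_prompt(context: str, question: str) -> str:
    """Substitute the retrieved context and the user question into the template."""
    return TEMPLATE.format(context=context, question=question)


print(fill_prompt("CDL is a crop-specific land cover product.", "What is CDL?"))
```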

In [30]:
question = "Is the aspect of stars mentioned in this document provided? If yes, explain what was discussed?"
retrieved_documents = retriever.invoke(question)

In [31]:
retrieved_documents

[Document(id='b700e905-8222-4ec0-8131-bee3ac9f51ca', metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'Elsevier', 'creationdate': '2025-09-02T19:45:23+00:00', 'crossmarkdomains[1]': 'elsevier.com', 'creationdate--text': '2nd September 2025', 'robots': 'noindex', 'elsevierwebpdfspecifications': '7.0.1', 'moddate': '2025-09-02T20:26:19+00:00', 'doi': '10.1016/j.rse.2025.114995', 'title': 'Remote sensing for crop mapping: A perspective on current and future crop-specific land cover data products', 'keywords': 'Crop mapping,Land use land cover,Geospatial data product,Systematic literature review,Cropland data layer', 'subject': 'Remote Sensing of Environment, 330 (2025) 114995. doi:10.1016/j.rse.2025.114995', 'crossmarkdomains[2]': 'sciencedirect.com', 'author': 'Chen Zhang', 'source': 'ChenZhang_cropmapping_ReviewPaper.pdf', 'total_pages': 29, 'page': 9, 'page_label': '10'}, page_content='the WoS database, including the publication title, abstract, or keywords. \nIn o

In [47]:
# Join the retrieved chunks; `doc` avoids reusing the name of the loaded `document` list
content_texts = "\n\n".join(doc.page_content for doc in retrieved_documents)

In [48]:
content_texts

'the WoS database, including the publication title, abstract, or keywords. \nIn our survey, we found that many papers introduced, discussed, or cited \nCDL, but did not directly use the data in their experiments. Therefore, \nIC1 could ensure that CDL has been applied in the selected publications, \nrather than simply mentioning it in passing.\nTo narrow down the publications to those specifically related to \nremote sensing, IC2 states that the publication ’ s “ Category ” field in the \nWoS database must be labeled as “ remote sensing ” . However, many \npublications related to remote sensing were published in computer sci -\nence, agricultural, or multidisciplinary journals, which were not cate -\ngorized as “ remote sensing ” . To include these publications in this \nreview, we added a rule that requires the presence of certain terms, such \nas “ Remote Sensing ” , “ Earth observation ” , “ Landsat ” , “ Sentinel ” , or \n“ MODIS ” in any of the title, keywords, or abstract of the 

In [49]:
final_prompt = prompt.invoke({"context":content_texts,"question":question})

In [50]:
final_prompt

StringPromptValue(text="\n    You are a helpful assistant.\n    Answer ONLY from the provided transcript context.\n    If the context IS INSUFFICIENT, just say you don't know and probably need more information.\n\n    the WoS database, including the publication title, abstract, or keywords. \nIn our survey, we found that many papers introduced, discussed, or cited \nCDL, but did not directly use the data in their experiments. Therefore, \nIC1 could ensure that CDL has been applied in the selected publications, \nrather than simply mentioning it in passing.\nTo narrow down the publications to those specifically related to \nremote sensing, IC2 states that the publication ’ s “ Category ” field in the \nWoS database must be labeled as “ remote sensing ” . However, many \npublications related to remote sensing were published in computer sci -\nence, agricultural, or multidisciplinary journals, which were not cate -\ngorized as “ remote sensing ” . To include these publications in this \nr

## Answer Generation.

In [54]:
response = LLM.invoke(final_prompt)

In [55]:
response.content

"I don't know and probably need more information, as the provided transcript does not mention the aspect of stars."

## Build chain.

In [56]:
# import libraries for chain building
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

In [57]:
def reformat_doc(retrieved_documents):
    """Join the text of the retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in retrieved_documents)

In [58]:
parallel_chain = RunnableParallel({
    "context": retriever | RunnableLambda(reformat_doc),
    "question": RunnablePassthrough()
}
)

In [59]:
parallel_chain.invoke('Quickly and briefly summarize the document')

{'context': 'as “ Remote Sensing ” , “ Earth observation ” , “ Landsat ” , “ Sentinel ” , or \n“ MODIS ” in any of the title, keywords, or abstract of the publication.\nTo ensure the selected publications reflected the up-to-date research \ntrends and avoided duplicate research items, IC3 limits the document \ntype to only peer-reviewed articles that were published in journals \nindexed by the WoS Core Collection. Focusing on these high-impact \njournal articles guarantees that our review reflects the most represen -\ntative studies within the remote sensing field.\nThe query string of inclusion criteria in the WoS data database is: \nALL = ( “ Cropland Data Layer ” OR “ CDL ” ) AND (WC = “ Remote Sensing ” \nOR ALL = ( “ Remote Sensing ” OR “ Earth observation ” OR “ Landsat ” OR \n“ Sentinel ” OR “ MODIS ” )) AND DT = “ Article ” , where ALL represents all \nfields (title, abstract, keywords), WC represents WoS categories, and DT \nrepresents document type. After the initial screenin

In [60]:
parse = StrOutputParser()

In [61]:
main_chain = parallel_chain | prompt | LLM | parse
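
The `|` operator passes the output of each step to the next, and the parallel step sends the same question down two branches. The sketch below mimics that data flow with plain functions so it can be followed without any API calls. Every name in it (`pipe`, `parallel`, `fake_retriever`, `fake_llm`, and so on) is a hypothetical stand-in, not LangChain's implementation.

```python
from functools import reduce


def pipe(*steps):
    """Compose callables left to right, similar in spirit to chaining with `|`."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def parallel(**branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return lambda value: {name: branch(value) for name, branch in branches.items()}


# Stand-ins for the retriever, prompt, model and parser used in this notebook.
def fake_retriever(question):
    return ["chunk one", "chunk two"]

def join_chunks(chunks):
    return "\n\n".join(chunks)

def fake_prompt(inputs):
    return f"{inputs['context']}\n\nQuestion: {inputs['question']}"

def fake_llm(prompt_text):
    return {"content": f"echo: {prompt_text}"}

def parse_content(message):
    return message["content"]


chain = pipe(
    parallel(context=pipe(fake_retriever, join_chunks), question=lambda q: q),
    fake_prompt,
    fake_llm,
    parse_content,
)
print(chain("What is CDL?"))
```

The `question=lambda q: q` branch plays the role of a passthrough: it forwards the original question unchanged so the prompt step receives both the context and the question.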

In [62]:
print(main_chain.invoke("Quickly and briefly summarize the document"))

This document outlines a methodology for selecting relevant literature on "Cropland Data Layer" (CDL) within the remote sensing field. It details specific inclusion and exclusion criteria, keywords used for searching ("Remote Sensing", "Earth observation", "Landsat", "Sentinel", "MODIS", "Cropland Data Layer", "CDL"), and the databases utilized (Web of Science Core Collection and USDA NASS CDL website). The process involved screening, manually applying exclusion criteria, and removing duplicate records, ultimately identifying a specific number of articles from each source.


In [63]:
print(main_chain.invoke("Quickly and briefly summarize the document. Put them in bullet format now."))

*   The document lists numerous authors and their specific contributions to the work, including writing, project administration, methodology, supervision, funding acquisition, validation, and providing resources.
*   It describes the methodology for screening qualified publications related to the Cropland Data Layer (CDL) in the remote sensing field.
*   The literature screening used the Web of Science (WoS) Core Collection and the USDA NASS website.
*   Inclusion criteria required specific keywords related to "Cropland Data Layer" or "CDL" and remote sensing terms (e.g., "Remote Sensing", "Landsat", "MODIS") in the title, abstract, or keywords, focusing on peer-reviewed articles.
*   Exclusion criteria were applied to remove irrelevant publications, such as those where "CDL" was not related to "Cropland Data Layer," studies not using remote sensing data, or review articles.
*   The initial screening process yielded 162 articles from the WoS database and 43 from the USDA NASS CDL websi