The Model Development Lifecycle with LLMs and DSPy
In his MLOps course, Andrew Ng describes the Model Development Cycle. The core lesson is that model development is a continuous cycle and that at every stage, you find yourself revisiting data collection.
I ran through this cycle for my recent LLM project, starting from nothing: no data, no model, no deployment. I want to share what I learned about applying this process to LLM fine-tuning specifically. I used DSPy for my project, and I think it's particularly well suited to the Model Development Cycle mental model.
Scoping
In order to help people use a React Map Component I built, I’m making a Map Maker website. The user can upload a CSV with columns Title, Description, Address, Lat, Long, and then publish an interactive map.
To make the whole thing a little easier though, I’ll allow the user to upload their data in any format they have (CSV, HTML, KML, JSON), and on the backend I’ll reshape it into the CSV with the desired columns.
Modeling Round 1
Building a Zero Shot Model with DSPy
Ng says you always start the model development cycle with Data. Moreover, you should start with the simplest model possible, and only go for a more advanced model when you prove you need it. Well, I think LLMs flip this on its head. Instead, I think it’s nice to start a project with the biggest model you can afford, running zero-shot on your task. This lets you skip the first round of data collection and get something built on day 1.
In my past LLM projects, though, I’ve found that zero-shot modeling can lead to endless prompt rewriting. With no data to fine-tune on, the model behaves poorly, and I end up cramming instructions into the prompt to get it to behave. I wanted to avoid that, especially since I knew I would collect real data later.
I recently read about the DSPy library, which promises “Programming not Prompting.” The library supports constrained decoding, where you define the output format in code that every generation must follow. To use constrained decoding, you define a “Signature” that specifies its input and output formats. When you instantiate a Predictor from your Signature, you can rely on DSPy to write the underlying prompts needed to get the language model to produce structured output that honors the Signature.
from typing import List, Optional

import dspy
from pydantic import BaseModel, Field

class Location(BaseModel):
    Title: Optional[str] = Field(None)
    Address: Optional[str] = Field(None)
    Description: Optional[str] = Field(None)
    Lat: Optional[float] = Field(None)
    Long: Optional[float] = Field(None)

class Input(BaseModel):
    file_contents: str = Field(description="Arbitrary file contents from which to extract location data.")

class Output(BaseModel):
    locations: List[Location] = Field()

class ExtractLocationsSignature(dspy.Signature):
    input: Input = dspy.InputField()
    output: Output = dspy.OutputField()

lm = dspy.LM('openai/gpt-4o-mini', api_key=api_key, max_tokens=10000)
dspy.configure(lm=lm)

predictor = dspy.ChainOfThought(ExtractLocationsSignature)
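A quick usage sketch (the file path here is just a placeholder): you call the predictor with an Input instance, and the returned prediction carries the structured Output.

# Hypothetical call to the zero-shot predictor defined above.
raw = open("my_places.kml").read()  # placeholder input file

prediction = predictor(input=Input(file_contents=raw))
for loc in prediction.output.locations:
    print(loc.Title, loc.Lat, loc.Long)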
Deployment Round 1
Deploying the Zero Shot Model
I wanted to present at the December AI Tinkerers meeting, so before taking the time to come up with test data or an evaluation function, I went straight to deployment and built this nifty demo that sort of worked!
Data Round 1
Collecting failing inputs
My live demo actually ended up failing. I chose a file with too many locations, and I hit the max output size of the model. As I write this in February of 2025, I can’t find a model available to me with an output size of more than 16k tokens. Here are the current OpenAI output token limits. The context windows are huge, but the output token limits are lacking.
Even if the model could produce an output that size, the latency would be intolerable.
Modeling Round 2
Simplify the task
Model 2 would look at one chunk of a file at a time and find the locations within that chunk. This way, a file of any size should work, and I can actually speed up inference by running the chunks in parallel (a sketch of the fan-out follows the Signature below). Here is the Signature for Model 2:
# Location is the same Pydantic model defined in Modeling Round 1.
class ChunkedInput(BaseModel):
    file_contents_chunk: str = Field(description="A chunk of a text file from which to extract locations.")
    example_location: Optional[Location] = Field(None, description="Example output to try to mimic")

class Output(BaseModel):
    locations: List[Location] = Field(description="Locations seen in this section of the file.")

class ChunkedLocationSignature(dspy.Signature):
    input: ChunkedInput = dspy.InputField()
    output: Output = dspy.OutputField()

predictor = dspy.Predict(ChunkedLocationSignature)
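Each chunk then gets its own model call. Here is a rough sketch of the fan-out, assuming a hypothetical chunk_file helper that dispatches to the per-filetype chunkers described in Data Round 2; it is not the exact production code.

from concurrent.futures import ThreadPoolExecutor
from typing import List

def extract_locations(file_contents: str) -> List[Location]:
    # chunk_file is a hypothetical helper that splits the file into
    # model-sized pieces (see the per-filetype chunkers in Data Round 2).
    chunks = chunk_file(file_contents)

    def run_chunk(chunk: str) -> List[Location]:
        prediction = predictor(input=ChunkedInput(file_contents_chunk=chunk))
        return prediction.output.locations

    # Run every chunk through the model in parallel and flatten the results.
    with ThreadPoolExecutor(max_workers=8) as pool:
        per_chunk = pool.map(run_chunk, chunks)
    return [loc for locations in per_chunk for loc in locations]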
Data Round 2
Annotating a handful of input/output pairs
Data collection is hard. When searching for tips on how to do it, I found this video discussing what they called “annotation ergonomics.” Some people think you should just use Google Sheets, while others go to the effort of building great end-to-end pipelines to make annotating easy. I did look at some annotation tools (argilla, snorkel, pigeon) but ultimately decided Google Sheets seemed good enough.
To collect rows, I wrote a Python script which takes in a file, chunks it, runs the model on every chunk, then writes the results to a CSV with two columns: input and output. After that, I used Google Sheets to review the model outputs.
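A minimal sketch of that script (the chunk_file helper and the file handling are simplified assumptions):

import csv

def write_annotation_rows(input_path: str, out_csv: str) -> None:
    # Chunk the file, run the predictor on each chunk, and dump
    # (input, output) pairs to a CSV I can review in Google Sheets.
    with open(input_path) as f:
        chunks = chunk_file(f.read())  # hypothetical per-filetype chunker
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "output"])
        for chunk in chunks:
            prediction = predictor(input=ChunkedInput(file_contents_chunk=chunk))
            writer.writerow([chunk, prediction.output.model_dump_json()])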
I have a rule I’m starting to form about LLM tasks: if you can’t do it, the model can’t do it. If you’re looking at the prompt and thinking, “how the heck would I do that?”, then it’s likely the model isn’t going to generate great output. When I first started labeling input/output pairs, the input chunks were poorly formatted, with cutoffs in bad places. I was chunking based only on character count, which led to poorly formatted chunks that were sometimes missing key fields about a location.
To make the task easier, I wrote per-filetype preprocessors for the likely filetypes. For each one I wrote a pretty printer and a chunker that uses the nifty RecursiveCharacterTextSplitter from LangChain. For example, here is my text splitter for KML files:
from typing import List

from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_xml(xml_str: str) -> List[str]:
    # pretty_print_xml is my own helper; a sketch of it follows below.
    xml_str = pretty_print_xml(xml_str)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=3000,
        chunk_overlap=0,
        separators=[
            r"<\/Placemark>",  # end of placemark for KML
            r"</\w+>",         # closing tag
            r"\n",             # newlines
            r"\s+",            # any run of whitespace
            "",                # fallback to individual characters
        ],
        keep_separator="end",
        is_separator_regex=True,
    )
    return text_splitter.split_text(xml_str)
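For completeness, a minimal version of pretty_print_xml using the standard library might look like this (it assumes the input is well-formed XML):

import xml.dom.minidom

def pretty_print_xml(xml_str: str) -> str:
    # Re-indent the XML so each element sits on its own line, which keeps
    # chunk boundaries from landing in the middle of a Placemark.
    return xml.dom.minidom.parseString(xml_str).toprettyxml(indent="  ")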
Now that I had framed my problem as an LLM task that I myself was actually capable of doing, I annotated 4 diverse examples to produce my first training set! Interestingly, the zero-shot model’s performance at this stage was already pretty good. I think zero-shot performance is a good indicator of how well framed a task is.
Modeling Round 3
Adding Few Shot Examples to the prompt
Now that I had a training set (with all of 4 rows), I added them to my prompt as few-shot examples. In DSPy, this is how you add few-shot examples to your prompt:
predictor = dspy.Predict(ChunkedLocationSignature)
labeled_examples = load_outputs("labeled-data.csv")

# LabeledFewShot inserts the labeled examples into the prompt as demonstrations.
optimizer = dspy.teleprompt.LabeledFewShot()
predictor = optimizer.compile(student=predictor, trainset=labeled_examples)
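load_outputs is my own helper; a sketch of what it needs to return, a list of dspy.Example objects with the input field marked, might look like this (the CSV layout matches the annotation script above):

import csv
from typing import List

def load_outputs(path: str) -> List[dspy.Example]:
    # Each row holds the chunk text and the corrected JSON output.
    examples = []
    with open(path) as f:
        for row in csv.DictReader(f):
            example = dspy.Example(
                input=ChunkedInput(file_contents_chunk=row["input"]),
                output=Output.model_validate_json(row["output"]),
            ).with_inputs("input")
            examples.append(example)
    return examples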
Deployment Round 2
Deploying the chunked model
I have Model 2 deployed and I’m collecting more data from a few friends before I share it more widely!
Takeaways
My Opinion of DSPy
DSPy scales to the amount of data you have available. For the first two rounds of modeling, I had zero example data and no evaluation function, but I was still able to produce a decent model. Once I had collected a few rows, I was able to incorporate them as few-shot examples. And once I collect more rows, DSPy provides more ways to optimize a predictor using that data, e.g. training an adapter.
The Model Development Cycle
I think the Model Development Cycle is the right mental model for LLM tasks, with the slight addendum that if you have no data, you should just start by building a zero-shot model. From there, I think the best thing you can do is build an annotation workflow, start annotating inputs and outputs, and reframe the input data whenever the task seems too hard as is.
Next Steps
What I need next is a plan for how to actually collect data once I share my deployed model. It seems easy to collect positive signals from the user and feed those rows back into the training data, but how do I use negative examples, where the user was unhappy with the generated output?
I am curious: is there a general method for bootstrapping a model when all you have is unstructured data and a human to say yes/no on the outputs (besides the process I’m using of re-annotating the failures to create positive examples)?
If you have any advice, I’d love to hear from you!