How to Retrieve Every Person (Lead)

March 5, 2020

This is a follow-up to How to Retrieve Every Custom Object. We get many inquiries about the process required to retrieve every person (lead) from a Marketo Engage instance. We have provided many useful answers, but none has been as complete as this one. I have identified a few key concepts needed to extract every lead using Marketo’s Bulk Extract API; all the other details can be learned from the demonstration code I created to go with it. After reading this post and exploring the demo code, you will have everything you need to retrieve every lead from a Marketo Engage instance.

Technique

Overview

The core technique uses the Bulk Lead Extract API. You might expect to be able to create a bulk lead export job with no filter, but you can’t: the API requires a filter. The available filters are createdAt, staticListName, staticListId, updatedAt, smartListName, and smartListId. Filtering by a Smart List that itself contains no filters might also seem attractive, but the system is smart enough to treat an unfiltered Smart List the same way: the API requires a filter here too. Since we need a filter, the trustworthy and canonical choice is createdAt. This filter type permits datetime ranges of up to 31 days, so we will need to run multiple jobs and combine the results.

We start by finding the oldest possible create date for a lead in the target instance. Starting at that date, we’ll create jobs spanning 31 days minus one second (more on that later). After creating each job, we will enqueue it and wait for it to complete. Then we’ll download the resulting file and check its integrity using a checksum. Finally, we’ll deduplicate leads by ID and write the unique leads to an output CSV file.

Find Your Oldest Lead

I’m using a little “trick” to get the oldest possible date for a lead in the target instance. There’s no API endpoint dedicated to that task, so we need a little creativity. What I do is query all folders with maxDepth=1, which gives us a list of all the top-level folders in the instance. Then I collect the createdAt dates, parse them, and find the oldest one. This method works because some default, top-level folders are created along with the instance, and no lead could have been created before the instance existed.
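Here’s a minimal sketch of that query, assuming the Python requests library; BASE_URL and TOKEN are hypothetical placeholders for your instance’s REST endpoint and a valid access token.

from datetime import datetime
import requests

BASE_URL = "https://123-ABC-456.mktorest.com"  # placeholder instance endpoint
TOKEN = "<access token>"                       # placeholder OAuth2 access token

# Query only the top-level folders.
resp = requests.get(
    BASE_URL + "/rest/asset/v1/folders.json",
    params={"maxDepth": 1, "access_token": TOKEN},
)
folders = resp.json()["result"]

# Asset API dates look like "2016-08-01T22:03:12Z+0000"; parsing just the
# first 19 characters sidesteps the unusual timezone suffix.
oldest = min(
    datetime.strptime(f["createdAt"][:19], "%Y-%m-%dT%H:%M:%S")
    for f in folders
)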

Select Required Fields

You need to decide which fields you want to extract. Find the available fields for your target instance using the Describe Lead 2 endpoint. The response to that request will include a list named “fields”. Here’s an abbreviated excerpt from an example response (a single field entry, with illustrative values):
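{
  "name": "email",
  "displayName": "Email Address",
  "dataType": "email",
  "length": 255,
  "updateable": true,
  "crmManaged": false
}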


This endpoint returns an exhaustive list including both standard and custom fields. The more fields you request, the longer your export job will take to complete and the larger the resulting file will be. You should typically choose just the fields you need. Nothing prevents you from requesting every available field, though, so that’s what I’m demoing. The field identifier required when creating an export job is the name value, so I’ll extract the name values into a list of all field names and use it to request every available field when I create each export job.
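A sketch of that extraction, continuing with the same hypothetical BASE_URL and TOKEN placeholders:

# Collect every field name from the Describe Lead 2 response.
resp = requests.get(
    BASE_URL + "/rest/v1/leads/describe2.json",
    params={"access_token": TOKEN},
)
field_names = [f["name"] for f in resp.json()["result"][0]["fields"]]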

Export Job Date Ranges: 31 Days Each

Each export job can span up to 31 days. The demo instance I’m using was created in August of 2016, so I’ll need to create a little over 40 jobs today: the number of days since the first create date, divided by 31 and rounded up. The API permits two export jobs to be processing at once, so you could extract with two jobs running in parallel. Bulk extract jobs are a resource shared with every other integration, though, so I’m going to be nice: I’ll leave the second slot available for other integrations and demonstrate running single jobs one after the other.

The dates used for the createdAt filter are formatted using the ISO 8601 specification. They are always in GMT, so the timezone will simply be represented as “Z” or “+00:00”. August 1st, 2016 is 2016-08-01T00:00:00+00:00, and 31 days later is September 1st, 2016, which is 2016-09-01T00:00:00+00:00. Both start and end times are inclusive, so I’m going to subtract 1 second from that ending time: 2016-09-01T00:00:00+00:00 becomes 2016-08-31T23:59:59+00:00. Subtracting a second avoids overlapping ranges. Since GMT is the default, you can also leave the Z or +00:00 off.
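Here’s a sketch of generating those windows, picking up the oldest date found earlier (datetime and timedelta are from Python’s standard library):

from datetime import datetime, timedelta

def windows(start, end):
    """Yield inclusive (startAt, endAt) ISO 8601 pairs covering start..end."""
    span = timedelta(days=31)
    one_second = timedelta(seconds=1)
    while start <= end:
        window_end = min(start + span - one_second, end)
        yield (start.strftime("%Y-%m-%dT%H:%M:%SZ"),
               window_end.strftime("%Y-%m-%dT%H:%M:%SZ"))
        start = window_end + one_second

# e.g. windows(oldest, datetime.utcnow()) yields
# ("2016-08-01T00:00:00Z", "2016-08-31T23:59:59Z"), then
# ("2016-09-01T00:00:00Z", "2016-10-01T23:59:59Z"), ...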

Deduplication

Even though I’ve gone to the trouble of avoiding overlapping times, I also implemented deduplication. I did that because there are edge cases when times change (Daylight Saving Time) that result in ambiguous values, and, as a result, Marketo’s Bulk Extract API can return otherwise unexpected duplicate leads. It happens rarely, but it needs to be accounted for in any integration using datetime filter ranges. I removed one second to make clear that times are inclusive: I wouldn’t want you to think that a job whose createdAt filter has startAt and endAt times of 2016-08-01T00:00:00Z and 2016-09-01T00:00:00Z respectively won’t include leads created at exactly 2016-09-01T00:00:00Z; it will.

Execution

Create a Job

The first step is to create a job using the Create Export Lead Job endpoint. In this demo, the request to create our first export job looks like this:
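POST https://123-ABC-456.mktorest.com/bulk/v1/leads/export/create.json
Content-Type: application/json

{
  "fields": ["id", "email", "firstName", "lastName", "createdAt", "updatedAt"],
  "format": "CSV",
  "filter": {
    "createdAt": {
      "startAt": "2016-08-01T00:00:00Z",
      "endAt": "2016-08-31T23:59:59Z"
    }
  }
}

(The host is the standard documentation placeholder and authentication is omitted; the fields list is abbreviated here, while the demo passes every name collected from Describe Lead 2.)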

The response will look like this:
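{
  "requestId": "a0e7#170b1775a45",
  "result": [
    {
      "exportId": "2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f",
      "format": "CSV",
      "status": "Created",
      "createdAt": "2020-03-05T18:21:17Z"
    }
  ],
  "success": true
}

(All identifiers shown are illustrative.)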

Enqueue the Job

The job is now created but just sitting there doing nothing. To run the job, we’ll need to call the enqueue endpoint using the exportId value to build the URI for the request. That will look like this:
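POST https://123-ABC-456.mktorest.com/bulk/v1/leads/export/2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f/enqueue.json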

There’s no body for this POST; we are simply using the POST HTTP verb here. That request will generate a response like this:
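{
  "requestId": "a0e7#170b1775a46",
  "result": [
    {
      "exportId": "2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f",
      "format": "CSV",
      "status": "Queued",
      "createdAt": "2020-03-05T18:21:17Z",
      "queuedAt": "2020-03-05T18:22:06Z"
    }
  ],
  "success": true
}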

As I mentioned earlier, you are limited in the number of jobs that can run at a time. There is also a limit on the number of jobs queued at one time: 10. We need more than 40 jobs, so that limit prevents us from enqueueing them all at once. Other integrations can run jobs too, so we need to account for the possibility that all slots are full. Trying to enqueue a new job when there are already 10 queued jobs will generate a 1029 error. When you get a 1029, use an exponential backoff until the job can be enqueued: I wait 1 minute and double that value each time I get another 1029 error code, up to a maximum of 4 minutes between requests. This technique is known as Truncated Binary Exponential Backoff and is a best practice for recoverable errors and status checks.
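Here’s a sketch of that backoff loop; enqueue_job is a hypothetical helper that POSTs to the enqueue endpoint and returns the parsed JSON response.

import time

def enqueue_with_backoff(export_id):
    delay = 60  # start at 1 minute
    while True:
        response = enqueue_job(export_id)
        errors = response.get("errors") or []
        if not any(str(e.get("code")) == "1029" for e in errors):
            return response  # enqueued, or failed for an unrelated reason
        time.sleep(delay)
        delay = min(delay * 2, 240)  # double the wait, capped at 4 minutes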

Wait for Job to Complete

Each job will take some time to run so we will call the status endpoint to monitor its progress. Again, we’ll include the exportId in the request URI like this:
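GET https://123-ABC-456.mktorest.com/bulk/v1/leads/export/2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f/status.json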

Before the job is complete, you’ll get a response that looks like this:
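{
  "requestId": "a0e7#170b1775a47",
  "result": [
    {
      "exportId": "2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f",
      "format": "CSV",
      "status": "Queued",
      "createdAt": "2020-03-05T18:21:17Z",
      "queuedAt": "2020-03-05T18:22:06Z"
    }
  ],
  "success": true
}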

I execute the same exponential backoff (1 minute up to 4 minutes) while the job status is “Queued”. The status isn’t real-time; it’s updated once per minute and there’s very little benefit to polling faster. I reset the backoff when the job status changes to “Processing”. We are waiting for a status “Completed”, which looks like this:
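{
  "requestId": "a0e7#170b1775a48",
  "result": [
    {
      "exportId": "2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f",
      "format": "CSV",
      "status": "Completed",
      "createdAt": "2020-03-05T18:21:17Z",
      "queuedAt": "2020-03-05T18:22:06Z",
      "startedAt": "2020-03-05T18:23:02Z",
      "finishedAt": "2020-03-05T18:29:48Z",
      "numberOfRecords": 1166,
      "fileSize": 90480,
      "fileChecksum": "sha256:d9c73f0b6960c71623c8bafe29603b3e8e1fa91d3ca5f7a795d7ba1564eb81cb"
    }
  ],
  "success": true
}

(Counts, sizes, and the checksum shown are illustrative.)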


The numberOfRecords value will be zero when the request returns no leads. I check this value and skip the next steps when that makes sense. When leads are returned, I extract the fileChecksum value. We will use it to check the integrity of the file when it’s downloaded.

Get Your Leads

If the numberOfRecords is larger than zero, download the exported file using Get Export Lead File with a request like this:
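GET https://123-ABC-456.mktorest.com/bulk/v1/leads/export/2d4f8a2e-7c1d-4b8a-9f3e-1a2b3c4d5e6f/file.json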

Check that the file transferred without error: calculate the checksum of the downloaded file and compare it to the fileChecksum we saved earlier. Calculate the checksum using SHA-2, specifically the SHA-256 hash function. If the calculated checksum doesn’t match, there was an error in the file transfer, and you can either try the transfer again or abort and recover manually.
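Here’s a sketch using Python’s standard hashlib; it assumes the reported value may carry a “sha256:” prefix, as in the status response above.

import hashlib

def verify(file_bytes, reported_checksum):
    # Strip an optional "sha256:" prefix before comparing hex digests.
    expected = reported_checksum.split(":", 1)[-1]
    return hashlib.sha256(file_bytes).hexdigest() == expected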

Aggregate the Data

After cycling through every 31-day range from the first lead until today, you’ll have a complete set, with one file for each range. The simplest way to build a single aggregated file with every lead is to concatenate these files after removing the header row from all but the first file. If you do that, plan for potential duplicates later in your data processing pipeline. In my demonstration, I process the files as I download them: before adding each row of data to the output file, I deduplicate by checking whether the row’s lead ID has already been written.
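A sketch of that per-file pass, assuming every export requested the id field and that writer is a csv.DictWriter opened on the output file with the same field names:

import csv

seen_ids = set()

def append_rows(download_path, writer):
    with open(download_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["id"] not in seen_ids:  # skip leads already written
                seen_ids.add(row["id"])
                writer.writerow(row)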

Conclusion

I’ve developed some demonstration code, hosted here, which hopefully fills in the details of this process and can serve as a template for your own development. The demo code is intended as a learning tool, so there are robustness improvements that would be required for a production system. The code is provided AS-IS under the MIT license, but it’s probably good enough for one-off usage with human supervision.

Nothing is stopping you now! When you follow this process, you will extract every lead, and potentially every field, from the target Marketo Engage instance using Marketo’s Bulk Extract API. To extend your data further, get each lead’s activities using the techniques in How-To Retrieve Activities for a Single Lead using REST API, and keep lead data updated using the techniques you’ll learn in How-To Synchronizing Lead Data Changes using REST API.