Bulk Extract

Note: There are currently stability issues affecting Bulk Extract for some customers when using large date ranges.  If you receive a system error when polling the status of your job, reducing the date range of your filter may help.

Marketo provides interfaces for retrieval of large sets of person and person-related data, called Bulk Extract.  Currently, interfaces are offered for two object types:

  • Leads (Persons)
  • Activities

Bulk extract is performed by creating a job, defining the set of data to retrieve, enqueuing the job, waiting for the job to finish writing a file, and then retrieving the file over HTTP.  These jobs are executed asynchronously and can be polled to retrieve the status of the export.

Authentication

The bulk extract APIs use the same OAuth 2.0 authentication method as other Marketo REST APIs.  This requires a valid access token to be embedded either as the query-string parameter “access_token={AccessToken}”, or as an HTTP header “Authorization: Bearer {AccessToken}”.
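For example, a minimal request using the header form (the host shown is a placeholder for your instance’s REST endpoint):

  GET /bulk/v1/leads/export.json HTTP/1.1
  Host: 123-ABC-456.mktorest.com
  Authorization: Bearer {AccessToken}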

Limits

  • Max Concurrent Export Jobs: 2
  • Max Queued Export Jobs (inclusive of currently exporting jobs): 10
  • File Retention Period: 10 days
  • Max Size of Daily Export: 500MB
  • Max Time Span for Date Range Filter (createdAt or updatedAt): 30 days

Bulk Lead Extract filters for UpdatedAt and Smart List are unavailable for some subscription types.  If unavailable, a call to the Create Export Lead Job endpoint will return an error “1035, Unsupported filter type for target subscription”.  These filters will become available over the course of 2017 as part of the rollout of Marketo’s Big Data Architecture to all subscriptions.

Queue

The bulk extract APIs use a job queue (shared between leads and activities).  Extract jobs must first be created by calling a Create Export Lead/Activity Job endpoint, and then enqueued by calling the corresponding Enqueue Export Lead/Activity Job endpoint.  Once enqueued, jobs are pulled from the queue and started when computing resources become available.

The maximum number of jobs in the queue is 10.  If you try to enqueue a job when the queue is full, the Enqueue Export Job endpoint will return an error “1029, Too many jobs in queue”.  A maximum of 2 jobs can run concurrently (status is “Processing”).

File Size

The bulk extract APIs are metered based on the size-on-disk of the data retrieved by a bulk extract job.  The explicit size in bytes for a job can be determined by reading the “fileSize” attribute from the completed status response of an export job.

The daily quota is a maximum of 500MB per day, which is shared between leads and activities.  When the quota is exceeded, you cannot Create or Enqueue another job until the daily quota resets at midnight Central Time.  Until that time, an error “1029, Export daily quota exceeded” is returned.

Once a job is queued or processing, it will run to completion (barring an error or job cancellation).  If a job fails for some reason, you will need to recreate it.  Files are fully written only when a job reaches the completed state (partial files are never written).

You can determine the total amount of disk space used for the current day by calling Get Export Lead/Activity Jobs.  These endpoints return a list of all jobs from the past 10 days.  You can filter that list down to just the jobs that completed in the current day (using the “status” and “finishedAt” attributes), then sum the file sizes of those jobs to produce the total amount used.  There is no way to delete a file to reclaim disk space.
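A sketch of that check, using the “status” filter described under Retrieving Jobs below (the quota is shared, so query both object types):

  GET /bulk/v1/leads/export.json?status=Completed
  GET /bulk/v1/activities/export.json?status=Completed

From each response’s result array, keep only the jobs whose “finishedAt” falls within the current day, and sum their “fileSize” values.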

Permissions

Bulk Extract uses the same permissions model as the Marketo REST API and does not require any additional special permissions, though specific permissions are required for each set of endpoints.

Creating a Job

Marketo’s bulk extract APIs use the concept of a job for initiating and executing data extraction.  Let’s look at creating a simple lead export job.
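A sketch of such a request (the date-range filter shape, with “startAt” and “endAt” bounds, is assumed here):

  POST /bulk/v1/leads/export/create.json
  Content-Type: application/json

  {
    "fields": ["firstName", "lastName"],
    "format": "CSV",
    "columnHeaderNames": {
      "firstName": "First Name",
      "lastName": "Last Name"
    },
    "filter": {
      "createdAt": {
        "startAt": "2017-01-01T00:00:00Z",
        "endAt": "2017-01-30T00:00:00Z"
      }
    }
  }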

This simple request will construct a job that returns the values contained in the “firstName” and “lastName” fields, with the column headers “First Name” and “Last Name”, as a CSV file containing each lead created between January 1, 2017 and January 30, 2017.

When we create the job, it will return a job ID in the “exportId” attribute.  We can then use this job ID to enqueue the job, cancel it, check its status, or retrieve the completed file.
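An illustrative response (IDs and timestamps are placeholders):

  {
    "requestId": "e42b#14272d07d78",
    "result": [
      {
        "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
        "format": "CSV",
        "status": "Created",
        "createdAt": "2017-01-21T11:47:30-08:00"
      }
    ],
    "success": true
  }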

Common Parameters

Each job creation endpoint shares some common parameters for configuring the file format, field names, and filter of a bulk extract job.  Each subtype of extract job may have additional parameters:

  • format (String): Determines the file format of the extracted data, with options for comma-separated values, semi-colon-separated values, and tab-separated values.  Accepts one of: CSV, SSV, TSV.  Defaults to CSV.
  • columnHeaderNames (Object): Allows setting the names of column headers in the returned file.  Each member key is the name of the column header to rename, and the value is the new name of the column header.  E.g.:

      "columnHeaderNames": {
        "firstName": "First Name",
        "lastName": "Last Name"
      }

  • filter (Object): The filter applied to the extract job.  Types and options vary between job types.

Retrieving Jobs

In some cases, you may need to retrieve your recent jobs.  This is easily done with the Get Export Jobs endpoint for the corresponding object type.  Each Get Export Jobs endpoint supports a “status” filter field, a “batchSize” to limit the number of jobs returned, and a “nextPageToken” for paging through large result sets.  The status filter supports each valid status for an export job: Created, Queued, Processing, Cancelled, Completed, and Failed.  The batchSize has a maximum and default of 300.  Let’s get the list of Lead Export Jobs:
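(The query parameters shown here are optional and illustrative.)

  GET /bulk/v1/leads/export.json?status=Completed,Failed&batchSize=10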

The endpoint will respond with the status of each job created in the past 7 days for that object type in the result array.  The response will only include results for jobs owned by the API user making the call.
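An illustrative response for a single completed job (all values are placeholders):

  {
    "requestId": "9ff1#1548d6472b8",
    "result": [
      {
        "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
        "format": "CSV",
        "status": "Completed",
        "createdAt": "2017-01-21T11:47:30-08:00",
        "queuedAt": "2017-01-21T11:48:30-08:00",
        "startedAt": "2017-01-21T11:51:30-08:00",
        "finishedAt": "2017-01-21T12:59:30-08:00",
        "numberOfRecords": 120000,
        "fileSize": 122880
      }
    ],
    "success": true
  }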

Starting a Job

With our job id in hand, let’s start the job:
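(A sketch; the export ID in the path is the illustrative placeholder from above.)

  POST /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/enqueue.json

An illustrative response:

  {
    "requestId": "1e22#1549ac4d4aa",
    "result": [
      {
        "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
        "format": "CSV",
        "status": "Queued",
        "createdAt": "2017-01-21T11:47:30-08:00",
        "queuedAt": "2017-01-21T11:48:30-08:00"
      }
    ],
    "success": true
  }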

This kicks off the execution of the job and returns a status response.  Since the export is always performed asynchronously, we will need to poll the status of the job to determine whether it has completed.  Note that the status for a given job will not be updated more frequently than once every 60 seconds, so the status should never be polled more frequently than that.  Keep in mind, however, that the great majority of use cases should not require polling more frequently than once every 5 minutes.  Data from each successful export is retained for 10 days.

Polling Job Status

Determining the status of the job is simple.
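A sketch of the status call, with an illustrative response:

  GET /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/status.json

  {
    "requestId": "722a#154a1b6172d",
    "result": [
      {
        "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
        "format": "CSV",
        "status": "Completed",
        "createdAt": "2017-01-21T11:47:30-08:00",
        "queuedAt": "2017-01-21T11:48:30-08:00",
        "startedAt": "2017-01-21T11:51:30-08:00",
        "finishedAt": "2017-01-21T12:59:30-08:00",
        "numberOfRecords": 120000,
        "fileSize": 122880
      }
    ],
    "success": true
  }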

The inner “status” attribute indicates the progress of the job and may be one of the following values: Created, Queued, Processing, Cancelled, Completed, or Failed.  In this case our job has completed, so we can stop polling and continue on to retrieve the file.  When completed, “fileSize” will indicate the total length of the file in bytes.

Retrieving Your Data

When your job has completed, you can easily retrieve the file.
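(A sketch; the export ID is the illustrative placeholder from above.)

  GET /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/file.json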

The endpoint responds with the contents of the file, formatted in the way that the job was configured.  To support partial and resumption-friendly retrieval of extracted data, the file endpoint optionally supports the HTTP header “Range” of the type “bytes” (per RFC 7233).  If the header is not set, the whole of the contents will be returned.  To retrieve the first 10,000 bytes of a file, starting from byte 0, you would pass the following header as part of your GET request to the endpoint:
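  Range: bytes=0-9999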

When retrieving a partial file, the endpoint will respond with status code 206, as well as returning the Accept-Ranges, Content-Length, and Content-Range headers:
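  HTTP/1.1 206 Partial Content
  Accept-Ranges: bytes
  Content-Length: 10000
  Content-Range: bytes 0-9999/122880

(The total after the slash in Content-Range, here the illustrative 122880, is the full size of the file in bytes.)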

Partial Retrieval and Resumption

Files can be retrieved in part, or retrieval can be resumed at a later time, using the Range header.  The range for a file begins at byte 0 and ends at the value of “fileSize” minus 1.  The length of the file is also reported as the denominator in the value of the Content-Range response header when calling a Get Export File endpoint.  If a retrieval fails partway through, it can be resumed later.  For example, if you try to retrieve a file 1,000 bytes long, but only the first 725 bytes were received, the retrieval can be retried from the point of failure by calling the endpoint again and passing a new range:
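  Range: bytes=725-999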

This will return the remaining 275 bytes of the file.

Cancelling a Job

If a job was configured incorrectly, or becomes unnecessary, it can be easily cancelled:
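(A sketch; the export ID is the illustrative placeholder from above.)

  POST /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/cancel.json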

This will respond with a status indicating that the job has been cancelled.
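An illustrative response:

  {
    "requestId": "c569#1549b61a0d8",
    "result": [
      {
        "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
        "format": "CSV",
        "status": "Cancelled",
        "createdAt": "2017-01-21T11:47:30-08:00"
      }
    ],
    "success": true
  }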