Bulk Extract

Marketo provides interfaces for retrieval of large sets of person and person-related data, called Bulk Extract.  Currently, interfaces are offered for four object types:

  • Leads (Persons)
  • Activities
  • Program Members
  • Custom Objects

Bulk extract is performed by creating a job, defining the set of data to retrieve, enqueuing the job, waiting for the job to complete writing a file, and then retrieving the file over HTTP.  These jobs are executed asynchronously, and can be polled to retrieve the status of the export.

Note: Bulk API endpoints are not prefixed with ‘/rest’ like other endpoints.

Authentication

The bulk extract APIs use the same OAuth 2.0 authentication method as other Marketo REST APIs.  This requires a valid access token to be embedded either as the query-string parameter “access_token={AccessToken}”, or as an HTTP header “Authorization: Bearer {AccessToken}”.
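As a sketch, the two ways of passing the token look like this (the endpoint host and access token below are placeholder values):

```python
# Minimal sketch of the two authentication options described above.
# The host and token are placeholders; substitute your own values.
BASE_URL = "https://123-ABC-456.mktorest.com"
ACCESS_TOKEN = "cdf01657-110d-4155-99a7-f986b2ff13a0"  # placeholder

# Option 1: access token as a query-string parameter
url_with_token = f"{BASE_URL}/bulk/v1/leads/export.json?access_token={ACCESS_TOKEN}"

# Option 2: access token in an HTTP header
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
```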

Limits

  • Max Concurrent Export Jobs: 2
  • Max Queued Export Jobs (inclusive of currently exporting jobs): 10
  • File Retention Period: 7 days
  • Default Daily Export Allocation: 500MB (which resets daily at 12:00AM CST).  Increases available for purchase.
  • Max Time Span for Date Range Filter (createdAt or updatedAt): 31 days

Bulk Lead Extract filters for UpdatedAt and Smart List are unavailable for some subscription types.  If unavailable, a call to the Create Export Lead Job endpoint will return an error “1035, Unsupported filter type for target subscription”.  Customers may contact Marketo Support to have this functionality enabled in their subscription.

Queue

The bulk extract APIs use a job queue (shared between leads, activities, program members, and custom objects).  Extract jobs must first be created, and then enqueued by calling Create Export Lead/Activity/Program Member Job and Enqueue Export Lead/Activity/Program Member Job endpoints.  Once enqueued, the jobs are pulled from the queue and started when computing resources become available.

The maximum number of jobs in the queue is 10.  If you try to enqueue a job when the queue is full, the Enqueue Export Job endpoint will return an error “1029, Too many jobs in queue”.  A maximum of 2 jobs can run concurrently (status is “Processing”).

File Size

The bulk extract APIs are metered based on the size-on-disk of the data retrieved by a bulk extract job.  The explicit size in bytes for a job can be determined by reading the fileSize attribute from the completed status response of an export job.

The daily quota is a maximum of 500MB per day, which is shared between leads, activities, program members, and custom objects.  When the quota is exceeded, you cannot Create or Enqueue another job until the daily quota resets at midnight Central Time.  Until that time, an error “1029, Export daily quota exceeded” is returned.  Aside from the daily quota, there is no maximum file size.

Once a job is queued or processing, it will run to completion (barring an error or job cancellation).  If a job fails for some reason, you will need to recreate it.  Files are fully written only when a job reaches the Completed state (partial files are never written).  You can verify that a file was fully written by computing its SHA-256 hash and comparing that with the checksum returned by the job status endpoints.

You can determine the total amount of disk space used for the current day by calling Get Export Lead/Activity/Program Member Jobs.  These endpoints return a list of all jobs created in the past 7 days.  You can filter that list down to just the jobs that completed in the current day (using the status and finishedAt attributes), then sum the file sizes for those jobs to produce the total amount used.  There is no way to delete a file to reclaim disk space.
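The steps above can be sketched as follows; the job records are illustrative stand-ins for the result array returned by a Get Export Jobs call:

```python
from datetime import datetime

# Illustrative stand-in for the "result" array returned by a
# Get Export Jobs endpoint; fileSize is in bytes.
jobs = [
    {"exportId": "a", "status": "Completed",
     "finishedAt": "2017-01-21T11:47:30Z", "fileSize": 122880},
    {"exportId": "b", "status": "Completed",
     "finishedAt": "2017-01-20T08:10:00Z", "fileSize": 512000},
    {"exportId": "c", "status": "Failed",
     "finishedAt": "2017-01-21T09:00:00Z"},
]

def bytes_used_on(jobs, day):
    """Sum fileSize for jobs that completed on the given date."""
    total = 0
    for job in jobs:
        if job.get("status") != "Completed":
            continue
        finished = datetime.strptime(job["finishedAt"], "%Y-%m-%dT%H:%M:%SZ")
        if finished.date() == day:
            total += job.get("fileSize", 0)
    return total

used = bytes_used_on(jobs, datetime(2017, 1, 21).date())
print(used)  # only the job that completed on 2017-01-21 counts: 122880
```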

Permissions

Bulk Extract uses the same permissions model as the Marketo REST API and does not require any additional special permissions, though specific permissions are required for each set of endpoints.

Bulk Extract jobs are only accessible to the API user that created them; this includes polling for status and retrieving file contents.

Note that Bulk Extract endpoints are not aware of Marketo workspaces.  Extraction requests always include data across all workspaces, regardless of how you define the API Only User for your Custom Service.

Creating a Job

Marketo’s bulk extract APIs use the concept of a job for initiating and executing data extraction.  Let’s look at creating a simple lead export job.

This simple request will construct a job that will return the values contained in the “firstName” and “lastName” fields, with the column headers “First Name” and “Last Name” as a CSV file, containing each lead created between January 1st 2017 and January 31st 2017.

When we create the job it will return a job id in the exportId attribute.  We can then use this job id to enqueue the job, cancel it, check its status, or retrieve the completed file.
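Putting the pieces together, the request body for the example above might be built like this (parameter names follow the Bulk Lead Extract documentation; verify field names against your own instance):

```python
import json

# Sketch of a Create Export Lead Job request body matching the example
# in the text: firstName/lastName columns, renamed headers, CSV format,
# and a createdAt filter spanning January 2017.
create_job_body = {
    "format": "CSV",
    "fields": ["firstName", "lastName"],
    "columnHeaderNames": {
        "firstName": "First Name",
        "lastName": "Last Name",
    },
    "filter": {
        "createdAt": {
            "startAt": "2017-01-01T00:00:00Z",
            "endAt": "2017-01-31T00:00:00Z",
        }
    },
}
print(json.dumps(create_job_body, indent=2))
```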

Common Parameters

Each job creation endpoint shares some common parameters for configuring the file format, field names, and filter of a bulk extract job.  Each subtype of extract job may have additional parameters:

  • format (String): Determines the file format of the extracted data, with options for comma-separated, tab-separated, and semicolon-separated values.  Accepts one of: CSV, SSV, TSV.  Defaults to CSV.
  • columnHeaderNames (Object): Sets the names of column headers in the returned file.  Each member key is the name of the column header to rename, and its value is the new name.  For example:

    "columnHeaderNames": {
        "firstName": "First Name",
        "lastName": "Last Name"
    }

  • filter (Object): Filter applied to the extract job.  Types and options vary between job types.

Retrieving Jobs

In some cases, you may need to retrieve your recent jobs.  This is easily done with the Get Export Jobs endpoint for the corresponding object type.  Each Get Export Jobs endpoint supports a status filter field, a batchSize to limit the number of jobs returned, and a nextPageToken for paging through large result sets.  The status filter accepts any valid status for an export job: Created, Queued, Processing, Cancelled, Completed, and Failed.  The batchSize has a maximum and default of 300.  Let’s get the list of Lead Export Jobs:
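A hypothetical request (the host and token are placeholder values) might look like:

```http
GET /bulk/v1/leads/export.json?status=Completed,Failed&batchSize=300 HTTP/1.1
Host: 123-ABC-456.mktorest.com
Authorization: Bearer cdf01657-110d-4155-99a7-f986b2ff13a0
```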

The endpoint will respond with status response of each job created in the past 7 days for that object type in the result array.  The response will only include results for jobs owned by the API user making the call.

Starting a Job

With our job id in hand, let’s start the job:
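A hypothetical Enqueue Export Lead Job request (the exportId, host, and token are placeholder values) might look like:

```http
POST /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/enqueue.json HTTP/1.1
Host: 123-ABC-456.mktorest.com
Authorization: Bearer cdf01657-110d-4155-99a7-f986b2ff13a0
```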

This kicks off the execution of the job and returns a status response.  Since the export is always performed asynchronously, we will need to poll the status of the job to determine whether it has completed.  Note that the status for a given job is not updated more frequently than once every 60 seconds, so the status should never be polled more frequently than that.  Keep in mind, however, that the great majority of use cases should never require polling more frequently than once every 5 minutes.  Data from each successful export is retained for 7 days (see the File Retention Period limit above).

Polling Job Status

Determining the status of the job is simple.
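A polling loop along these lines, with a placeholder fetch function standing in for the Get Export Job Status call, might look like:

```python
import time

TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled"}

def wait_for_job(fetch_status, poll_interval=300, max_polls=100):
    """Poll until the job reaches a terminal state.

    fetch_status is any callable returning the job's current status
    string (e.g. a wrapper around a Get Export Job Status request).
    poll_interval defaults to 5 minutes, per the guidance above.
    """
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish within max_polls polls")

# Illustration with a canned status sequence instead of real API calls:
statuses = iter(["Queued", "Processing", "Completed"])
result = wait_for_job(lambda: next(statuses), poll_interval=0)
print(result)  # Completed
```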

Note: Status can only be polled by the same API user that created the job.

The inner status member will indicate the progress of the job, and may be one of the following values: Created, Queued, Processing, Cancelled, Completed, Failed.  In this case our job has completed, so we can stop polling and continue on to retrieve the file.  When completed, the fileSize member indicates the total length of the file in bytes, and the fileChecksum member contains the SHA-256 hash of the file.  Job status is available for 30 days after Completed or Failed status was reached.

Retrieving Your Data

When your job has completed, you can easily retrieve the file.
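A hypothetical Get Export Lead File request (the exportId, host, and token are placeholder values) might look like:

```http
GET /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/file.json HTTP/1.1
Host: 123-ABC-456.mktorest.com
Authorization: Bearer cdf01657-110d-4155-99a7-f986b2ff13a0
```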

The endpoint will respond with the contents of the file, formatted as configured by the job.  If a job has not completed, or a bad job ID is passed, file endpoints will respond with a status of 404 Not Found and a plaintext error message as the payload, unlike most other Marketo REST endpoints.

To support partial and resumable retrieval of extracted data, the file endpoint optionally supports the HTTP Range header of type bytes (per RFC 7233).  If the header is not set, the entire contents will be returned.  To retrieve the first 10,000 bytes of a file (bytes 0 through 9999), you would pass the following header as part of your GET request to the endpoint:
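Per RFC 7233 byte-range syntax:

```http
Range: bytes=0-9999
```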

When retrieving a partial file, the endpoint will respond with status code 206 and return the Accept-Ranges, Content-Length, and Content-Range headers:
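Illustrative response headers, assuming the 10,000-byte request above against a hypothetical 1,048,576-byte file:

```http
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Length: 10000
Content-Range: bytes 0-9999/1048576
```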

Partial Retrieval and Resumption

Files can be retrieved in part, and retrieval can be resumed at a later time, using the Range header.  The range for a file begins at byte 0 and ends at the value of fileSize minus 1.  The length of the file is also reported as the denominator in the value of the Content-Range response header when calling a Get Export File endpoint.  If a retrieval fails partway, it can be resumed later.  For example, if you try to retrieve a file 1,000 bytes long but only receive the first 725 bytes, the retrieval can be retried from the point of failure by calling the endpoint again and passing a new range:
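Continuing the 1,000-byte example:

```http
Range: bytes=725-999
```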

This will return the remaining 275 bytes of the file.

File Integrity Verification

The job status endpoints return a checksum in the fileChecksum attribute when status is “Completed”.  The checksum is a SHA-256 hash of the exported file.  You can compare the checksum with the SHA-256 hash of the retrieved file to verify that it is complete.

Here is an example response containing the checksum:
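A completed status response might look like this (all values, including the response field names shown, are illustrative):

```json
{
  "requestId": "e42b#14272d07d78",
  "success": true,
  "result": [
    {
      "exportId": "ce45a7a1-f19d-4ce2-882c-a3c795940a7d",
      "status": "Completed",
      "createdAt": "2017-01-21T11:47:30Z",
      "queuedAt": "2017-01-21T11:48:30Z",
      "startedAt": "2017-01-21T11:51:30Z",
      "finishedAt": "2017-01-21T12:59:30Z",
      "format": "CSV",
      "numberOfRecords": 122323,
      "fileSize": 123424,
      "fileChecksum": "sha256:d9c73f0b6960c71623c71843b2e5904a0bb7f4aa901b063e8d9d63dd00c65a7b"
    }
  ]
}
```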

Here is an example of creating the SHA-256 hash of a retrieved file named “bulk_lead_export.csv” using the sha256sum command line utility:
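A minimal sketch (for illustration, a small stand-in file is created first so the command has something to hash; with a real export, just run sha256sum on the downloaded file):

```shell
# Create a placeholder export file for illustration only.
printf 'First Name,Last Name\nDaffy,Duck\n' > bulk_lead_export.csv

# Compute the SHA-256 hash; compare the first field of the output
# with the fileChecksum value from the job status response.
sha256sum bulk_lead_export.csv
```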

Cancelling a Job

If a job was configured incorrectly, or becomes unnecessary, it can be easily cancelled:
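A hypothetical Cancel Export Lead Job request (the exportId, host, and token are placeholder values) might look like:

```http
POST /bulk/v1/leads/export/ce45a7a1-f19d-4ce2-882c-a3c795940a7d/cancel.json HTTP/1.1
Host: 123-ABC-456.mktorest.com
Authorization: Bearer cdf01657-110d-4155-99a7-f986b2ff13a0
```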

This will respond with a status indicating that the job has been cancelled.