# Cerebras

### What is Cerebras?

Cerebras is a high-performance inference provider built on the Wafer-Scale Engine, the world's largest chip, purpose-built to accelerate large language model inference. Unlike GPU-based providers, Cerebras hardware processes an entire model on a single wafer, eliminating memory-bandwidth bottlenecks and delivering significantly faster token generation. Stack AI integrates natively with Cerebras, letting you connect your Cerebras API key and use Llama-family models directly inside your workflows through both the LLM node and the Cerebras action node.

***

### How to use it?

Add an **LLM** node or a **Cerebras** action node to your workflow, select Cerebras as the provider, choose a model, configure your generation parameters, and run. Stack AI handles authentication, request routing, and output parsing automatically.

<figure><img src="https://3697023207-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFSlso1Kjob5CLDrh0dVn%2Fuploads%2Fakm7aB0RwRdEF8thCQnN%2Fimage.png?alt=media&#x26;token=73052d51-7953-479c-9bf1-f89a31691987" alt=""><figcaption></figcaption></figure>

***

### Benefits

#### Speed and Throughput

1. **Wafer-scale inference**: Cerebras processes models on a single chip, eliminating inter-chip communication overhead that slows GPU clusters.
2. **Low latency for real-time applications**: High token-per-second throughput makes Cerebras well-suited for chatbots, live document generation, and interactive pipelines.
3. **Consistent performance under load**: The hardware architecture maintains throughput without the variability common in shared GPU infrastructure.

#### Model Selection

1. **Llama 4 Scout 17B**: A compact, instruction-tuned model optimized for speed-critical tasks where low latency matters most.
2. **Llama 3.3 70B**: A high-capability model balancing quality and speed for complex reasoning, summarization, and generation tasks.
3. **Llama 3.1 8B**: The lightest option for high-volume, cost-sensitive pipelines with straightforward generation requirements.

#### Ease of Use

1. **Direct API key authentication**: Connect with a single API key - no OAuth flow or additional setup required.
2. **Works in LLM node and action node**: Use the standard LLM node for quick setup or the dedicated Cerebras action node when you need granular parameter control and structured outputs.
3. **Managed connection available**: Stack AI provides an organization-level managed connection, so your team can use Cerebras without each member supplying their own key.

***

### How It Works

#### Authentication Flow

* Stack AI stores your Cerebras API key as an encrypted credential in your organization's connection vault.
* At workflow runtime, Stack AI retrieves the credential and attaches it as a Bearer token on each request to `https://api.cerebras.ai/v1`.
* The connection health check validates your key by listing available models before any workflow execution (see the sketch below).
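
For reference, here is a minimal sketch of that health check at the HTTP level. It assumes Cerebras's OpenAI-compatible `GET /models` endpoint; the endpoint path and response shape are Cerebras API conventions, not Stack AI internals.

```python
import os

import requests

# Minimal health-check sketch: list available models using the same
# Bearer-token authentication Stack AI attaches at runtime.
API_BASE = "https://api.cerebras.ai/v1"
api_key = os.environ["CEREBRAS_API_KEY"]  # never hard-code credentials

resp = requests.get(
    f"{API_BASE}/models",
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=10,
)
resp.raise_for_status()  # a 401 here means the key is invalid or revoked

for model in resp.json()["data"]:
    print(model["id"])  # e.g. llama3.3-70b
```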

#### Request Execution

* When a Cerebras node fires, Stack AI constructs a chat message from your `prompt` input, applies your generation parameters (`temperature`, `max_tokens`, `top_p`, `frequency_penalty`, `presence_penalty`), and sends the request to the Cerebras API (see the sketch after this list).
* The response is parsed and surfaced as structured outputs: generated text, finish reason, model identifier, and token usage counts.
* Finish reasons are normalized across providers: `stop` (natural completion), `length` (hit `max_tokens`), `tool_calls`, or `error`.
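
To make this concrete, the sketch below issues the equivalent raw chat-completion call and reads back the same fields the node surfaces. It assumes Cerebras's OpenAI-compatible `POST /chat/completions` endpoint; the prompt and parameter values are illustrative, and this is not Stack AI's internal request code.

```python
import os

import requests

API_BASE = "https://api.cerebras.ai/v1"
headers = {"Authorization": f"Bearer {os.environ['CEREBRAS_API_KEY']}"}

# Build the chat request the way the node does: one user message from the
# `prompt` input plus the configured generation parameters.
payload = {
    "model": "llama3.3-70b",
    "messages": [
        {"role": "user", "content": "Summarize wafer-scale inference in two sentences."}
    ],
    "temperature": 0.4,
    "max_tokens": 256,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
}

resp = requests.post(
    f"{API_BASE}/chat/completions", headers=headers, json=payload, timeout=30
)
resp.raise_for_status()
data = resp.json()

choice = data["choices"][0]
print(choice["message"]["content"])   # generated text
print(choice["finish_reason"])        # "stop" (natural) or "length" (hit max_tokens)
print(data["usage"]["total_tokens"])  # prompt + completion token count
```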

#### LLM Node vs. Action Node

* The **LLM node** uses a shared provider interface — select Cerebras from the provider dropdown, pick a model, and the node handles the rest. Best for standard conversational or generation patterns.
* The **Cerebras action node** exposes all generation parameters explicitly and outputs structured fields including token counts. Best when your downstream nodes need precise control or usage data.

***

### Setting Up a Connection

#### Step 1: Get your Cerebras API key

<figure><img src="https://3697023207-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFSlso1Kjob5CLDrh0dVn%2Fuploads%2F7koAEzh8p4wEP4DWqjVV%2Fimage.png?alt=media&#x26;token=50c692d9-c58b-40e8-ada0-46a837308548" alt=""><figcaption></figcaption></figure>

Navigate to [cloud.cerebras.ai](https://cloud.cerebras.ai/), sign in to your account, and go to the **API Keys** section. Create a new key and copy it — you will not be able to view it again after leaving the page.

#### Step 2: Open the Connections page in Stack AI

In your Stack AI workspace, navigate to **Settings > Connections**. Click **New Connection** and search for or select **Cerebras** from the provider list.

#### Step 3: Enter your API key and save

<figure><img src="https://3697023207-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FFSlso1Kjob5CLDrh0dVn%2Fuploads%2FNoh8Y8wrAvJCeIlgef6B%2Fimage.png?alt=media&#x26;token=8b394257-c71c-4543-ab08-1486be1f96c6" alt=""><figcaption></figcaption></figure>

Paste your API key into the **API Key** field. Click **Test** to verify the connection - Stack AI will call the Cerebras API to confirm your key is valid. Once the test passes, click **Save**. Your connection is now available to all workflows in your organization.

#### Step 4: Verify the connection

After saving, the connection appears in your **Connections** list. The health-check indicator confirms Stack AI can reach the Cerebras API with your key.

***

### Available Models

| Model ID                         | Display Name      | Best For                                                                     |
| -------------------------------- | ----------------- | ---------------------------------------------------------------------------- |
| `llama-4-scout-17b-16e-instruct` | Llama 4 Scout 17B | Low-latency tasks, real-time applications, high-throughput pipelines         |
| `llama3.3-70b`                   | Llama 3.3 70B     | Complex reasoning, summarization, structured generation, default general use |
| `llama3.1-8b`                    | Llama 3.1 8B      | High-volume, cost-sensitive pipelines with simple generation requirements    |

***

### Using Cerebras in a Workflow

#### Option A - LLM Node

1. Add an **LLM** node to your workflow canvas.
2. In the node configuration panel, set **Provider** to **Cerebras**.
3. Select a model from the **Model** dropdown.
4. Connect your prompt input and configure any standard LLM node parameters.
5. Wire the node output to downstream nodes and run.

#### Option B - Cerebras Action Node

1. Add a **Cerebras** action node from the integrations panel — search for "Cerebras" in the node sidebar.
2. Select **Text Completion** as the action.
3. Connect or configure the input parameters described below.
4. Map the output fields to downstream nodes.

**Input Parameters**

| Parameter           | Type   | Required | Default                          | Range      | Description                                                              |
| ------------------- | ------ | -------- | -------------------------------- | ---------- | ------------------------------------------------------------------------ |
| `model`             | Select | Yes      | `llama-4-scout-17b-16e-instruct` | -          | The model to use for generation                                          |
| `prompt`            | String | Yes      | -                                | -          | The input text sent to the model                                         |
| `temperature`       | Number | No       | `1.0`                            | 0.0 – 2.0  | Controls output randomness; higher values produce more varied responses  |
| `max_tokens`        | Number | No       | `1000`                           | -          | Maximum number of tokens to generate                                     |
| `top_p`             | Number | No       | `1.0`                            | 0.0 – 1.0  | Lower values restrict sampling to higher-probability tokens              |
| `frequency_penalty` | Number | No       | `0.0`                            | -2.0 – 2.0 | Reduces repetition of tokens that appear frequently in the output        |
| `presence_penalty`  | Number | No       | `0.0`                            | -2.0 – 2.0 | Reduces repetition of any token that has already appeared in the output  |

**Output Parameters**

| Parameter                 | Type   | Description                                                        |
| ------------------------- | ------ | ------------------------------------------------------------------ |
| `content`                 | String | The generated text response                                        |
| `finish_reason`           | String | Why generation stopped: `stop`, `length`, `error`, or `tool_calls` |
| `model_used`              | String | The model identifier used for the request                          |
| `usage_total_tokens`      | Number | Total tokens consumed (prompt + completion)                        |
| `usage_prompt_tokens`     | Number | Tokens used by the input prompt                                    |
| `usage_completion_tokens` | Number | Tokens used by the generated response                              |
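
As a rough illustration of how these outputs relate to a raw API response, the helper below flattens an OpenAI-style chat-completion body into the node's output field names. The mapping is inferred from the descriptions above, not taken from a published Stack AI schema.

```python
def to_node_outputs(response: dict) -> dict:
    """Flatten an OpenAI-style chat-completion body into the
    Cerebras action node's output fields (assumed mapping)."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "model_used": response["model"],
        "usage_total_tokens": usage.get("total_tokens", 0),
        "usage_prompt_tokens": usage.get("prompt_tokens", 0),
        "usage_completion_tokens": usage.get("completion_tokens", 0),
    }
```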

***

### Best Practices

* **Use `llama3.3-70b` as your default starting point.** It is the default model in the LLM node and offers the best balance of capability and speed for most use cases.
* **Lower `temperature` for factual or structured outputs.** Values between `0.0` and `0.4` produce more deterministic responses suited to extraction, classification, and data transformation tasks.
* **Set `max_tokens` explicitly for pipeline reliability.** Relying on the default limit can truncate long outputs and surface unexpected `length` finish reasons downstream; set a value appropriate for your expected output length.
* **Use `frequency_penalty` and `presence_penalty` together to reduce repetition.** For long-form generation, values between `0.3` and `0.8` on both fields help maintain output variety without degrading coherence.
* **Use the action node when token usage matters.** The `usage_total_tokens`, `usage_prompt_tokens`, and `usage_completion_tokens` outputs let you track costs and enforce budget limits in your workflow logic (see the sketch after this list).
* **Prefer the managed connection for team deployments.** Using the organization-level Stack AI managed connection avoids credential sprawl and ensures your team shares a single auditable connection.
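
As a minimal sketch of the budget-tracking pattern above, the guard below accumulates `usage_total_tokens` across node runs and raises once a budget is exceeded. The budget value, function name, and failure behavior are hypothetical, illustrating the pattern rather than a built-in Stack AI feature.

```python
TOKEN_BUDGET = 50_000  # hypothetical per-run budget


def check_budget(node_outputs: dict, tokens_used_so_far: int) -> int:
    """Accumulate token usage across node runs and fail fast
    once the workflow exceeds its budget."""
    tokens_used_so_far += node_outputs["usage_total_tokens"]
    if tokens_used_so_far > TOKEN_BUDGET:
        raise RuntimeError(
            f"Token budget exceeded: {tokens_used_so_far} > {TOKEN_BUDGET}"
        )
    return tokens_used_so_far
```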

***

### Summary

Cerebras brings wafer-scale inference speed to Stack AI workflows, making it a strong choice for latency-sensitive applications and high-throughput pipelines. Connect once with your API key, then access Llama 4 Scout 17B, Llama 3.3 70B, or Llama 3.1 8B through either the LLM node for quick setup or the Cerebras action node for full parameter control and structured output.
