
Understanding and Creating a Sankey Chart in Tableau: A Comprehensive Guide - 1


In today’s data-driven world, visualizing complex relationships is more crucial than ever. Did you know that over 80% of decision-makers find data visualization essential for understanding and communicating insights effectively? Among the myriad of visualization techniques, Sankey charts stand out for their unique ability to illustrate the flow of information or resources between categories.


A Sankey chart is a type of flow diagram that visually represents the flow of data between a source dimension and a target dimension. In this chart, the width of the curves is proportional to the quantity or flow rate being represented, allowing viewers to easily understand the relationships and transitions between different categories. Sankey charts are particularly effective for illustrating how values move between groups, such as sales moving from regions to customer segments or resources flowing between different processes.


This guide will show you how to build a Sankey chart in Tableau, covering the creation of essential calculated fields like ToPad, T, Rank1, Rank2, Curve, and the Sigmoid function for smoother curves.



Step 1: Data Union as the Primary Step for a Sankey Chart


Unioning the base table with itself is the first step in building a Sankey chart: it duplicates every row, which gives us the extra records needed to generate enough points for smooth curves between categories.


Scenario Setup


We have three tables in the data source:

  • Parent Table: Orders

  • Child Table 1: People

  • Child Table 2: Returns


The Parent Table (Orders) forms relationships with Child Table 1 (People) and Child Table 2 (Returns). We’ll union the Parent Table (Orders) to itself to generate duplicates that can be used in the Sankey chart.


How to Union Tables in Tableau


Data Source Tab: Drag the Orders table onto the canvas.



Establish Relationships: Before unioning, create necessary relationships with related tables (like People and Returns).



Unioning: Right-click on the Orders table (the base table) and click "Convert to Union".



Drag the Orders table again to the Union Window that appears.



Click "OK". You should now see the unioned table name listed.




Remember, only the base table should be unioned, not the related tables.


This method ensures cleaner data management and maintains the integrity of your visualizations!
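As a rough illustration of what a self-union does to the data, the sketch below duplicates a tiny Orders table in Python. The sample rows and the `Copy` column are assumptions made up for this example; they are not Tableau's own field names or real data.

```python
# Hypothetical sample rows standing in for the Orders table.
orders = [
    {"Region": "East", "Segment": "Consumer", "Sales": 120.0},
    {"Region": "West", "Segment": "Corporate", "Sales": 80.0},
]


def union_with_self(rows):
    """Return every row twice, tagged with the copy it came from."""
    first = [{**row, "Copy": "original"} for row in rows]
    second = [{**row, "Copy": "duplicate"} for row in rows]
    return first + second


unioned = union_with_self(orders)

# The row count doubles, and so does any measure summed over all rows.
total_before = sum(row["Sales"] for row in orders)
total_after = sum(row["Sales"] for row in unioned)
print(len(orders), len(unioned))
print(total_before, total_after)
```

Because every row now appears twice, any total computed over the unioned data is doubled unless one copy is filtered out or the measure is adjusted.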


Step 2: Create Calculated Fields for the Sankey Chart


Once the union is done, you will need to create several calculated fields to build the Sankey chart. Each calculated field serves a specific role in generating the bar charts and smooth curves.


1. ToPad Field


Purpose: Creates multiple data points by duplicating rows in the source dimension’s dataset, ensuring smoother curves.


Example Calculated Field Syntax for ToPad (Sales between Region and Segment):

IF [Region] = 'East' THEN 49

ELSEIF [Region] = 'West' THEN 49

ELSEIF [Region] = 'South' THEN 49

ELSEIF [Region] = 'Central' THEN 49

ELSE 0

END


If your dataset contains sales data between Regions (East, West, South, Central) and Segments (e.g., Consumer, Corporate, Home Office), the ToPad field creates 49 duplicates for each region. For example, if you have 10 unique rows for each region, the ToPad field will multiply these 10 rows by 49, resulting in a total of 490 data points for each region, which ensures smoother curves in the chart.


How Changing the Value Affects Visualization:


  • 49: Provides smooth curves, giving enough points for a visually appealing Sankey chart.

  • Increasing to 100: Creates even smoother curves by adding more data points, but also increases the number of marks the chart has to draw.

  • Reducing to 25: Reduces the number of points, potentially making the curves less smooth but simplifying the visualization.


The ToPad field ensures that sales between Regions and Segments are represented with smooth transitions. The field always works on the source dimension’s dataset, creating multiple points that facilitate the accurate visualization of relationships between source and target dimensions in the Sankey chart.
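The arithmetic behind ToPad can be sketched in a few lines of Python. This mirrors the IF/ELSEIF calculation above and the "10 rows times 49" example; the function names `to_pad` and `points_for` are hypothetical helpers for illustration, not Tableau functions.

```python
# Padding value per region, matching the calculated field above.
PAD_BY_REGION = {"East": 49, "West": 49, "South": 49, "Central": 49}


def to_pad(region):
    """Mirror the IF/ELSEIF logic: 49 for the four known regions, else 0."""
    return PAD_BY_REGION.get(region, 0)


def points_for(region, unique_rows):
    """Points available if each unique row is repeated to_pad(region) times."""
    return unique_rows * to_pad(region)


print(points_for("East", 10))   # 10 rows * 49 = 490
print(points_for("North", 10))  # unknown region falls through to ELSE 0

# Effect of choosing a different padding value for 10 unique rows.
for pad in (25, 49, 100):
    print(pad, 10 * pad)
```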


2. Creating Bins for ToPad


Creating a Bin:


After defining the ToPad field, the next step is to create a bin to manage how data points are grouped.


  • Right-click on the ToPad field and select Create > Bins.

  • Set the bin size to 1.


Impact of Bin Size:


  • Bin Size of 1: This allows each of the 49 duplicated points for each region to be visualized individually, providing maximum granularity.

  • Larger Bin Sizes (e.g., 5 or 10): Larger bins reduce the granularity, grouping the data into broader intervals. While this may simplify the visualization, it can also obscure finer details of the flow.


Setting the bin size to 1 is generally recommended for Sankey charts, as it ensures each of the duplicated points from ToPad is clearly represented.
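To see why the bin size matters, the following Python sketch counts how many distinct bins 49 padded positions fall into. It assumes, for illustration only, that a bin is labeled by the lower edge of its interval; `bin_start` and `distinct_bins` are hypothetical helper names.

```python
import math


def bin_start(value, size):
    """Lower edge of the bin that contains value (assumed binning rule)."""
    return math.floor(value / size) * size


def distinct_bins(positions, size):
    """Sorted list of the distinct bins the positions land in."""
    return sorted({bin_start(p, size) for p in positions})


positions = range(1, 50)  # the 49 padded positions: 1, 2, ..., 49

for size in (1, 5, 10):
    print(size, len(distinct_bins(positions, size)))
```

With a bin size of 1 all 49 positions stay separate; with sizes 5 and 10 they collapse into far fewer groups, which is why the curve loses detail.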


3. The T Variable


Purpose of the T Variable:

The T variable is used to normalize the data points generated by ToPad, ensuring that the values fall within a range of 0 to 1. This normalization is essential for creating smooth curves in the Sankey chart and visualizing relationships between categories accurately.


Types of T Variables:


  • Fixed T: A constant formula is applied to all data points. For example, the formula for fixed T is:


    T = [T] / TOTAL([ToPad])


    Here the [T] on the right-hand side is the raw, un-normalized value of the point. A calculated field cannot reference itself, so in practice the raw value needs to live in a separately named field. The formula divides each raw value by the total sum of ToPad values, ensuring that each point is proportionate to the total.


    If you have 490 data points per region (from 10 unique rows with 49 duplicates each), the T variable will normalize each value. For instance, if one raw value of T is 45 and the total of ToPad is 4900, the normalized T value would be:


    T = 45 / 4900 ≈ 0.0092


    This process is repeated for all 490 values, resulting in a range of normalized values between 0 and 1. These normalized values allow for precise and smooth plotting of curves in the Sankey chart.


  • Dynamic T: The T value can change based on conditions, filters, or specific segments. This allows for flexibility in how the T value behaves depending on the dataset or visualization requirements.
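The normalization step can be checked with a short Python sketch. The first function reproduces the division used in the worked example (45 / 4900). The second, `evenly_spaced`, is an assumed alternative shown only for comparison: it spreads a given number of points evenly from 0 to 1, which is the kind of range the curve formulas later in this guide expect.

```python
def normalize(raw_t, total):
    """Divide a raw point value by the total, as in the fixed-T formula."""
    return raw_t / total


def evenly_spaced(n):
    """n positions from 0.0 to 1.0 inclusive (assumed helper, n >= 2)."""
    return [i / (n - 1) for i in range(n)]


# Worked example from the text.
print(round(normalize(45, 4900), 4))  # 0.0092

# 49 points spread over the full 0..1 range.
positions = evenly_spaced(49)
print(positions[0], positions[24], positions[-1])
```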



4. The Rank1 and Rank2 Fields


The Purpose of Rank:


In Sankey charts, rank is used to order the flow between two dimensions based on a measure, such as sales. Ranking is essential to ensure that the flow between the source and target dimensions is visually meaningful and accurately represents the data. In a Sankey chart, two ranks are needed—one for the source dimension and one for the target dimension—because we are visualizing the transition between two distinct entities.


These two ranks are necessary because Sankey charts show the flow from one group (source) to another (target), and without ranking both sides, the flow may not properly align, making the visualization unclear.


Example Calculated Field for Ranks:


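Rank1: RANK(SUM([Sales]), 'desc')

Rank2: RANK(SUM([Sales]), 'desc')

Both fields share the same formula; what distinguishes them is the dimension each one is computed along (Region for Rank1, Segment for Rank2).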


How to Apply These Ranks


  • Rank1 orders the source dimension (Region), ensuring that regions with the highest sales appear at the top of the chart.

  • Rank2 orders the target dimension (Segment), ensuring that segments with the highest sales appear at the top of their group.


Using these ranks creates a more intuitive flow in the Sankey chart, where the size of the flow between the two ranks (Region → Segment) visually represents the sales contribution of each pair.
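The idea of ranking each side by total sales can be sketched in Python as follows. The sales figures are invented for illustration, and `rank_desc` is a hypothetical helper that assigns rank 1 to the largest total, with tied totals sharing a rank; it is a sketch of the concept rather than a reproduction of Tableau's table calculation.

```python
from collections import defaultdict

# Hypothetical (region, segment, sales) rows.
rows = [
    ("East", "Consumer", 300),
    ("East", "Corporate", 200),
    ("West", "Consumer", 400),
    ("West", "Home Office", 250),
    ("South", "Corporate", 100),
    ("Central", "Consumer", 200),
    ("Central", "Home Office", 100),
]


def totals_by(rows, index):
    """Sum sales grouped by the column at the given tuple index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[2]
    return dict(totals)


def rank_desc(totals):
    """Rank keys by value, largest first; tied values share a rank."""
    ordered = sorted(totals.values(), reverse=True)
    return {key: ordered.index(value) + 1 for key, value in totals.items()}


rank1 = rank_desc(totals_by(rows, 0))  # source dimension: Region
rank2 = rank_desc(totals_by(rows, 1))  # target dimension: Segment
print(rank1)
print(rank2)
```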


5. Sigmoid Field


The Sigmoid Function is crucial for creating smooth transitions in visualizations like Sankey diagrams. It provides a natural flow between dimensions, making it essential for accurate data representation.


Default Calculated Field Value:

Sigmoid = 1 / (1 + EXP(-[T]))


Output Range:


  • The sigmoid function outputs values strictly between 0 and 1. With T normalized to the range 0 to 1, the formula above covers only part of the S-curve, from 0.5 to about 0.73.

  • At T = 0: The sigmoid value is exactly 0.5.

  • At T = 0.5: The value is about 0.62.

  • At T = 1: The value is about 0.73.

  • Reaching a range centered on 0, as the Curve field in the next section aims to, requires shifting the output by subtracting 0.5.


Importance of the Range:


  • Balanced Flow: Once the output is centered on 0, the transition from negative to positive creates a visually balanced appearance.

  • Smoothing Transitions: The characteristic S-shape ensures gradual changes, avoiding sharp angles in the Sankey chart.


Customization Options:


  • Adjusting the Formula: You can modify parameters, such as the steepness of the curve, to tailor the sigmoid characteristics to your dataset.
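To get a feel for the steepness parameter, here is a small Python sketch that evaluates the sigmoid at T = 0, 0.5 and 1 for a few steepness values. The `steepness` argument is an assumed generalization of the formula above (a steepness of 1 reproduces it exactly).

```python
import math


def sigmoid(t, steepness=1.0):
    """Logistic function 1 / (1 + e^(-steepness * t))."""
    return 1.0 / (1.0 + math.exp(-steepness * t))


for k in (1, 6, 12):
    values = [round(sigmoid(t, k), 4) for t in (0.0, 0.5, 1.0)]
    print(k, values)
```

Whatever the steepness, the value at T = 0 stays at 0.5; a larger steepness only makes the output approach 1 more quickly as T grows.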


6. Curve Field


The Curve Field is derived from the Sigmoid Function and is essential for smooth transitions.


Default Calculated Field Value:

Curve = (1 / (1 + EXP(-12 * [T]))) - 0.5


Output Range:


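  • As written, with T running from 0 to 1, the Curve formula produces values from 0 to about 0.5.

  • At T = 0: Exactly 0.

  • At T = 0.5: About 0.4975.

  • At T = 1: Approximately 0.5.

  • To obtain a symmetric range of about -0.5 to 0.5 with the midpoint at T = 0.5, center T first, for example (1 / (1 + EXP(-12 * ([T] - 0.5)))) - 0.5. This gives about -0.4975 at T = 0, exactly 0 at T = 0.5, and about 0.4975 at T = 1.


Importance of the Range:


  • Centered Flow: With T centered, the curve passes through 0 at the midpoint and stays symmetric around it, creating a balanced appearance.

  • Smoothing the Transition: The sigmoid shape ensures gradual rises and falls, avoiding sharp angles.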


Customization Options:

  • Adjusting the Formula: Modify parameters to fit specific visualization needs, such as the steepness of the curve.
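The effect of centering T can be verified numerically. The Python sketch below evaluates the Curve formula with T in the range 0 to 1; the `center` argument is an assumption added for illustration (0.0 reproduces the formula as written, 0.5 shifts the midpoint of the S-curve to T = 0.5).

```python
import math


def curve(t, steepness=12.0, center=0.0):
    """Shifted logistic: 1 / (1 + e^(-steepness * (t - center))) - 0.5."""
    return 1.0 / (1.0 + math.exp(-steepness * (t - center))) - 0.5


for label, center in (("as written", 0.0), ("centered", 0.5)):
    values = [round(curve(t, center=center), 4) for t in (0.0, 0.5, 1.0)]
    print(label, values)
```

With `center=0.0` the output only covers the upper half of the S-shape over this range of T, while `center=0.5` yields values that are symmetric around 0.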


| Field | Purpose | Note to remember |
| --- | --- | --- |
| ToPad | Duplicates data points for smoother transitions. | Generate code based on the source dimension |
| Bins for ToPad | Groups duplicated points for granularity. | Set bin size to 1 |
| T Variable | Normalizes data points for smooth curve plotting. | Calculated field: T = [T] / TOTAL([ToPad]); can be changed based on user need |
| Rank1 | Ranks the source dimension based on a measure. | Create a calculated field using any aggregation of the measure |
| Rank2 | Ranks the target dimension based on a measure. | Create a calculated field using any aggregation of the measure |
| Sigmoid | Creates smooth transitions between dimensions. | Calculated field: Sigmoid = 1 / (1 + EXP(-[T])); can be changed based on user need |
| Curve | Derives values for natural flow between dimensions. | Calculated field: Curve = (1 / (1 + EXP(-12 * [T]))) - 0.5; can be changed based on user need |


You can use different names for the calculated fields as long as they are clearly defined and understood in the context of your analysis. Consistent naming helps maintain clarity, especially when sharing your work with others. However, feel free to customize names to suit your specific project or preferences.


With a comprehensive understanding of the essential calculated fields required for creating a Sankey chart, you are now prepared to develop a basic visualization that effectively illustrates the flow between categories. In our next installment, we will explore a practical example, demonstrating how to apply these concepts using a real-world case study. Let’s embark on this journey together!
