Basic Xcalar concepts

This section explains basic concepts you must understand to use Xcalar Design efficiently.

Understanding the purpose of Xcalar Design

Xcalar Design is an HTML5-based visual tool that enables you to interactively and intuitively design algorithms through elementary operations. By manipulating the values in the tables presented in the graphical user interface, you can clean and consolidate your data before performing data analysis. You can also publish data from Xcalar Design to a Jupyter Notebook, an open-source, web-based application, for further data analysis or visualization.

When you use Xcalar Design interactively, it generates a dataflow to show a sequence of operations leading up to each table.

If you use Xcalar in operational mode, you can save dataflows, which can run on demand or at scheduled intervals. If you use Xcalar in modeling mode, you can save dataflows but you cannot run them. To run a saved dataflow, use Xcalar Design to download it and then upload it to a Xcalar cluster with an operational mode license.

Xcalar Design can import data that is unstructured or organized in arbitrary formats. The data sources can reside on a local or cloud storage system.

Understanding Xcalar Design login

Each Xcalar Design user can have one login at a given time. Closing the browser window does not automatically log you out of Xcalar Design. If you are logged in to Xcalar Design and you try to log in again (for example, from another browser tab), a message informs you that you are already logged in elsewhere. You can either go back to the login screen or continue the login, which causes Xcalar Design to terminate your other connection.

Login names are case-insensitive.

Shared vs. shared-nothing storage for Xcalar

Xcalar's True Data in Place algorithms enable you to store source data on shared or shared-nothing storage, without having to move your data in advance to attain maximum processing speed.

Comparison between Xcalar and other architectures

Xcalar's unique architecture eliminates the need for data sharding, partitioning, or placement for node affinity, which is commonly required by other data analytics tools using software frameworks such as Hadoop and Spark. In these frameworks, you typically shard the data for locality of reference to a node where a particular algorithm is running. The placement of data occurs either at the start or when a MapReduce job begins to execute. Without proper placement, you cannot achieve optimal data processing speed.

Xcalar's nodes, however, read data in parallel and process it optimally irrespective of locality. Furthermore, while processing the data, Xcalar does not shuffle data as a MapReduce job would. Data has mass, and movement of terabytes of data for locality of reference at the start of processing, or later, during the shuffle phase, incurs huge processing and performance costs. Xcalar's capability to keep source files in their original location from beginning to end greatly simplifies deployment and improves data processing performance.

Examples of Xcalar using all cores for parallel I/O

In this example, a 4-node cluster with a total of 32 cores processes files in parallel without requiring the files to be moved from one storage space to another or redistributed among nodes. When you use Xcalar Design to import 1,024 files, the nodes of the cluster read the files in batches of 32, utilizing the full parallel I/O bandwidth of all the cores. Files can be read asynchronously or synchronously; asynchronous reads may provide higher throughput.

The following figure illustrates the cluster processing 1,024 files.

Suppose you use the same cluster to import 1,025 files. Again, the 32 cores read 32 files at a time. The last file is processed in its entirety by only one core.

Similarly, if you only need to import one file, only one core is used to process the file.
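The batch arithmetic in these examples can be sketched in a few lines of Python. This is only an illustration of the one-file-per-core model described above, not Xcalar's actual scheduler; the function name read_batches is hypothetical.

```python
def read_batches(num_files: int, num_cores: int) -> list[int]:
    """Return how many files are read in each batch, assuming one whole file per core."""
    if num_files < 0 or num_cores <= 0:
        raise ValueError("num_files must be >= 0 and num_cores must be > 0")
    full_batches, remainder = divmod(num_files, num_cores)
    batches = [num_cores] * full_batches      # batches that keep every core busy
    if remainder:
        batches.append(remainder)             # final, partially filled batch
    return batches


for file_count in (1024, 1025, 1):
    batches = read_batches(file_count, 32)
    print(f"{file_count} files: {len(batches)} batches, "
          f"last batch uses {batches[-1]} of 32 cores")
```

Under this model, 1,024 files fill 32 batches completely, 1,025 files need a 33rd batch that occupies a single core, and a single file occupies one core in one batch.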

In cases where only one core processes a source file, Xcalar does not arbitrarily break the file into chunks to parallelize I/O as that would violate the sequential integrity of the data. Remember that each core reads a file in its entirety; the core never reads a partial file. Therefore, if you want to take advantage of the bandwidth of all cores, you must split the source file into multiple smaller files in a way that can be understood by your logic when you perform modeling later on.
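One way to prepare such a split outside of Xcalar is sketched below. It assumes that records are newline-delimited, that no record contains an embedded newline, and that the file has no header row that must be repeated in each part. The function name split_by_records and the output naming scheme are hypothetical. The sketch reads the whole file into memory, so it suits only files that fit in RAM.

```python
from pathlib import Path


def split_by_records(source: Path, out_dir: Path, parts: int) -> list[Path]:
    """Split a newline-delimited file into at most `parts` files without breaking a record."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    # Work in bytes so that the original line endings are preserved exactly.
    records = source.read_bytes().splitlines(keepends=True)
    out_dir.mkdir(parents=True, exist_ok=True)
    per_file = max(1, -(-len(records) // parts))   # ceiling division
    outputs = []
    for index, start in enumerate(range(0, len(records), per_file)):
        target = out_dir / f"{source.stem}_{index:04d}{source.suffix}"
        target.write_bytes(b"".join(records[start:start + per_file]))
        outputs.append(target)
    return outputs
```

Because each output file holds whole records in their original order, concatenating the parts in name order reproduces the source file. Choosing a part count that is a multiple of the number of cores keeps the batches full.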

Understanding workbooks and worksheets

In Xcalar Design, you can create workbooks, which can be regarded as files. Within a workbook, you can create as many worksheets as you want. Worksheets in the same workbook usually contain data related to the same project. The worksheet displayed on your screen is the active worksheet.

At any time, only one workbook can be active. You can deactivate or activate a workbook as desired.

The relationship between a worksheet and a workbook is similar to the relationship between a Microsoft Excel worksheet and workbook.

The following sample screen shows a Worksheet window for a workbook titled MyNewWorkbook.

Understanding data sources, datasets, and tables

The raw data you want to analyze can be in one file, or in multiple files in the same directory or in different directories. The term data source refers to the file, files, or directory containing the raw data. For example, your data source can be an HDFS file system that your Xcalar nodes can access or a file system NFS-mounted on each node of the cluster. By reading the data at its source, Xcalar Design obtains the metadata to build a dataset. This process of reading the data and building the dataset is called importing a data source.

NOTE: For modeling, Xcalar Design does not import all records in the data source. It only imports a sample to create the dataset. The sample size, however, must be greater than the smallest file in the data source. The sample size allowed is determined by the MaxInteractiveDataSize parameter, as described in Configuring parameters.

After you import data from a data source, you can pull fields of interest from the dataset to create tables in a Xcalar Design worksheet. More fields from the dataset can be added to an active table at any time. You can move tables between worksheets but not workbooks.

You can manipulate and transform data within a table (for example, by sorting or filtering) or combine multiple tables, which can be in different worksheets, to create a new table (for example, by joining). Use data operations to implement an algorithm that you design for deriving meaning from your data. With Xcalar's True Data in Place architecture, data is imported and analyzed without any ETL (Extract, Transform, and Load).

NOTE: All tables you interact with in Xcalar Design during the modeling phase are virtual tables. The tables are virtual because they are a concise representation of the operators, operands (the input and output for each operation), and the resulting schema. Only when you export a virtual table from Xcalar Design to an export target is a table projected and materialized. Product documentation and Xcalar Design use the term table, when modeling is described, to mean a virtual table.
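The idea of a virtual table can be illustrated with a small Python sketch. This is an analogy only and does not reflect how Xcalar is implemented. The class VirtualTable and its methods where, derive, and materialize are hypothetical names. The object records operations without touching the data, and rows are produced only when materialize is called, which loosely corresponds to exporting a table.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

Row = dict


@dataclass(frozen=True)
class VirtualTable:
    """Hypothetical stand-in for a virtual table: a source plus recorded operations."""
    source: Callable[[], Iterable[Row]]   # called only when the table is materialized
    operations: tuple = ()                # recorded (kind, function) pairs

    def where(self, predicate: Callable[[Row], bool]) -> "VirtualTable":
        # Record a filter; no data is read here.
        return VirtualTable(self.source, self.operations + (("filter", predicate),))

    def derive(self, transform: Callable[[Row], Row]) -> "VirtualTable":
        # Record a row-by-row transformation; no data is read here.
        return VirtualTable(self.source, self.operations + (("map", transform),))

    def materialize(self) -> list:
        # Replay the recorded operations over the source, in order.
        rows = iter(self.source())
        for kind, fn in self.operations:
            rows = filter(fn, rows) if kind == "filter" else map(fn, rows)
        return list(rows)


def load_flights() -> list:
    # Illustrative sample rows, not real data.
    return [{"carrier": "AA", "delay": 12}, {"carrier": "UA", "delay": 0}]


delayed = VirtualTable(load_flights).where(lambda row: row["delay"] > 0)
print(delayed.materialize())
```

Each call to where or derive returns a new object, so earlier tables remain usable, much as earlier tables in a dataflow remain available after you apply another operation.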

The following diagram illustrates the relationships among data sources, datasets, and tables.

Understanding table names

Each table is created with a name unique in the workbook. Each table name consists of two parts, separated by the # sign. You can create and modify the first but not the second part, which is auto-generated. For example, a worksheet might contain the following tables:

  • airlines#uW120
  • airlines#uW122

You can modify the string before the # sign. The string uW is unique on the cluster, representing the user who created the table. The string is followed by a numerical value. In general, the numerical values indicate the chronological order in which the tables were created. In this example, airlines#uW120 was created before airlines#uW122.
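If you handle table names in your own scripts, the two parts can be separated as sketched below. The pattern assumes that the auto-generated part always consists of letters followed by digits, as in the examples above; the actual format may differ. The function names parse_table_name and creation_order are hypothetical.

```python
import re

# Assumed shape: <editable prefix>#<user letters><number>, for example airlines#uW120.
_TABLE_NAME = re.compile(r"^(?P<prefix>.+)#(?P<user>[A-Za-z]+)(?P<seq>\d+)$")


def parse_table_name(name: str) -> tuple[str, str, int]:
    """Split a table name into its editable prefix, user string, and numerical value."""
    match = _TABLE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognized table name: {name!r}")
    return match["prefix"], match["user"], int(match["seq"])


def creation_order(names: list[str]) -> list[str]:
    """Sort table names by numerical value, which generally reflects creation order."""
    return sorted(names, key=lambda name: parse_table_name(name)[2])


print(parse_table_name("airlines#uW120"))
print(creation_order(["airlines#uW122", "airlines#uW120"]))
```

The numerical value is compared as an integer rather than as text, so a table ending in 99 sorts before one ending in 120.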

Effects of a Xcalar Compute Engine restart on your workbook

After Xcalar Compute Engine is restarted, all workbooks become inactive. This means that you must re-activate a workbook to resume your work.

Re-activating a workbook in this scenario requires that Xcalar Design can access all data sources for creating datasets used in the workbook. The path names to the data sources must be the same as before so that Xcalar can rebuild the datasets in the same way as it did when you imported them the first time.

Data formats supported

Xcalar Design supports CSV, JSON, Excel, XML, and raw text without using a UDF (user-defined function). It supports other formats provided that you have created import UDFs that Xcalar Design can use when importing the data.

NOTE: The CSV format includes not only comma-separated values but also other values separated by alternate delimiters. The default field delimiter is tab and the default record delimiter is newline (\n). You can select the delimiter for separating fields and for separating records.
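To check which delimiters a file uses before you import it, you can parse a small sample yourself. The following sketch uses Python's standard csv module with the same defaults as described above (tab for fields, newline for records). It is independent of Xcalar's own parser, the function name parse_delimited is hypothetical, and it assumes that no field contains the record delimiter.

```python
import csv


def parse_delimited(text: str, field_delimiter: str = "\t",
                    record_delimiter: str = "\n") -> list[list[str]]:
    """Split text into records, then split each record into fields."""
    records = text.split(record_delimiter)
    if records and records[-1] == "":
        records.pop()                       # ignore a trailing record delimiter
    # csv.reader accepts any iterable of strings, one string per record.
    return list(csv.reader(records, delimiter=field_delimiter))


print(parse_delimited("origin\tdest\nSFO\tJFK\n"))
print(parse_delimited("origin|dest;SFO|JFK", field_delimiter="|", record_delimiter=";"))
```

Both calls produce the same two records, which shows that only the delimiters differ between the two samples.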

How Xcalar Design saves your work

Xcalar Design saves the results of all data operations as they are being completed. For example, after you filter a column, the resultant table is saved automatically. You do not have to click Save.

If your work only changes the appearance of a table or a table column, Xcalar Design automatically saves your work in these situations:

  • It saves your work every two minutes. (You can change the time interval in the General Settings window as described in Changing the environment settings for Xcalar Design.)
  • It saves your work when you sign out. However, before this automatic save, if you refresh your browser, your browser displays a message warning about unsaved changes. You must click Save to avoid losing the changes. The following screenshot shows the location of Save.

How to tell if there are unsaved changes

The following indicators show that there are unsaved changes in your workbook:

  • An asterisk displayed next to Save at the bottom of your browser window. If you hover over Save, you can see a tooltip showing the time of the last save.
  • A blue dot on the Xcalar icon in the browser tab.

To manually save changes, click Save as shown in the following screenshot.

EXAMPLE: After you hide the first table column and resize the second column, you can sign out without clicking Save. These changes are automatically saved. However, if you refresh the browser after making these changes instead of signing out, the browser warns that you have unsaved changes. If you proceed with the browser refresh, the table in the refreshed browser does not reflect the hiding of the first column and the new width of the second column.

IMPORTANT: If you use the same browser window to go to another website, you can use the browser's Back button to return to Xcalar Design. This causes the browser to reload the Xcalar Design page, and the result is similar to a browser refresh. This means that your changes are lost. Therefore, always click Save before you use the same browser window to navigate to another website.

Understanding what entities are shared by Xcalar users

Because the Xcalar cluster supports multiple users, be aware that your actions on shared entities might affect others. The following list describes the entities shared by all users:

  • Batch dataflow, including its associated parameters and schedule.
  • Data target.
  • Export target.
  • UDF (user-defined function).

NOTE: The user who creates a dataset can decide whether to share the dataset with all other users. By default, however, datasets are accessible only to the dataset owner.

In addition, memory is shared. The amount of memory used by one user affects the amount of memory available for other users.

The following list describes the entities that are not shared among Xcalar users:

  • Aggregate.
  • Table.
  • Workbook.
  • Worksheet.

In addition, the general settings that control Xcalar Design user interface elements are not shared.

Understanding the license key

To use all features offered by Xcalar Design, a valid license key must be entered during installation. Without a valid license, you cannot perform modeling by using operations such as filtering, finding aggregates, sorting, and joining. You can, however, import and export data.

If you cannot perform operations due to an invalid license, contact your Xcalar administrator as soon as possible. To ensure uninterrupted services, make sure that you have a valid license at all times.

If you are an administrator, you can update the license in the Setup panel of the Monitor. For more information about the Monitor, see Using Setup (Xcalar admin only).

Displaying Xcalar Compute Engine product version and license information

To display Xcalar product versions and the license expiration date, follow these steps:

  1. Click the icon in the upper right corner of the Xcalar Design window to open the drop-down menu.
  2. In the drop-down menu, click About to display a modal window that lists product versions, license expiration date, and copyright information.

    If the License Key Expiration field shows Unlicensed, the cluster does not have a valid license key, and you can perform only a limited number of operations.
