How to read a file line by line and split each line with a custom delimiter in C++

This post summarizes my experience with the C++ file I/O API and the standard library for splitting a string on a delimiter.

I had two tasks. The first was to write I/O code that reads files in the following format: each line is an edge between two nodes, except the first line, which gives the number of vertices and edges.

A good Stack Overflow post, http://stackoverflow.com/questions/236129/split-a-string-in-c , summarizes a few approaches to the problem. This post shares my favorite approach for two cases: (1) reading graph files and (2) reading database files.

Because the first line of a graph file has a different layout from the rest, we handle it separately. In my experience, combining getline with an istringstream iss(line) is very useful. Here we rely on the default delimiter, which is whitespace. Each getline call returns an entire line, and the istringstream lets us extract the fields directly into strings and ints (operator>> does the text-to-number conversion). This is much simpler than the syntax of the other approaches: splitting the line and assigning src and dst happen in a single statement.

An example input file looks like this:

p sp 4 7
a 1 2
a 1 3
a 1 4
a 2 3
a 2 4
a 3 4
a 4 2

The code segment is shown below


if (infile.is_open()) {
    while (getline(infile, line)) {
        istringstream iss(line);
        if (isFirstLine) {
            // a, b are useless information in the line
            string a, b;
            if (!(iss >> a >> b >> numVertices >> numEdges)) {
                cout << "error in parsing line: " << line << '\n';
                break;
            }
            isFirstLine = false;
            adjList = new vector<int>[numVertices]; // vertex started from 1 instead of 0
            // used for debugging I/O code
            cout << "number of vertices: " << numVertices << '\n';
            cout << "number of edges: " << numEdges << '\n';
        } else {
            // a is useless information in the line
            string a;
            int src, dst;
            if (!(iss >> a >> src >> dst)) {
                cout << "error in parsing line: " << line << '\n';
                break;
            }
            src = src - 1;
            dst = dst - 1; // start from 0
            adjList[src].push_back(dst);
            // used for debugging I/O code
            // cout << "src: " << src << '\n';
            // cout << "dst: " << dst << '\n';
        }
    } // end of while loop
    infile.close();

    // for (int i = 0; i < numVertices; i++) { // deal with index starting from both 0 and 1
    //     vector<int> nghList = adjList[i];
    //     cout << "node: " << i << '\n';
    //     for (int j = 0; j < nghList.size(); j++) {
    //         cout << nghList[j] << " ";
    //     }
    //     cout << '\n';
    // }
} else cout << "Unable to open file";
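The segment above depends on variables declared elsewhere (infile, line, isFirstLine, adjList and so on). As a rough, self-contained sketch of the same getline plus istringstream idea, the function below reads from any std::istream and returns the adjacency list. The name readGraph, the use of a vector of vectors instead of new[], and the range check on vertex IDs are my own assumptions for illustration, not part of the original code.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses a "p sp <numVertices> <numEdges>" header
// followed by "a <src> <dst>" edge lines with 1-based vertex IDs.
// Parsing stops at the first line that does not match the expected layout.
std::vector<std::vector<int>> readGraph(std::istream& in) {
    std::vector<std::vector<int>> adjList;
    std::string line;
    bool isFirstLine = true;
    while (std::getline(in, line)) {
        if (line.empty()) continue;          // skip completely empty lines
        std::istringstream iss(line);
        std::string tag;                     // leading "p" or "a", not used
        if (isFirstLine) {
            std::string kind;                // "sp", not used
            int numVertices = 0, numEdges = 0;
            if (!(iss >> tag >> kind >> numVertices >> numEdges) || numVertices < 0) break;
            adjList.resize(numVertices);
            isFirstLine = false;
        } else {
            int src = 0, dst = 0;
            if (!(iss >> tag >> src >> dst)) break;
            const int n = static_cast<int>(adjList.size());
            if (src < 1 || dst < 1 || src > n || dst > n) continue;  // ignore out-of-range IDs
            adjList[src - 1].push_back(dst - 1);                      // store 0-based indices
        }
    }
    return adjList;
}
```

Taking a std::istream& means the same function works with an std::ifstream for real files and with an std::istringstream for quick tests on the example input above.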

 

Another approach is to apply operator>> directly to the ifstream, without calling getline first.

Here is a code snippet from GAPBS, a graph benchmark suite by Scott Beamer

https://github.com/sbeamer/gapbs

std::ifstream file(filename_);
if (!file.is_open()) {
  std::cout << "Couldn't open file " << filename_ << std::endl;
  std::exit(-2);
}
EdgeList ReadInEL(std::ifstream &in) {
  EdgeList el;
  NodeID_ u, v;
  while (in >> u >> v) {
    el.push_back(Edge(u, v));
  }
  return el;
}
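The GAPBS snippet relies on types defined in that project (EdgeList, Edge, NodeID_). To try the same pattern in isolation, here is a minimal sketch that uses std::pair<int, int> as a stand-in edge type; the alias Edge and the function name readEdgeList are assumptions of mine and are not the GAPBS definitions.

```cpp
#include <istream>
#include <sstream>   // only needed for in-memory input such as std::istringstream
#include <utility>
#include <vector>

// Stand-in edge type for illustration; GAPBS defines its own templated types.
using Edge = std::pair<int, int>;

// Reads whitespace-separated integer pairs until extraction fails
// (end of input or a token that is not an integer).
std::vector<Edge> readEdgeList(std::istream& in) {
    std::vector<Edge> edges;
    int u = 0, v = 0;
    while (in >> u >> v) {
        edges.emplace_back(u, v);
    }
    return edges;
}
```

One thing to keep in mind with this style: operator>> treats newlines like any other whitespace, so the line structure of the file is not checked. Two edges written on one line, or one edge spread over two lines, are read the same way.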

 

A second scenario is reading database .tbl files (for TPC-H queries, every table is stored as a .tbl file). Essentially we want a C++ equivalent of Python's string.split('|') or Java's string.split("\\|"). In this format, each line of the .tbl file is a row of the database table. The delimiter is usually '|' rather than whitespace because the content of a column may itself contain whitespace. This is more challenging than the first scenario for two reasons:

(1) We have to handle a custom delimiter.

(2) Each line has a large number of fields, which makes the one-line syntax “iss >> a >> src >> dst” undesirable. If one field fails to parse, then the extraction for the entire line fails.

To deal with these issues, note that getline accepts an optional delimiter argument, per the documentation at http://www.cplusplus.com/reference/string/string/getline/ . The catch is that when it is called this way, it no longer reads a line: it stops at the first occurrence of the delimiter and treats the newline as an ordinary character. Even with this complication, it is still the best API I have come across for splitting strings.
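To make that catch concrete, here is a small sketch. The helper name readUpTo is hypothetical; it only wraps a single std::getline call with a custom delimiter so the behavior is easy to observe.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns whatever std::getline extracts from the
// stream up to (and not including) the next occurrence of delim.
std::string readUpTo(std::istream& in, char delim) {
    std::string field;
    std::getline(in, field, delim);
    return field;
}
```

Called repeatedly with '|' on a stream holding the two lines "1|155190" and "2|67310", this yields "1", then "155190\n2", then "67310\n". The second field runs across the line break, which is why applying the delimiter version of getline directly to the file stream does not work for row-oriented files.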

Following the example in the Stack Overflow post, I wrote a split function that returns a vector of strings (I basically combined the two functions given in that post):

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

In the overall code, we first call getline without a delimiter to read the whole line. Then we split each line with the function above and assign the pieces to the appropriate fields. The I/O code segment looks like this:

if (infile.is_open()) {
    while (getline(infile, line)) {
        std::vector<std::string> columns = split(line, '|');
        if (columns.size() < 13) { // guard against rows with missing columns
            std::cout << "error in parsing line: " << line << '\n';
            continue;
        }
        LineItem lineitem{}; // value object, so nothing needs to be deleted
        lineitem.l_orderkey = atoi(columns[0].c_str());
        lineitem.l_partkey = atoi(columns[1].c_str());
        lineitem.l_suppkey = atoi(columns[2].c_str());
        lineitem.l_linenumber = atoi(columns[3].c_str());
        lineitem.l_quantity = convertDecimal(columns[4]);
        lineitem.l_extendedprice = convertDecimal(columns[5]);
        lineitem.l_discount = convertDecimal(columns[6]);
        lineitem.l_tax = convertDecimal(columns[7]);
        lineitem.l_returnflag = *columns[8].c_str();
        lineitem.l_linestatus = *columns[9].c_str();
        lineitem.l_shipdate = convertDate(columns[10]);
        lineitem.l_commitdate = convertDate(columns[11]);
        lineitem.l_receiptdate = convertDate(columns[12]);
        lineitems.push_back(lineitem);
    }
    infile.close();
} else std::cout << "Unable to open file";
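
The LineItem type and the convertDecimal and convertDate helpers are not shown in this post, so the segment above cannot be compiled on its own. As a self-contained sketch of the same two-step idea (getline for the row, split for the columns), the code below parses only the four leading integer columns. The LineKeys struct and readLineKeys function are hypothetical names introduced for illustration, and using std::stoi with exception handling instead of atoi is my own substitution.

```cpp
#include <exception>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Same split function as above, repeated so this sketch compiles by itself.
std::vector<std::string> split(const std::string& s, char delim) {
    std::vector<std::string> elems;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

// Hypothetical record: only the four leading integer columns of a row.
struct LineKeys {
    int orderkey;
    int partkey;
    int suppkey;
    int linenumber;
};

// Reads '|'-delimited rows; rows with too few columns or with a
// non-numeric value in the first four columns are skipped.
std::vector<LineKeys> readLineKeys(std::istream& in) {
    std::vector<LineKeys> rows;
    std::string line;
    while (std::getline(in, line)) {
        std::vector<std::string> columns = split(line, '|');
        if (columns.size() < 4) continue;
        try {
            rows.push_back({std::stoi(columns[0]), std::stoi(columns[1]),
                            std::stoi(columns[2]), std::stoi(columns[3])});
        } catch (const std::exception&) {
            // std::stoi throws std::invalid_argument or std::out_of_range
            continue;
        }
    }
    return rows;
}
```

Two behaviors are worth knowing. With this getline-based split, a trailing delimiter does not produce an extra empty element, so "a|b|" gives two items, while "a||b" gives three with an empty string in the middle. Also, std::stoi accepts a numeric prefix, so a value like "12abc" is read as 12 rather than rejected.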