Parsing Log Entries
Our goal for this chapter is to parse the following log data into structured values. This is taken from a nice tutorial over at FP Complete.
val logData =
"""|2013-06-29 11:16:23 124.67.34.60 keyboard
|2013-06-29 11:32:12 212.141.23.67 mouse
|2013-06-29 11:33:08 212.141.23.67 monitor
|2013-06-29 12:12:34 125.80.32.31 speakers
|2013-06-29 12:51:50 101.40.50.62 keyboard
|2013-06-29 13:10:45 103.29.60.13 mouse
|""".stripMargin
We’ll use the same imports as before.
import atto._, Atto._
import cats.implicits._
This data contains IP addresses, which in turn contain unsigned bytes. So our first order of business is figuring out how to parse these.
Parsing Unsigned Bytes and IP Addresses
Scala has no unsigned byte type, so the first thing we’ll do is create a data type wrapping a signed byte, and then write a parser for it.
case class UByte(toByte: Byte) {
  override def toString: String = (toByte.toInt & 0xFF).toString
}
val ubyte: Parser[UByte] = {
  int.filter(n => n >= 0 && n < 256) // ensure value is in [0 .. 256)
    .map(n => UByte(n.toByte))       // construct our UByte
    .namedOpaque("UByte")            // give our parser a name
}
It works!
scala> ubyte.parseOnly("foo")
res0: atto.ParseResult[UByte] = Fail(foo,List(),Failure reading:UByte)
scala> ubyte.parseOnly("-42")
res1: atto.ParseResult[UByte] = Fail(,List(),Failure reading:UByte)
scala> ubyte.parseOnly("255") // ok!
res2: atto.ParseResult[UByte] = Done(,255)
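To see why toString masks with 0xFF, it helps to look at what toByte does to values above 127. The following is a minimal sketch using only the standard library; the object UByteDemo and its helper names are illustrative and not part of atto.

```scala
object UByteDemo {
  // Narrowing an Int in [0, 256) to a Byte keeps the low 8 bits, so values
  // above 127 read back as negative numbers when treated as a signed Byte.
  def toSigned(n: Int): Byte = n.toByte

  // Masking with 0xFF reinterprets those same 8 bits as an unsigned value.
  def toUnsigned(b: Byte): Int = b.toInt & 0xFF
}
```

So UByte(255.toByte) stores -1 internally, and the mask is what lets it print as 255.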
We can now define our IP data type and a parser for it. As a first pass we can parse an IP address in the form 128.42.30.1 by using the ubyte and char parsers directly, in a for comprehension.
case class IP(a: UByte, b: UByte, c: UByte, d: UByte)
val ip: Parser[IP] =
  for {
    a <- ubyte
    _ <- char('.')
    b <- ubyte
    _ <- char('.')
    c <- ubyte
    _ <- char('.')
    d <- ubyte
  } yield IP(a, b, c, d)
It works!
scala> ip parseOnly "foo.bar"
res3: atto.ParseResult[IP] = Fail(foo.bar,List(),Failure reading:UByte)
scala> ip parseOnly "128.42.42.1"
res4: atto.ParseResult[IP] = Done(,IP(128,42,42,1))
scala> ip.parseOnly("128.42.42.1").option
res5: Option[IP] = Some(IP(128,42,42,1))
The <~ and ~> combinators combine two parsers sequentially, discarding the value produced by the parser on the ~ side. We can factor out the dot and use <~ to simplify our comprehension a bit.
val dot: Parser[Char] = char('.')
val ip: Parser[IP] =
  for {
    a <- ubyte <~ dot
    b <- ubyte <~ dot
    c <- ubyte <~ dot
    d <- ubyte
  } yield IP(a, b, c, d)
And it still works.
scala> ip.parseOnly("128.42.42.1").option
res6: Option[IP] = Some(IP(128,42,42,1))
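As a small illustration of which side each combinator keeps, here is a sketch built only from the char and int parsers already used in this chapter. The object DiscardDemo and the parser names in it are made up for this example.

```scala
import atto._, Atto._

object DiscardDemo {
  // ~> keeps the result of the right-hand parser and drops the left one.
  val afterDot: Parser[Int] = char('.') ~> int

  // <~ keeps the result of the left-hand parser and drops the right one.
  val beforeDot: Parser[Int] = int <~ char('.')

  // The two can be chained: only the Int between the brackets is kept.
  val bracketed: Parser[Int] = char('[') ~> int <~ char(']')
}
```

In each case the arrow points at the parser whose value survives, which is one way to remember the direction.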
We can name our parser, which provides slightly more enlightening failure messages:
val ip2 = ip named "ip-address"
val ip3 = ip namedOpaque "ip-address" // difference is illustrated below
Thus.
scala> ip2 parseOnly "foo.bar"
res7: atto.ParseResult[IP] = Fail(foo.bar,List(ip-address),Failure reading:UByte)
scala> ip3 parseOnly "foo.bar"
res8: atto.ParseResult[IP] = Fail(foo.bar,List(),Failure reading:ip-address)
Since none of the names bound on the left-hand side of a <- is used on the right-hand side of a later one, no step depends on an earlier result. So we don’t actually need a monad; we can use applicative syntax here.
val ubyteDot = ubyte <~ dot // why not?
val ip4 = (ubyteDot, ubyteDot, ubyteDot, ubyte).mapN(IP.apply) named "ip-address"
And it still works.
scala> ip4.parseOnly("128.42.42.1").option
res9: Option[IP] = Some(IP(128,42,42,1))
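For contrast, here is a sketch of a parser that could not be written in the applicative style, because a later step uses a value parsed earlier. The length-prefixed format ("3:abc") and the names DependentDemo and lengthPrefixed are hypothetical and chosen only for illustration.

```scala
import atto._, Atto._

object DependentDemo {
  // The number parsed first decides how many letters the second step reads,
  // so the second parser depends on a value bound earlier: this needs flatMap,
  // which is what the for comprehension desugars to.
  val lengthPrefixed: Parser[String] =
    for {
      n  <- int <~ char(':')
      cs <- count(n, letter)
    } yield cs.mkString
}
```

The IP parser has no such dependency: each ubyte is parsed the same way regardless of the bytes before it, which is why mapN is enough there.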
We might prefer to get some information about failure, so either is an, um, option.
scala> ip4.parseOnly("abc.42.42.1").either
res10: Either[String,IP] = Left(Failure reading:UByte)
scala> ip4.parseOnly("128.42.42.1").either
res11: Either[String,IP] = Right(IP(128,42,42,1))
Ok, so we can parse IP addresses now. Let’s move on to the log.
Parsing Log Entries
Here are our log entries, defined in logData above.
2013-06-29 11:16:23 124.67.34.60 keyboard
2013-06-29 11:32:12 212.141.23.67 mouse
2013-06-29 11:33:08 212.141.23.67 monitor
2013-06-29 12:12:34 125.80.32.31 speakers
2013-06-29 12:51:50 101.40.50.62 keyboard
2013-06-29 13:10:45 103.29.60.13 mouse
And some data types for the parsed data.
case class Date(year: Int, month: Int, day: Int)
case class Time(hour: Int, minutes: Int, seconds: Int)
case class DateTime(date: Date, time: Time)
sealed trait Product // Products are an enumerated type
case object Mouse extends Product
case object Keyboard extends Product
case object Monitor extends Product
case object Speakers extends Product
case class LogEntry(entryTime: DateTime, entryIP: IP, entryProduct: Product)
type Log = List[LogEntry]
There’s no built-in parser for fixed-width ints, so we can just make one. We parse exactly n digits and convert them to an Int, handling the case where the value is too large by flatmapping to ok or err.
def fixed(n: Int): Parser[Int] =
  count(n, digit).map(_.mkString).flatMap { s =>
    try ok(s.toInt) catch { case e: NumberFormatException => err(e.toString) }
  }
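The try/catch matters because a run of digits is not always a valid Int. The sketch below shows the same conversion outside of any parser, using scala.util.Try in place of an explicit catch (note that Try captures any non-fatal exception, not just NumberFormatException). The names FixedDemo and digitsToInt are illustrative.

```scala
import scala.util.Try

object FixedDemo {
  // Strings of digits whose value exceeds Int.MaxValue (2147483647) make
  // toInt throw a NumberFormatException; Try turns that into a Left here.
  def digitsToInt(s: String): Either[String, Int] =
    Try(s.toInt).toEither.left.map(_.toString)
}
```

With the widths used in this chapter (2 and 4 digits) the failure branch is never hit, but fixed(10) or wider could overflow, which is the case the parser guards against.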
Now we have what we need to put the log parser together.
val date: Parser[Date] =
  (fixed(4) <~ char('-'), fixed(2) <~ char('-'), fixed(2)).mapN(Date.apply)
val time: Parser[Time] =
  (fixed(2) <~ char(':'), fixed(2) <~ char(':'), fixed(2)).mapN(Time.apply)
val dateTime: Parser[DateTime] =
  (date <~ char(' '), time).mapN(DateTime.apply)
val product: Parser[Product] = {
  string("keyboard").map(_ => Keyboard : Product) |
  string("mouse")   .map(_ => Mouse : Product) |
  string("monitor") .map(_ => Monitor : Product) |
  string("speakers").map(_ => Speakers : Product)
}
val logEntry: Parser[LogEntry] =
  (dateTime <~ char(' '), ip <~ char(' '), product).mapN(LogEntry.apply)
val log: Parser[Log] =
  sepBy(logEntry, char('\n'))
It works!
scala> (log parseOnly logData).option.foldMap(_.mkString("\n"))
res13: String =
LogEntry(DateTime(Date(2013,6,29),Time(11,16,23)),IP(124,67,34,60),Keyboard)
LogEntry(DateTime(Date(2013,6,29),Time(11,32,12)),IP(212,141,23,67),Mouse)
LogEntry(DateTime(Date(2013,6,29),Time(11,33,8)),IP(212,141,23,67),Monitor)
LogEntry(DateTime(Date(2013,6,29),Time(12,12,34)),IP(125,80,32,31),Speakers)
LogEntry(DateTime(Date(2013,6,29),Time(12,51,50)),IP(101,40,50,62),Keyboard)
LogEntry(DateTime(Date(2013,6,29),Time(13,10,45)),IP(103,29,60,13),Mouse)