GSoC 2018: A Format-Preserving List Parser and the Madness
It’s been a week since the coding phase began. I’m doing my first task as an MVP: figuring out how to deal with the format-preserving parser, specifically a format-preserving List parser. In this blog post, I would love to share my experience from the design phase through the implementation phase, as well as the madness of this parser. Let’s be clear: this one is not finished yet. At this point I can tell that I need to think about another approach. I also think my steps were not smart to begin with. Let’s go to the main topic!
The Design Phase
Actually, I designed the lexical analyzer for this parser about a month ago. However, I got blocked by college life, such as final projects and final exams. My final exam ended on May 17th, 2018. I re-learned some necessary material from Language and Automata Theory, such as Deterministic Finite Automata (DFA), and then tried to draw the DFA. Yep, I tried to draw several of them and thought about how to compose (concatenate) them.
In this phase I was also thinking about whether to use the parser generator or the parser combinator approach. Parser combinators could come in handy, but there would be a problem when dealing with the transformer function, which transforms a parsed List back into its original form. That’s why I went with the parser generator approach. I need to think of something clever for this one, like how the data structure should look.
The Implementation Phase
The Lexical Analyzer
Building a lexical analyzer (lexer) is not hard at all. All we need is to define the Token data structure and a function that transforms a List of Char into a List of Token. Actually, we could use Alex for this, but I decided to build my own since it would be easier for me to maintain. The easiest way to start building a lexer is to write several valid samples.
[1,2, [3,
  [4]], 5]
As we can see, that’s a List with a nested List as one of its elements. Suppose that our List elements will be either integers or nested Lists, so the implementation won’t be too difficult. As such, the accepted characters for our DFA will be
[],0123456789\n\s. Then, we define the token names. Naming tokens is kinda hard when you are trying to think of good names.
[ is OpenSquareBracket
] is CloseSquareBracket
0 through 9 (a run of digits) is Literal
, is Comma
\n is EOL
\s is Whitespace
EOF as end-of-file flag
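For illustration, here is a minimal sketch of what such a hand-rolled lexer could look like in Haskell. The token names follow the list above; everything else (the function name, the digit-run handling) is my own guess, not the actual GSoC code.

```haskell
-- Hypothetical sketch of the Token type and a hand-rolled lexer,
-- based on the token names listed above.
data Token
  = OpenSquareBracket
  | CloseSquareBracket
  | Literal String   -- a run of digits
  | Comma
  | EOL
  | Whitespace Int   -- number of consecutive spaces
  | EOF
  deriving (Show, Eq)

lexer :: String -> [Token]
lexer []          = [EOF]
lexer ('[' : cs)  = OpenSquareBracket  : lexer cs
lexer (']' : cs)  = CloseSquareBracket : lexer cs
lexer (',' : cs)  = Comma              : lexer cs
lexer ('\n' : cs) = EOL                : lexer cs
lexer s@(' ' : _) =
  let (spaces, rest) = span (== ' ') s
  in  Whitespace (length spaces) : lexer rest
lexer s@(c : _)
  | c >= '0' && c <= '9' =
      let (digits, rest) = span (\d -> d >= '0' && d <= '9') s
      in  Literal digits : lexer rest
  | otherwise = error ("unexpected character: " ++ [c])
```

Note that runs of spaces and runs of digits are each collapsed into a single token that remembers its content, which is exactly the information a format-preserving round trip needs.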
Well, my so-called lexer is just a transformer from a List of Char to a List of Token. The result will then be fed to the parse function, which transforms a List of Token into a List data structure, using the DFA as the “checker” of whether the input is accepted or not. The lexer’s result will look like this.
[OpenSquareBracket, Literal "1", Comma, Literal "2", Comma, Whitespace 1, OpenSquareBracket, Literal "3", Comma, EOL, Whitespace 2, OpenSquareBracket, Literal "4", CloseSquareBracket, CloseSquareBracket, Comma, Whitespace 1, Literal "5", CloseSquareBracket, EOF]
The Parser
To put it simply, the parser just transforms a List of Token into a List data structure. It runs the DFA from the start state to the end (final/accepting) state. I believed that by using the DFA we wouldn’t encounter any problems. I was wrong! Well, I forgot that this List could be a nested List, so its elements have a “depth”. Say we want to parse
[1,2, [3,
  [4]], 5]
The result will be
List (Cons 1, Cons 2, List (Cons 3, List (Cons 4, Cons 5))) instead of
List (Cons 1, Cons 2, List (Cons 3, List (Cons 4)), Cons 5).
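To make the depth problem concrete, here is a toy sketch (not the actual parser) of how recursion can carry the nesting depth that a single flat pass over the token stream cannot: each OpenSquareBracket descends one level, and the matching CloseSquareBracket returns from it. The Nested type and all the names here are hypothetical.

```haskell
-- Minimal token alphabet, repeated here so the sketch is self-contained.
data Token
  = OpenSquareBracket | CloseSquareBracket
  | Literal String | Comma | EOL | Whitespace Int | EOF
  deriving (Show, Eq)

-- A toy nested-list value, just to illustrate the depth problem.
data Nested = Leaf String | Branch [Nested] deriving (Show, Eq)

-- Returns the elements parsed at the current depth plus the leftover tokens.
parseNested :: [Token] -> ([Nested], [Token])
parseNested (CloseSquareBracket : rest) = ([], rest)   -- close current level
parseNested (OpenSquareBracket : rest) =
  let (inner, rest')     = parseNested rest            -- descend one level
      (siblings, rest'') = parseNested rest'           -- continue at this level
  in  (Branch inner : siblings, rest'')
parseNested (Literal s : rest) =
  let (siblings, rest') = parseNested rest
  in  (Leaf s : siblings, rest')
parseNested (_ : rest) = parseNested rest              -- Comma/EOL/Whitespace/EOF
parseNested []         = ([], [])
```

With this shape, the tokens of [1,2, [3, [4]], 5] come out as Branch [Leaf "1", Leaf "2", Branch [Leaf "3", Branch [Leaf "4"]], Leaf "5"], i.e. the 5 stays a sibling of the nested list instead of being swallowed by it.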
We could just fix the parser, actually. However, I had already partially implemented the “transformer”, which was stupid: I should have made sure all test cases passed first. That’s why I mentioned “madness” before. It became extremely hard because I kept getting confused about whether I should fix the parser first or finish the “transformer” function. What a stupid way to go. I will reflect on this so I won’t make the same mistake in the future.
The Transformer
This is the hardest part of all. Just thinking about “how should I store the information needed to transform the parsed List back into its original form?” will hurt your brain. Moreover, what happens when you manipulate the List? How do you transform it back then? Let’s take a look at my approach.
data List = Cons List List [Token]
          | Element String
          | List List [Token]
          | Nil [Token]
After thinking about it, I decided to add another piece of information to my List data structure. Yes, it is the
[Token] field. The idea is to store all the necessary formatting information, such as
Whitespace tokens. It is much easier to figure out with an example. Suppose there is a list that looks like the one below.
[ 1
, [2]
, [3,  4]
,     5
]
Could you imagine what the tokens look like?
OpenSquareBracket, Whitespace (1), Literal "1", EOL,
Comma, Whitespace (1), OpenSquareBracket, Literal "2", CloseSquareBracket, EOL,
Comma, Whitespace (1), OpenSquareBracket, Literal "3", Comma, Whitespace (2), Literal "4", CloseSquareBracket, EOL,
Comma, Whitespace (5), Literal "5", EOL,
CloseSquareBracket, EOF
Then, could you imagine what the List looks like?
List (Cons 1, List (Cons 2), List (Cons 3, Cons 4), Cons 5)
How is it actually stored?
List (Cons 1 [Whitespace (1)], List (Cons 2) [EOL, Comma, Whitespace (1)], List (Cons 3, Cons 4 [Comma, Whitespace (2)]) [EOL, EOL, Comma, Whitespace (1)], Cons 5 [EOL, Comma, Whitespace (5)]) [EOL, EOL]
To put it simply, each node of the List stores the tokens that come before it. Say we have
, 4]: before
Literal "4" there is a
Whitespace (1). Yep, we store that information so we can transform the List back into its original form. A
Cons node stores only the tokens before its element, while a
List node stores the tokens before its
[ token and before its
] token. Say we have
[1,2 ]; it will look like
List (Cons "1", Cons "2" [Comma]) [Whitespace (1)]. Too bad that I messed up the parser while I was on the way to implementing this.
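As a sanity check on the whole idea, here is a hypothetical sketch of the “transform back” direction: if every character of the input survives as some token, printing the tokens back out recovers the original text exactly. The names are my own, not the project’s.

```haskell
-- Same token alphabet as above, repeated so the sketch is self-contained.
data Token
  = OpenSquareBracket | CloseSquareBracket
  | Literal String | Comma | EOL | Whitespace Int | EOF
  deriving (Show, Eq)

-- Each token knows exactly the text it came from, so rendering is trivial.
renderToken :: Token -> String
renderToken OpenSquareBracket  = "["
renderToken CloseSquareBracket = "]"
renderToken (Literal s)        = s
renderToken Comma              = ","
renderToken EOL                = "\n"
renderToken (Whitespace n)     = replicate n ' '
renderToken EOF                = ""

-- Format preservation means: render (tokens of s) == s for valid input.
render :: [Token] -> String
render = concatMap renderToken
```

For example, rendering the tokens of [1,2 ] gives back the string "[1,2 ]" with the odd trailing space intact, which is the whole point of dragging [Token] fields through the List structure.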
The Next Plan
I have three weeks left before the first evaluation begins. My plan is to design the YAML lexical analyzer. I will also think through several approaches. I will make sure to write the tests this time so I won’t mess up my work again. This is indeed hard, but it is exciting as well.