If you're just joining this series, this is part 8 of a tutorial series about building an ARM assembler. I advise reading the previous parts first; otherwise you may not be able to follow along. You can find the full list at [To see links please register here].

Ok, so in the last part, we wrote some helper methods for initialization and conversion. At this point, it's possible to fill our structures with instructions. Now, in this part, we're going to start the amazing fun that is language parsing.

I'm going to split the parsing up into parts. In this one, we're just going to build structures that define the opcodes, conditions, op-lists, registers, etc. We will later use these structures to help us parse the syntax and make sure your code is error-free when you write your first assembly program. This is a lot easier to maintain than hand-writing a parser for every instruction, because we can define the entire language up front and have a single parse function that does it all for us, one that we don't have to modify every time we need to tweak something.

At this point, your project should look like this:
[Image: JYVyWcT.png]

That's good, we want to keep everything organized so that we can find it easily. Let's go ahead and create a file in /Headers called language.h. We'll write all of our code in here and worry about organizing it at the end (since most of this is coming straight from my brain and tbh I have no idea what code we're going to write just yet). Side note: it's a good thing that I didn't plan this as much as I would have in a production environment, because I can guarantee that by the time we finish this, there will be at least 5 good bugs that we will have to fix. This will also give you a good intro to hardware design and reverse engineering, and prevent people from leeching all these parts before it's done.

Ok, so you've created language.h. Code::Blocks (C::B) will auto-generate some file contents, which is good, we want that. If you're using another IDE, it should look like this:

Hidden Content

The first thing I want to define is an enumeration of the types of instructions that exist for this assembler. These will aid in knowing which structure in our union to fill. Remember, we are only supporting 3 distinct instruction classes:

Data Processing
Single Data Transfer
Branch

Alright, stop. You probably failed that quiz (as did I). We already made that enumeration, it's in instruction.h. We're going to go ahead and leave that in place.
So, our next task is to define a structure that holds our registers. This is going to be pretty straightforward, since ARM uses a numbered sequence rather than x86's lettered sequence. Let's go ahead and define that structure:

So, we're going to need a string pointer for the register name, and an integer holding the register's number. Now, if you remember from our earlier parts, the registers are numbered 0-15. This means that we want to cap our data storage at that range, which is exactly 4 bits (a nibble). Although this won't actually make a difference, it's good practice and a good example of self-documenting code to limit this to a nibble. We will also need to add stdint.h to this file, so that we have access to the exact-width integer types.

Hidden Content

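Since the listing above is hidden, here is a minimal sketch of what a register structure along these lines could look like. Only the name struct language_register comes from this post; the member names, the example_registers table, and the find_register helper are my own assumptions for illustration. The sketch uses an unsigned int bit-field, because that is the portable way to get a 4-bit member in C (some compilers also accept a uint8_t bit-field, which is where stdint.h would come in).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per CPU register. Member names are assumptions. */
struct language_register {
    const char *name;        /* textual name, e.g. "R0" */
    unsigned int number : 4; /* 4-bit member: only 0-15 can be stored */
};

/* Hypothetical table covering R0-R15. */
static const struct language_register example_registers[] = {
    {"R0", 0},   {"R1", 1},   {"R2", 2},   {"R3", 3},
    {"R4", 4},   {"R5", 5},   {"R6", 6},   {"R7", 7},
    {"R8", 8},   {"R9", 9},   {"R10", 10}, {"R11", 11},
    {"R12", 12}, {"R13", 13}, {"R14", 14}, {"R15", 15}
};

/* Returns the table entry whose name matches exactly, or NULL. */
const struct language_register *find_register(const char *name)
{
    assert(name != NULL);
    size_t count = sizeof example_registers / sizeof example_registers[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(example_registers[i].name, name) == 0)
            return &example_registers[i];
    }
    return NULL;
}
```

With a table like this, looking up "R15" gives you the entry numbered 15, and an unknown name such as "R16" gives you NULL, which the parser can report as an error.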

Cool. Now, let's define a structure that will hold our language syntax. To do that, we first have to define the individual tokens that will make up our language. This will be an enumeration.
Now, we know at this point that our tokens are these:

mnemonic
register
constant
expression

So, before we really get into what those mean, let's just go ahead and create our enumeration:

Hidden Content

Ok, let me explain those. Half of these are pretty straightforward, so I'll go over those first.
A register is directly mapped to a CPU register; we will use our struct language_register to help parse these.
A constant is literally that: it's a constant value. We don't distinguish between bases in this enumeration (but we will).
A mnemonic is where things get neat. We need a way to parse both opcodes and conditional flags. The combination of these two is the mnemonic (for example, ADDEQ is the opcode ADD plus the condition EQ). We will split these later on.
Finally, an expression. This will be useful for memory accessors, or just plain old "somebody put an explicit calculation in the field". Expressions will either evaluate to another expression, or a constant.
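To make the "opcode plus condition" idea concrete, here is a rough sketch of how a mnemonic could be split. This is not the code from a later part, just one possible approach: the tables, find_in_table, and split_mnemonic are all hypothetical names, the opcode table is deliberately incomplete, and it only handles upper-case input. It first checks whether the whole string is an opcode (so TEQ is not mistaken for T plus EQ), and only then treats the last two characters as a condition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, incomplete tables for illustration only. */
static const char *const example_opcodes[] = {
    "ADD", "SUB", "MOV", "CMP", "TEQ", "LDR", "STR", "B", "BL"
};
static const char *const example_conditions[] = {
    "EQ", "NE", "CS", "CC", "MI", "PL", "VS", "VC",
    "HI", "LS", "GE", "LT", "GT", "LE", "AL"
};

/* Returns the index of text[0..length) in table, or -1 when absent. */
int find_in_table(const char *const *table, size_t count,
                  const char *text, size_t length)
{
    for (size_t i = 0; i < count; i++) {
        if (strlen(table[i]) == length && strncmp(table[i], text, length) == 0)
            return (int)i;
    }
    return -1;
}

/* Splits e.g. "ADDEQ" into table indexes. Returns 1 on success, 0 otherwise. */
int split_mnemonic(const char *mnemonic, int *opcode, int *condition)
{
    assert(mnemonic != NULL && opcode != NULL && condition != NULL);
    size_t n_op = sizeof example_opcodes / sizeof example_opcodes[0];
    size_t n_cond = sizeof example_conditions / sizeof example_conditions[0];
    size_t length = strlen(mnemonic);

    /* Whole string is an opcode: the condition defaults to AL (always). */
    *opcode = find_in_table(example_opcodes, n_op, mnemonic, length);
    if (*opcode >= 0) {
        *condition = find_in_table(example_conditions, n_cond, "AL", 2);
        return 1;
    }
    /* Otherwise the last two characters must be a condition code. */
    if (length > 2) {
        *opcode = find_in_table(example_opcodes, n_op, mnemonic, length - 2);
        *condition = find_in_table(example_conditions, n_cond,
                                   mnemonic + length - 2, 2);
        return *opcode >= 0 && *condition >= 0;
    }
    return 0;
}
```

Checking the full string first is what keeps BL and TEQ intact, while BLE still comes out as B plus LE.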

Ok cool. Now that we've gone over that, let's go ahead and modify our enumeration to hold the derived types as well:

Hidden Content

Sweet. Now, we technically have all of the tools that we need to start building our language structure. Let's go ahead and start that. For this, we're going to have (to start)

The name of the rule (for debug logging)
An array of type enum language_token (to hold our ORDERED token syntax list)

Ok, so the code for that should look like this:
[hide][code]
struct language_rule
{
    char *name;                  /* name of the rule, used for debug logging */
    enum language_token *syntax; /* ORDERED list of tokens that make up the rule */
};
[/code][/hide]
Now, since we're dealing with arrays, we will need some form of terminator, like the NUL at the end of a C string, so that we know where the token list ends. Right now, we don't have anything like that in our enumeration, so let's go ahead and add it. When you're done, the entire file should look like this:

Hidden Content

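The full file is hidden above, so here is a small sketch of the terminator idea on its own. The enumerator names kTOKEN_END, kTOKEN_MNEMONIC, kTOKEN_REGISTER, and kTOKEN_CONSTANT, as well as the syntax_length helper and the example data, are assumptions on my part (only the kTOKEN_ prefix and kTOKEN_EXPRESSION show up in this post). Giving the terminator the value 0 is a design choice in this sketch: a zero-initialized array is then already terminated.

```c
#include <assert.h>
#include <stddef.h>

enum language_token {
    kTOKEN_END = 0, /* hypothetical terminator: marks the end of a syntax list */
    kTOKEN_MNEMONIC,
    kTOKEN_REGISTER,
    kTOKEN_CONSTANT,
    kTOKEN_EXPRESSION
};

struct language_rule {
    const char *name;                  /* rule name, for debug logging */
    const enum language_token *syntax; /* ordered tokens, ends with kTOKEN_END */
};

/* Counts tokens before the terminator, the same way strlen stops at NUL. */
size_t syntax_length(const enum language_token *syntax)
{
    assert(syntax != NULL);
    size_t length = 0;
    while (syntax[length] != kTOKEN_END)
        length++;
    return length;
}

/* Example data only, not a real rule from the series. */
static const enum language_token example_syntax[] = {
    kTOKEN_MNEMONIC, kTOKEN_REGISTER, kTOKEN_CONSTANT, kTOKEN_END
};
static const struct language_rule example_rule = { "example", example_syntax };
```

Because the list carries its own end marker, the rule structure doesn't need a separate length member.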

Ok. So now we can make up some of the basic syntax, but we're still at a loss if the line should contain specific characters. To handle that, we're going to have to define another enumeration value for a hard-coded constant character. That one's pretty simple, just add

Hidden Content

into your list somewhere. I'm going to put it above the basic tokens list, since these are something we're going to define.

Awesome. Now we come to the third thing that our structure needs. When we define our array of tokens, we're going to have a bunch of kTOKEN_CHARACTER entries in there indicating where we need spaces, commas, brackets, etc. But as it stands right now, we have no way of defining what those characters are. Since they will be used to determine how far we read when searching for a token, we're going to also need a character array to set those. Let's add that to our structure.

Hidden Content

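Here is a rough sketch of how the character array could pair up with the kTOKEN_CHARACTER entries. I'm assuming the simplest scheme: the nth kTOKEN_CHARACTER in the syntax list uses the nth character in the array. The member name characters, the kTOKEN_END terminator, the expected_character helper, and the example rule are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum language_token {
    kTOKEN_END = 0, /* hypothetical terminator */
    kTOKEN_CHARACTER,
    kTOKEN_MNEMONIC,
    kTOKEN_REGISTER,
    kTOKEN_CONSTANT,
    kTOKEN_EXPRESSION
};

struct language_rule {
    const char *name;
    const enum language_token *syntax;
    const char *characters; /* one char per kTOKEN_CHARACTER, in order */
};

/* Returns the literal character expected at syntax[index], or '\0' when
   that slot is not a kTOKEN_CHARACTER (or is past the end of the list). */
char expected_character(const struct language_rule *rule, size_t index)
{
    assert(rule != NULL);
    size_t seen = 0; /* kTOKEN_CHARACTER entries found before index */
    for (size_t i = 0; rule->syntax[i] != kTOKEN_END; i++) {
        if (i == index)
            return rule->syntax[i] == kTOKEN_CHARACTER ? rule->characters[seen]
                                                       : '\0';
        if (rule->syntax[i] == kTOKEN_CHARACTER)
            seen++;
    }
    return '\0';
}

/* Example data: mnemonic, space, register, comma, register. */
static const enum language_token example_syntax[] = {
    kTOKEN_MNEMONIC, kTOKEN_CHARACTER, kTOKEN_REGISTER,
    kTOKEN_CHARACTER, kTOKEN_REGISTER, kTOKEN_END
};
static const struct language_rule example_rule = {
    "example", example_syntax, " ,"
};
```

The upside of this layout is that the token list stays a plain array of enum values; the downside is that the two arrays have to be kept in sync by hand.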

Ok, so now we need the fourth thing. And this is where it gets a little tricky: we need to add a member to our structure that defines the type of instruction this rule references. As you remember, that enumeration is defined in instruction.h. This poses a bit of a design problem, because it IS used in that file. We can either leave it in place and include instruction.h in our file, or move it to our file and include our file in instruction.h. For now, we're going to leave it in place, although we may need to move it later on. Let's go ahead and include instruction.h, and then add an entry in our structure for that. Afterwards, your file will look as follows:

Hidden Content

At this point, we also need to define what mnemonics we allow for this rule. This means both opcode and condition. We can do this one of two ways.
1. We can write one rule for each mnemonic
2. We can write one rule for a group of mnemonics
Just to keep this code shorter, we're going to go with option 2. This means that we need an enumeration that is properly segmented, to define all of our opcodes. Let's name that enum language_opcode and give it a prefix of kOPCODE_.
Go ahead and fill out our enum using the previous parts of this series as a guide.
Ok, here's my code. I know I included more opcodes than our guide historically has; I just want to make sure I covered everything I could.

Hidden Content

Now, you may ask the question: "do we have to segment the condition enumeration?" Well, no. Every instruction we support carries a condition field and can use any condition, so we don't need to define allowed conditions at all. Now, let's go ahead and add an INTEGER member to our structure that allows us to define the allowed opcodes. When complete, your file will look like this. (Sorry I keep posting this file, it's just super important that you get this one right.)

Hidden Content

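One way to read "properly segmented" plus "an integer that defines the allowed opcodes" is a set of bit flags, so that is what this sketch assumes. The enum name language_opcode and the kOPCODE_ prefix come from this post; the specific enumerators, their values, the rule_allows helper, and example_arithmetic are my assumptions, and the list is only a small subset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset: each opcode gets its own bit. */
enum language_opcode {
    kOPCODE_AND = 1 << 0,
    kOPCODE_EOR = 1 << 1,
    kOPCODE_SUB = 1 << 2,
    kOPCODE_ADD = 1 << 3,
    kOPCODE_MOV = 1 << 4,
    kOPCODE_LDR = 1 << 5,
    kOPCODE_STR = 1 << 6,
    kOPCODE_B   = 1 << 7,
    kOPCODE_BL  = 1 << 8
};

/* Returns 1 when the opcode's bit is present in the rule's mask, else 0. */
int rule_allows(uint32_t allowed_opcodes, enum language_opcode opcode)
{
    uint32_t bit = (uint32_t)opcode;
    /* An opcode flag must have exactly one bit set. */
    assert(bit != 0 && (bit & (bit - 1)) == 0);
    return (allowed_opcodes & bit) != 0;
}

/* Example mask: a rule that accepts ADD and SUB only. */
static const uint32_t example_arithmetic = kOPCODE_ADD | kOPCODE_SUB;
```

With one bit per opcode, a single integer member can describe any group of opcodes, which is what makes option 2 (one rule per group) cheap to express.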

Cool. Now, just for shits and giggles, I'm going to write a quick rule that would define an ADD instruction:

Hidden Content

Note: DO NOT put this in your project, this is just for reference
Ok, so I'm glad I tested this. I want to reorder the members so that the characters come immediately after the syntax. My new language_rule structure looks like this:

Hidden Content

and my new rule looks like this:

Hidden Content

Now that all sounds pretty straightforward, but it has just occurred to me that we won't be able to sense end of line, and the following is NOT valid:

Hidden Content

Can you tell what's wrong? Probably not. There are two spaces between ADD and R0. We didn't define this, and we forced it to be a space rather than a tab or whatever. Let's go ahead and write our assembler to ignore whitespace. If we do this, we can remove the character token between opcode and register, and of course the space from our list. This would make our EXAMPLE instruction rule look like this:

Hidden Content

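To show what "ignore whitespace" could mean in practice, here is a minimal sketch. The helper names skip_whitespace and match_character are hypothetical, and I'm assuming that only spaces and tabs get skipped, so a newline still marks the end of the line.

```c
#include <assert.h>
#include <stddef.h>

/* Advances past spaces and tabs only; a newline is left in place. */
const char *skip_whitespace(const char *cursor)
{
    assert(cursor != NULL);
    while (*cursor == ' ' || *cursor == '\t')
        cursor++;
    return cursor;
}

/* Skips whitespace, then consumes one expected literal character.
   Returns 1 and advances *cursor on a match; returns 0 and leaves
   *cursor untouched otherwise. */
int match_character(const char **cursor, char expected)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *position = skip_whitespace(*cursor);
    if (*position != expected)
        return 0;
    *cursor = position + 1;
    return 1;
}
```

With helpers like these, "ADD  R0,R1,R2" and "ADD R0 , R1 , R2" parse the same way, and the only characters left in the rule are the ones that actually matter, such as the commas.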

Ok. So with that done, and with you a little bit smarter, we have to define a couple of other things before we end this part (since it's already too long).

First of all, we're going to rename language_register and then add a couple other structures. The code is as follows:

Hidden Content

Ok, now we have a problem. We've used enum language_opcode before we've defined it. It's time to start organizing things. Go ahead and create a folder in /Headers called language, and add a file there called parsing.h. This file is currently blank:

Hidden Content

Go ahead and paste those three structures into this file. Don't worry about includes on this one.

Hidden Content

Now, we don't have to do this part, but I think it will look neater if we do. Let's also make a file language/enumerations.h and of course, paste our enumerations inside that.

Hidden Content

Now, our original language.h is pretty empty. It looks like this:

Hidden Content

but it's missing a few things. We need to include our files that we just created. Order IS important in this.

Hidden Content

Ok. Now we have the basics down. The next step is to define all of the actual language rules. To make this simple, we're going to create a few constants, and place them in language/constants.h. These constants will create logical groups of our opcodes, so that we don't have to make a giant list of them when we do things. I'll just do this for you, since it's pretty straightforward. The entire file should look like this:

Hidden Content

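The real constants.h is hidden, so this is only a guess at the general shape. Assuming the opcodes are bit flags, a group constant can simply OR the related opcodes together. The kGROUP_ names, the opcode subset, and the opcode_group helper are all hypothetical; the three groups mirror the three instruction classes this series supports.

```c
#include <assert.h>

/* Hypothetical subset of the opcode flags, one bit per opcode. */
enum language_opcode {
    kOPCODE_ADD = 1 << 0,
    kOPCODE_SUB = 1 << 1,
    kOPCODE_MOV = 1 << 2,
    kOPCODE_LDR = 1 << 3,
    kOPCODE_STR = 1 << 4,
    kOPCODE_B   = 1 << 5,
    kOPCODE_BL  = 1 << 6
};

/* Hypothetical group constants: each one ORs together related opcodes. */
#define kGROUP_DATA_PROCESSING (kOPCODE_ADD | kOPCODE_SUB | kOPCODE_MOV)
#define kGROUP_DATA_TRANSFER   (kOPCODE_LDR | kOPCODE_STR)
#define kGROUP_BRANCH          (kOPCODE_B | kOPCODE_BL)

/* Returns the group constant that contains opcode, or 0 when none does. */
unsigned int opcode_group(enum language_opcode opcode)
{
    static const unsigned int groups[] = {
        kGROUP_DATA_PROCESSING, kGROUP_DATA_TRANSFER, kGROUP_BRANCH
    };
    assert(opcode != 0);
    for (unsigned int i = 0; i < sizeof groups / sizeof groups[0]; i++) {
        if (groups[i] & (unsigned int)opcode)
            return groups[i];
    }
    return 0;
}
```

A rule can then say "any data processing opcode" with one constant instead of listing every opcode by hand.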

and we will include that in language.h as well. NOTE: we will be adding more of these, this is just the bare minimum.

Finally, before I let you go, I'm going to have you create the basic list of instruction rules. We won't actually write any rules today, but we're going to make the list just so I don't forget how to do it right.
Go ahead and create /Sources/language/rules.c and paste this code in it:

Hidden Content

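The actual contents of rules.c are hidden, so here is a sketch of what a rule list could look like once it has something in it. Everything here is an assumption: the member layout, the all-zero terminating entry, the example rule, and the find_rule helper. I've left out the instruction type member, because the name of that enumeration in instruction.h isn't shown in this part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum language_token {
    kTOKEN_END = 0, /* hypothetical terminator */
    kTOKEN_CHARACTER,
    kTOKEN_MNEMONIC,
    kTOKEN_REGISTER,
    kTOKEN_CONSTANT,
    kTOKEN_EXPRESSION
};

/* Hypothetical subset of opcode flags. */
enum language_opcode {
    kOPCODE_ADD = 1 << 0,
    kOPCODE_SUB = 1 << 1,
    kOPCODE_B   = 1 << 2
};

struct language_rule {
    const char *name;                  /* for debug logging */
    const enum language_token *syntax; /* ordered, ends with kTOKEN_END */
    const char *characters;            /* one char per kTOKEN_CHARACTER */
    uint32_t opcodes;                  /* mask of opcodes this rule accepts */
};

/* mnemonic register , register , register */
static const enum language_token syntax_three_registers[] = {
    kTOKEN_MNEMONIC, kTOKEN_REGISTER, kTOKEN_CHARACTER,
    kTOKEN_REGISTER, kTOKEN_CHARACTER, kTOKEN_REGISTER, kTOKEN_END
};

/* The list ends with an all-zero entry (name == NULL). */
static const struct language_rule example_rules[] = {
    { "three registers", syntax_three_registers, ",,",
      kOPCODE_ADD | kOPCODE_SUB },
    { NULL, NULL, NULL, 0 }
};

/* Returns the first rule whose mask contains opcode, or NULL. */
const struct language_rule *find_rule(const struct language_rule *rules,
                                      enum language_opcode opcode)
{
    assert(rules != NULL);
    for (; rules->name != NULL; rules++) {
        if (rules->opcodes & (uint32_t)opcode)
            return rules;
    }
    return NULL;
}
```

This is the payoff of defining the language as data: supporting a new instruction form means adding one entry to the list, not touching the parse function.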

Ok, let's go ahead and try to build that
[Image: khFucCe.png]

Ok, looks like we forgot a comma here. Good. That will keep you on your toes so that you don't just breeze through this and copy paste all my code. Go ahead and add the missing comma on kTOKEN_EXPRESSION (if you didn't already catch my mistake) and try to build it again

[Image: GtpXl9e.png]

Ok, so this isn't an error, and it's not technically wrong, but we're going to change it anyways just to get rid of that pesky warning.

Hidden Content

Now, rebuild it
[Image: d82TT9O.png]

PERFECT!

Ok, before I let you go, I'll give you the ending project tree screenshot.
[Image: P9KCJnG.png]

Sweet. Don't forget to save all the files, save your project, good luck and see you next time, when we get to write all of the rules, and maybe make some changes to this.

Please don't forget to discuss and ask questions. Think you know how to do it better? Let me know, you're probably right!

Mrswap393