Big Data Engineer Resume
Reston, VA
SUMMARY
- Research and present potential solutions for the current EDS platform related to data integration, visualization, and reporting.
- Experience converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (a brief sketch appears at the end of this summary).
- Able to work with the team and cross-functionally to research and design solutions that speed up or enhance delivery within the current platform.
- Able to design and document the technology infrastructure for all pre-production environments and partner with Technology Operations on the design of production implementations.
- Ability to conceptualize innovative data models for complex products, and create design patterns.
- Fluent in the architecture and engineering of Hadoop ecosystems on Cloudera, Hortonworks, MapR, Amazon AWS, and Azure.
- Hands-on experience coding MapReduce/YARN programs in Java and Scala for analyzing big data.
- Skilled in writing MapReduce jobs and in MapReduce-generating tools such as Pig and Hive.
- Worked with Apache Spark as a fast engine for large-scale data processing, integrated with the functional programming languages Scala and Python and with scripting in HiveQL and Pig Latin.
- Apply expert skills across a number of platforms and tools while working with multiple teams in high-visibility roles.
- Provide end-to-end data analytics solutions and support using Hadoop systems and tools on cloud services as well as on-premises nodes.
- Expert in the big data ecosystem, using Hadoop, Spark, and Kafka with column-oriented big data systems such as Vertica and Cassandra.
- Worked with various file formats (delimited text files, clickstream log files, Apache log files, Parquet files, Avro files, JSON files, XML files).
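A minimal sketch of the Hive-to-Spark conversion pattern mentioned above, in Scala. The table and column names (web_logs, status, bytes) are hypothetical examples, and Hive metastore connectivity is assumed; this is an illustration, not production code.

```scala
// Sketch: rewriting a HiveQL aggregation as Spark DataFrame and RDD transformations.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport()          // assumes a Hive metastore is configured
      .getOrCreate()

    // Original HiveQL:
    //   SELECT status, COUNT(*) AS hits, SUM(bytes) AS total_bytes
    //   FROM web_logs GROUP BY status;
    val logs = spark.table("web_logs")

    // Equivalent DataFrame transformations
    val summary = logs
      .groupBy("status")
      .agg(count(lit(1)).as("hits"), sum("bytes").as("total_bytes"))

    // The same aggregation expressed as RDD transformations
    val rddSummary = logs.rdd
      .map(row => (row.getAs[Int]("status"), row.getAs[Long]("bytes")))
      .aggregateByKey((0L, 0L))(
        { case ((hits, total), bytes) => (hits + 1, total + bytes) },
        { case ((h1, t1), (h2, t2)) => (h1 + h2, t1 + t2) })

    summary.show()
    rddSummary.take(10).foreach(println)
    spark.stop()
  }
}
```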
TECHNICAL SKILLS
- AWS - EC2, SQS, S3, Azure, Google Cloud, Horton Labs, Rackspace
- Cloudera Hadoop, Cloudera Impala, Hortonworks Hadoop, MapR, Spark, Spark Streaming, Hive, Kafka, NiFi, Kinesis
- Apache Airflow, Apache Camel, Apache Flink/Stratosphere, Hive, Pig, Sqoop, Flume, Scala, Python, SQL, NoSQL, HDFS, Data Warehouses and Data Lakes
- Apache Cassandra, Datastax Cassandra, Apache HBase, Apache Phoenix, BigSQL, Couchbase, DB2, MariaDB, MongoDB, MS Access, Oracle, RDBMS, SQL, SQL Server, Toad
- ETL Architecture and Creation for Various Use Cases
- Apache Camel, Flume, Apache Kafka, Apatar, Atom, Fivetran, Heka, Logstash, Scriptella, Stitch, Talend, Ketl, Kettle, Jaspersoft, CloverETL
- Talend, Scriptella, KETL, Pentaho Kettle, Jaspersoft, Geokettle, CloverETL, HPCC Systems, Jedox, Apatar
PROFESSIONAL EXPERIENCE
Big Data Engineer
Confidential, Reston, VA
Responsibilities:
- Responsible for understanding the business rules, business logic, and use cases to be implemented.
- Ingested data using a Python Kafka producer and Apache NiFi to send XML strings containing all the transaction data to be transformed, analyzed, and stored in HBase.
- Integrated frameworks to match security rules, such as Drools, with Flink CEP patterns in a single DataStream.
- Worked with parallel tasks to provide high throughput and with Flink windows to hold data in memory and provide low-latency responses.
- Stored and queried data in HBase using Apache Phoenix.
- Started a Flink YARN session to provide enough resources for all elements being processed.
- Integrated the back-end with the front-end to create rule matches dynamically.
- Worked with Flink state, process functions, aggregators, CoProcessFunction, and window functions.
- Used Maven for managing the project lifecycle, SVN for the repository, and Log4j for logging.
- Developed a POC to choose the best way to process and store data while providing low-latency responses.
- Set up Kafka brokers and topics with the proper replication factor and partitions to provide exactly-once Kafka semantics.
- Defined XML schemas and POJOs to map the input request and send the response.
- Worked with Flink event, processing, and ingestion time, providing the timestamp extractor and watermarks and allowing late data.
- Configured the Flink-YARN session, setting JobManager, TaskManager, and slot resources to start jobs with the desired parallelism.
- Ran the HBase master, region servers, and Phoenix query servers for data storage.
- Deployed on a Windows Azure cluster running the Linux CentOS operating system.
- Used HDFS, ZooKeeper, and YARN to work with the cluster.
- Developed an object-oriented Java application as the main engine and Python scripts as Kafka producers.
- Used SQL queries to read and upsert data in Phoenix HBase tables.
- Stored encrypted data in the database and used certificates to connect via HTTPS and comply with ISO 8583.
- Used Drools to create the template and DRT files and to start the KieSession that matches all incoming transactions against the engine as a first step.
- Aggregated keyed elements in memory and worked with state using ValueState and ListState.
- Joined multiple streams with CoProcessFunction, matching the elements by key.
- Used Flink tumbling, sliding, and global windows, setting an alert trigger based on the current state of each keyed element (an event-time windowing sketch follows below).
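A minimal Flink sketch of the event-time, watermark, and keyed-window pattern described in the bullets above. The Txn case class, its fields, and the sample values are hypothetical, and the pre-1.12 Flink Scala DataStream API is assumed; the real source was the Kafka/NiFi XML feed rather than the in-memory collection used here.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

// Hypothetical transaction record
case class Txn(accountId: String, amount: Double, eventTime: Long)

object FlinkWindowSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Stand-in source so the sketch is self-contained
    val txns: DataStream[Txn] = env.fromElements(
      Txn("acct-1", 120.0, 1000L), Txn("acct-1", 80.0, 4000L), Txn("acct-2", 15.0, 2500L))

    // Timestamp extractor and watermarks tolerating 5 seconds of out-of-order data
    val withWatermarks = txns.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[Txn](Time.seconds(5)) {
        override def extractTimestamp(t: Txn): Long = t.eventTime
      })

    // Tumbling 30-second event-time window per account, allowing late data
    val totals = withWatermarks
      .keyBy(_.accountId)
      .timeWindow(Time.seconds(30))
      .allowedLateness(Time.seconds(10))
      .reduce((a, b) => a.copy(amount = a.amount + b.amount))

    totals.print()
    env.execute("flink-window-sketch")
  }
}
```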
- Architected the big data platform that created the foundation of this enterprise analytics initiative in a Hadoop-based data lake.
- Created a POC for loading data from an on-premises Linux ecosystem to Amazon S3, Redshift, DynamoDB, and HDFS, using MongoDB and Cassandra on Hortonworks Hadoop.
- Ensured HIPAA compliance and security of sensitive data using hashing, MD5, SQL encryption, and Kerberos.
- Implemented Spark using Scala, and utilized DataFrames and Spark SQL API for faster processing of data.
- Worked on AWS to create, manage EC2 instances, and Hadoop Clusters.
- Used secure VLANs to protect data transfer to the secure VPC on AWS.
- Followed ITIL best practices to ensure data and infrastructure integrity.
- Followed Six Sigma for process efficiency and quality performance.
- Created data models and implemented a Redshift instance on Amazon.
- Developed shell scripts to automate the daily data-flow tasks.
- Created both internal and external tables in Hive, and developed Pig scripts to preprocess the data for analysis.
- Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Used Spark, Python, Hive, and Pig to construct pipelines and queries.
- Used Cassandra and MongoDB to work with JSON files.
- Worked with Cassandra Query Language to bulk-load data and execute queries.
- Designed an appropriate partitioning/bucketing schema to allow faster data retrieval during analysis using Hive (see the partitioning sketch below).
- Performance-tuned Spark jobs by setting the batch interval time, the level of parallelism, and memory.
- Used Spark-Streaming APIs to perform the necessary transformations and actions on the real-time data using the Bedrock data management tool.
- Involved in migrating jobs to Spark, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
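A minimal sketch of the partitioning/bucketing layout referenced above, written and queried through Spark SQL with Spark-native bucketing. The database, table, and column names are hypothetical, and the tiny in-memory DataFrame stands in for the preprocessed staging data.

```scala
import org.apache.spark.sql.SparkSession

object HivePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partitioning-sketch")
      .enableHiveSupport()               // assumes Hive metastore connectivity
      .getOrCreate()
    import spark.implicits._

    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

    // Stand-in for the preprocessed data
    val txns = Seq(
      ("c-001", 120.50, "2020-01-15"),
      ("c-002",  33.10, "2020-01-15"),
      ("c-001",  18.75, "2020-01-16")
    ).toDF("customer_id", "amount", "load_date")

    // Partition by load date so queries prune whole directories;
    // bucket by customer_id to speed up joins and sampling.
    txns.write
      .partitionBy("load_date")
      .bucketBy(32, "customer_id")
      .sortBy("customer_id")
      .format("parquet")
      .mode("overwrite")
      .saveAsTable("analytics.transactions")

    // Only the load_date=2020-01-15 partition is scanned here.
    spark.sql(
      """SELECT customer_id, SUM(amount) AS total
        |FROM analytics.transactions
        |WHERE load_date = '2020-01-15'
        |GROUP BY customer_id""".stripMargin).show()

    spark.stop()
  }
}
```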
- Architected the big data platform that created the foundation of this enterprise analytics initiative in a Hadoop-based data lake on the Cloudera platform.
- Created a POC for loading data from the Linux file system to the Cloudera platform and HDFS.
- Developed shell scripts to automate the daily data-flow tasks.
- Created both internal and external tables in Hive, and developed Pig scripts to preprocess the data for analysis.
- Used Cassandra to work on JSON documented data.
- Worked with Cassandra Query Language to bulk-load data and execute queries.
- Involved in converting HiveQL/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Designed appropriate partitioning/bucketing schema to allow faster data retrieval during analysis using Hive.
- Performance-tuned Spark jobs by setting the batch interval time, the level of parallelism, and memory.
- Used Spark-Streaming APIs to perform the necessary transformations and actions on the real-time data using the Bedrock data management tool.
- Involved in migrating jobs to Spark, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Involved in converting HiveQL/SQL queries into Spark transformations.
- Involved in transforming relational-database legacy tables to HDFS and HBase tables using Sqoop, and vice versa.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL.
- Created HBase tables to store variable data formats of data coming from different portfolios.
- Implemented Spark using Scala, and utilized DataFrames and Spark SQL API for faster processing of data.
- Used Sqoop jobs to import data from RDBMS using incremental imports.
- Exported analyzed data to relational databases using Sqoop for visualization, and to generate reports for the BI team.
- Wrote shell scripts for exporting log files to Hadoop cluster through automated processes.
- Developed Scala scripts and UDFs using DataFrames and RDDs in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop (see the aggregation-and-export sketch below).
- Worked with various compression techniques such as LZO and Snappy to save data and optimize data transfer over the network.
- Involved in writing incremental imports into Hive tables.
- Worked on importing and exporting terabytes of data using Sqoop from HDFS to relational database systems and vice versa.
- Imported and exported data into HDFS using Sqoop.
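A minimal sketch of the Scala UDF plus DataFrame aggregation pattern referenced above. The JDBC writer here stands in for the Sqoop export step, and the connection details, table, and column names are all hypothetical.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, udf}

object AggregateAndExportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregate-and-export-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the ingested events
    val events = Seq(("web", 10L), ("WEB", 5L), ("store", 7L)).toDF("channel", "qty")

    // UDF to normalize free-form channel labels before aggregating
    val normalizeChannel = udf((c: String) => if (c == null) "unknown" else c.trim.toLowerCase)

    val totals = events
      .withColumn("channel", normalizeChannel(col("channel")))
      .groupBy("channel")
      .agg(sum("qty").as("total_qty"))

    // Write the aggregates back to the relational system over JDBC
    // (hypothetical host, database, and credentials)
    val props = new Properties()
    props.setProperty("user", "report_user")
    props.setProperty("password", "change_me")
    totals.write
      .mode("append")
      .jdbc("jdbc:mysql://oltp-host:3306/reports", "channel_totals", props)

    spark.stop()
  }
}
```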
- Worked on AWS to create, manage EC2 instances, and Hadoop Clusters.
- Created data models and implemented a Redshift instance on Amazon.
- Developed shell scripts to automate the daily data-flow tasks.
- Created both internal and external tables in Hive, and developed Pig scripts to preprocess the data for analysis.
- Transformed the log data into a data model using Pig and wrote UDFs to format the log data.
- Experienced in loading and transforming large sets of structured and semi-structured data through Sqoop and placing them in HDFS for further processing.
- Involved in transforming data from legacy tables to HDFS, and HBase tables using Sqoop.
- Extensively used transformations such as Router, Aggregator, Normalizer, Filter, Joiner, Expression, Source Qualifier, Unconnected and Connected Lookup, Update Strategy, Stored Procedure, and XML transformations, along with error handling and performance tuning.
- Designed appropriate partitioning/bucketing schema to allow faster data retrieval during analysis using Hive.
- Designed jobs using DB2 UDB, ODBC, .NET, Join, Merge, Lookup, Remove Duplicate, Copy, Filter, Funnel, Dataset, Lookup File Set, Change Data Capture, Modify, Row Merger, Aggregator, Peek, and Row Generator stages.
- Performance-tuned Spark jobs by setting the batch interval time, the level of parallelism, and memory (see the tuning sketch below).
- Involved in converting HiveQL/SQL queries into Spark transformations.
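A minimal sketch of the Spark tuning knobs referenced above (batch interval, level of parallelism, memory). The specific values are placeholders, not recommendations; the actual settings depended on the workload and cluster.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("spark-tuning-sketch")
      .set("spark.default.parallelism", "200")        // parallelism for RDD shuffles
      .set("spark.sql.shuffle.partitions", "200")     // parallelism for DataFrame shuffles
      .set("spark.executor.memory", "6g")             // executor heap size
      .set("spark.memory.fraction", "0.6")            // share of heap for execution + storage
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // Batch interval chosen so each micro-batch finishes well within the interval
    val ssc = new StreamingContext(conf, Seconds(10))

    // ... sources and transformations would be defined here ...
    // ssc.start(); ssc.awaitTermination()
  }
}
```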
- Migrated the big data architecture to the cloud on AWS, using AWS tools and database instances with Hadoop HDFS to create a data lake in the cloud.
- Used HBase to store the majority of the data, which needed to be divided by region.
- Involved in benchmarking Hadoop and Spark clusters with a TeraSort application on AWS.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Wrote Spark code to run a sorting application on the data stored on AWS.
- Deployed the application JAR files onto AWS instances.
- Used instance image files to create new instances with Hadoop installed and running.
- Developed a task execution framework on EC2 instances using SQS and DynamoDB.
- Developed Scala scripts and UDFs using DataFrames and RDDs in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
- Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies. Created data models and implemented a Redshift instance on Amazon.
- Designed appropriate partitioning/bucketing schema to allow faster data retrieval during analysis using Hive.
- Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
- Implemented data ingestion and cluster handling for real-time processing using Kafka (see the producer sketch at the end of this project section).
- Implemented workflows using the Apache Oozie framework to automate tasks. Used Spark-Streaming APIs to perform the necessary transformations and actions on the real-time data using the Bedrock data management tool.
- Used Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Extensively worked on performance optimization of Hive queries using map-side joins, parallel execution, and cost-based optimization.
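A minimal sketch of a Kafka producer for the ingestion work referenced above, using the standard Java client from Scala. Broker addresses, the topic name, and the sample record are hypothetical; idempotence and acks=all reflect the exactly-once-oriented setup described earlier.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object KafkaIngestSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.ACKS_CONFIG, "all")                 // wait for all in-sync replicas
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true")  // avoid duplicates on retries

    val producer = new KafkaProducer[String, String](props)
    try {
      // Key by record id so all events for the same entity land in the same partition
      val record = new ProducerRecord[String, String]("transactions", "txn-123", """{"amount": 42.0}""")
      producer.send(record).get()   // block for the ack in this simple sketch
    } finally {
      producer.close()
    }
  }
}
```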
- Developed shell scripts to automate the daily data-flow tasks.
- Created both internal and external tables in Hive, and developed Pig scripts to preprocess the data for analysis.
- Designed appropriate partitioning/bucketing schema to allow faster data retrieval during analysis using Hive.
- Performance-tuned Spark jobs by setting the batch interval time, the level of parallelism, and memory.
- Developed Spark scripts using Scala shell commands per requirements.
- Used Scala to store streaming data to HDFS and implemented Spark for faster processing of data.
- Used Spark-Streaming APIs to perform the necessary transformations and actions on the real-time data using the Bedrock data management tool (a DStream sketch follows below).
- Involved in migrating MapReduce jobs to Spark, using Spark SQL and the DataFrames API to load structured data into Spark clusters.
- Involved in converting HiveQL/SQL queries into Spark transformations.
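A minimal Spark-Streaming (DStream) sketch of windowed transformations and actions of the kind referenced above. The socket source and word-count logic are stand-ins for the real feed and data-management layer, and the host, port, and checkpoint path are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingTransformSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-transform-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Hypothetical text source standing in for the real-time feed
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations: split, map to pairs, aggregate over a sliding window
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    // Action: push each micro-batch downstream (printed here for the sketch)
    counts.foreachRDD(rdd => rdd.take(20).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```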
- Executed tasks for upgrading clusters on the staging platform before doing so on the production cluster.
- Performed maintenance, monitoring, deployments, and upgrades across infrastructure that supports all Hadoop clusters.
- Installed and configured various components of the Hadoop ecosystem.
- Optimized Hive analytics and SQL queries, created tables and views, wrote custom UDFs, and built Hive-based exception processing.
- Involved in transforming relational-database legacy tables to HDFS and HBase tables using Sqoop, and vice versa.
- Replaced the default Derby metadata storage system for Hive with MySQL.
- Set up the QA environment and updated configurations for running Pig scripts.
- Configured Fair Scheduler to allocate resources to all the applications across the cluster.
- Developed custom FTP adaptors to pull the clickstream data from FTP servers directly into HDFS using the HDFS FileSystem API (an HDFS write sketch follows below).
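A minimal sketch of the HDFS write step behind those FTP adaptors, using the Hadoop FileSystem API from Scala. The FTP client side is omitted, and the NameNode address and file paths are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsUploadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // fs.defaultFS is normally picked up from core-site.xml on the edge node
    conf.set("fs.defaultFS", "hdfs://namenode:8020")

    val fs = FileSystem.get(conf)
    try {
      val localClickstream = new Path("/data/ftp-staging/clickstream-2020-01-15.log")
      val hdfsTarget = new Path("/raw/clickstream/2020-01-15/clickstream.log")

      // Create parent directories, then copy the pulled file into HDFS
      fs.mkdirs(hdfsTarget.getParent)
      fs.copyFromLocalFile(false /* keep source */, true /* overwrite */, localClickstream, hdfsTarget)
    } finally {
      fs.close()
    }
  }
}
```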
- Transitioned Windows servers from the USA to México, taking operations from the customer's infrastructure to the current environment.
- Automated the inventory for more than one thousand Windows servers.
- Transitioned and transformed projects from the customer to T-Systems standards.